Al Bovik’s Lectures on
Digital Image Processing
Professor Alan C. Bovik, Director
Laboratory for Image and Video Engineering
©Alan C. Bovik
2019
1
QUICK INDEX
• Module 1 – Course Introduction, Imaging Geometry, Perception, Pixels, Perceptrons
• Module 2 – Basics: Binary and Grayscale Image Processing, Multilayer Perceptrons
• Module 3 – Fourier Transform, Image Frequencies, Sampling, RBFs and SVMs
• Module 4 – Linear Filtering, Denoising, Restoration, Wavelets, ConvNets/CNNs
• Module 5 – Image Denoising, Deep Learning, Transfer Learning, ResNet, Autoencoders
• Module 6 – Image Compression, JPEG, Deep Compression
• Module 7 – Image Analysis I: Image Quality, Edge/Shape Detection
• Module 8 – Image Analysis II: Superpixels, Search, SIFT, Face Detection
• Module 9 – Image Analysis III: Cortical Models, Pattern Analysis, Stereopsis, Deep Stereo
• Module 10 – Neural Networks for Image Processing
2
Some Notes
• These Lecture Notes are the basis for the course Digital Image Processing that I
have taught at The University of Texas at Austin since 1991.
• I modified them significantly in 2019, to capture the Deep Learning revolution
which has deeply affected image processing. This is still a process!
• They are not quite the same as the classroom notes, since they are missing the
hundreds of live demos running digital image processing algorithms. I hope you
find them useful anyway.
• They are also missing dozens of visual illusions, of which I am a collector, since I
am uncertain of their copyright status.
• If you use the notes, cite them as A.C. Bovik, Al Bovik’s Lecture Notes on Digital
Image Processing, The University of Texas at Austin, 2017.
• Enjoy! Nothing is as fun as Digital Image Processing! Well, except Digital Video.
3
Module 1
Introduction





Introduction
Imaging Geometry
Visual Perception
Image Representation
Perceptrons
QUICK INDEX
4
Course Objectives
 Learn Digital Image & Video Processing
- Theory
- Algorithms and Programming
- Applications and Projects
 To have fun doing it
5
The Textbooks
 The Essential Guides to Image and Video
Processing, Al Bovik, Academic Press, 2009.
 Many chapters match the class notes
 Full of illustrations and application examples
 Many advanced chapters for projects/research
(click for appropriate cartoon)
6
SIVA Demonstration Gallery
• This course is multimedia with hundreds of slides and
dozens of live image/video processing demos.
• Demos are from SIVA – The Signal, Image and Video Audiovisual demonstration
gallery. SIVA is a collection of didactic tools that facilitate a gentle introduction
to signal and image processing.
• Visit them at:
http://live.ece.utexas.edu/class/siva/default.htm
7
What is this Image?
View from the Window at Le Gras
(Camera obscura; bitumen of Judea on pewter;
Currently on display at the Harry Ransom Center The University of Texas at Austin)
8
Joseph Nicéphore Niépce
Saint-Loup-de-Varennes (France), 1826
9
Joseph Nicéphore Niépce
1765-1833
French Inventor
Also invented the Pyréolophore, the
first internal combustion engine!
10
Go see it at the Harry Ransom Center
(First Floor, just as you walk in to the Right!!)
11
12
Louis-Jacques-Mandé Daguerre
1787-1851
French Inventor
Inventor of the daguerreotype - the
first commercially successful photographic
process. A daguerreotype is a direct
positive on a silvered copper plate.
Also an accomplished painter and
developer of the diorama theatre!
13
14
Optical Imaging Geometry
• Assume reflection imaging with visible light.
• Let’s quantify the geometric relationship
between 3-D world coordinates and projected
2-D image coordinates.
point light source
Sensing
plate,
CCD
array,
emulsion,
etc.
emitted rays
image
object
lens
focal length
Reflected rays
15
3D-to-2D Projection
• Image projection is a reduction of dimension
(3D-to-2D): 3-D info is lost. Getting this info
back is very hard.
" f ie ld - o f - v i e w "
• It is a topic
of many years of
intensive research:
“Computer Vision”
le n s c e n te r
2 - D im a g e
16
“The image is not the object”
Rene Magritte (1898-1967)
17
Perspective Projection
• There is a geometric relationship between
3-D space coordinates and 2-D image
coordinates under perspective projection.
• We will require some coordinate systems:
18
Projective Coordinate Systems
Real-World Coordinates
• (X, Y, Z) denote points in 3-D space
• The origin (X, Y, Z) = (0, 0, 0) is the lens center
Image Coordinates
• (x, y) denote points in the 2-D image
• The x - y plane is chosen parallel to the X - Y plane
• The optical axis passes through both origins
19
Pinhole Projection Geometry
• The lens is modeled as a pinhole through which
all light rays hitting the image plane pass.
• The image plane is one focal length f from the
lens. This is where the camera is in focus.
• The image is recorded at the image plane, using a
photographic emulsion, CCD sensor, etc.
20
Pinhole camera or camera obscura
principle for recording or drawing
Concept attributed to
Leonardo da Vinci
20 minute exposure with
modern camera obscura
21
17th century camera obscura in use
22
Pinhole Projection Geometry
[Figure: idealized "pinhole" camera model, with world axes X, Y, Z, the lens center at (X, Y, Z) = (0, 0, 0), focal length f, and the image plane behind the lens.]
Problem: In this model (and in reality), the image is reversed
and upside down. It is convenient to change the model to correct this.
23
Upright Projection Geometry
[Figure: upright projection model (not to scale), with image-plane coordinates (x, y), world axes X, Y, Z, the lens center at (X, Y, Z) = (0, 0, 0), and focal length f.]
• Let us make our model more mathematical…
24
[Figure: a 3-D point (X, Y, Z) = (A, B, C) and its projection (x, y) = (a, b) on the image plane, with the lens center at (0, 0, 0) and focal length f.]
• All of the relevant coordinate axes and labels …
25
[Figure: simplified side view showing only the distances A, B, C, a, b, and f.]
• This equivalent simplified diagram shows only
the relevant data relating (X, Y, Z) = (A, B, C) to
its projection (x, y) = (a, b).
26
Similar Triangles
• Triangles are similar if their corresponding angles are equal:
[Figure: two similar triangles.]
27
Similar Triangles Theorem
• Similar triangles have their side lengths in the same proportions:
[Figure: two similar triangles with side lengths D, E, F and d, e, f.]
D/d = E/e = F/f   (equivalently D/E = d/e, E/F = e/f, etc.)
28
Solving Perspective Projection
• Similar triangles solves the relationship between
3-D space and 2-D image coordinates.
• Redraw the geometry once more, this time
making apparent two pairs of similar triangles:
29
[Figure: the two pairs of similar triangles relating (A, C) to (a, f) and (B, C) to (b, f).]
• By the Similar Triangles Theorem, we conclude that
a/f = A/C   and   b/f = B/C
• OR: (a, b) = (f/C) · (A, B) = (fA/C, fB/C)
30
Perspective Projection Equation
• The relationship between a 3-D point (X, Y, Z) and its 2-D image (x, y):
(x, y) = (f/Z) · (X, Y)
where f = focal length.
• The ratio f/Z is the magnification factor, which varies with the range Z from the lens center to the object plane.
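As a quick illustration (not part of the original notes), the projection equation maps arrays of 3-D points directly; the function name and array layout below are assumptions made for this sketch:

```python
import numpy as np

def project_points(points_xyz, f):
    """Pinhole projection of the slide's equation: (x, y) = (f / Z) * (X, Y).

    points_xyz: (N, 3) array of world points (X, Y, Z) with Z > 0 (in front of the lens).
    f: focal length, in the same units as X, Y, Z. Returns an (N, 2) array of (x, y).
    """
    P = np.asarray(points_xyz, dtype=float)
    X, Y, Z = P[:, 0], P[:, 1], P[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

# Doubling the range Z halves the projected size, since the magnification is f/Z.
print(project_points([[1.0, 2.0, 10.0], [1.0, 2.0, 20.0]], f=0.05))
```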
31
Straight Lines Under
Perspective Projection
• Why do straight lines (or line segments) in
3-D project to straight lines in 2-D images?
• Not true of lenses (e.g., "fish-eye") that do not obey the pinhole approximation.
[Figure: a 3-D line and its projected 2-D line on the image plane.]
To show this to be true, one could write the equation for a line in 3-D, and then project it to the equation of a 2-D line…
32
• Easier way:
• Any line touching the lens center and the 3-D line
are in the same plane (a point and a line define a
plane).
• The intersection of this plane with the image plane
gives the projection of the line.
• The intersection of two (nonparallel) planes is a line.
• So, the projection of a 3-D line is a 2-D line.
[Figure: the 3-D line and its 2-D projection.]
• In image analysis (later), this property makes finding straight lines much easier!
• This property of lenses also makes it easier for us to navigate (Click for an example!).
33
Cameras are Now Computers
34
Steve Sasson and the First
Digital Camera
Fascinating history of the
Digital Camera HERE
35
A/D Conversion
• Sampling and quantization.
• Sampling is the process of creating a signal
that is defined only at discrete points, from
one that is continuously defined.
• Quantization is the process of converting
each sample into a finite digital
representation.
36
Sampling
• Example: An analog video raster converted
from a continuous voltage waveform into
a sequence of voltage samples:
continuous electrical signal from one scanline
sampled electrical signal from one scanline
37
Sampled Image
• A sampled image is an array of numbers (row,
column) representing image intensities
columns
rows
depiction of 10 x 10 image array
• Each of these picture elements is called a
pixel.
38
Sampled Image
• The image array is rectangular (N x M), often with dimensions N = 2^P and M = 2^Q (why?)
• Examples: square images
  • P = Q = 7:   128 x 128    (2^14 ≈ 16,000 pixels)
  • P = Q = 8:   256 x 256    (2^16 ≈ 65,500 pixels)
  • P = Q = 9:   512 x 512    (2^18 ≈ 262,000 pixels)
  • P = Q = 10:  1024 x 1024  (2^20 ≈ 1,000,000 pixels)
• A common HD frame: 1920 x 1080 (= 2,073,600 pixels)
39
Sampling Effects
• It is essential that the image be sampled
sufficiently densely; else the image quality
will be severely degraded.
• This can be expressed via the Sampling Theorem, but the visual effects are most important (DEMO)
• With sufficient samples, the image appears
continuous…..
40
Sampling in Art
Seurat - La Grande Jatte – Pointillist work took 2 years to create
41
42
43
44
Quantization
• Each gray level is quantized: assigned an
integer indexed from 0 to K-1.
• Typically K = 2^B possible gray levels.
• Each pixel is represented by B bits, where
usually 1 ≤ B ≤ 8.
a pixel
8-bit representation
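A minimal sketch (not from the notes) of uniform quantization to K = 2^B gray levels; the function name and the assumption that input intensities lie in [0, 1] are illustrative:

```python
import numpy as np

def quantize(image, B=8):
    """Uniformly quantize intensities in [0, 1] to K = 2**B integer gray levels 0..K-1."""
    K = 2 ** B
    levels = np.clip(np.floor(np.asarray(image, dtype=float) * K), 0, K - 1)
    return levels.astype(np.uint8 if B <= 8 else np.uint16)

# A smooth ramp quantized to B = 3 bits has only 8 distinct levels.
print(np.unique(quantize(np.linspace(0.0, 1.0, 256), B=3)))
```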
45
Quantization
• The pixel intensities or gray levels must be
quantized sufficiently densely so that
excessive information is not lost.
• This is hard to express mathematically, but
again, quantization effects are visually
obvious (DEMO)
46
Image as a Set of Bit Planes
Bit Plane 1
Bit Plane 2
Bit Plane B
47
The Image/Video Data Explosion
• Total storage for 1 digital image with 2^P x 2^Q pixels spatial resolution and B bits/pixel gray-level resolution is B x 2^(P+Q) bits.
• Usually B = 8 and often P = Q = 10, so a common image size is 1 megabyte.
• Ten years ago this was a lot. These days
digital cameras produce much larger
images.
48
The Image/Video Data Explosion
• Storing 1 second of a 512x512 8-bit gray-level movie (TV rate = 30 images/sec) requires about 8 Mbytes.
• A 2-hour color theatre-quality raw 4K digital
video: (3 bytes/color pixel) x (4096x2160
pixels/frame) x (60 frames/sec) x (3600 sec/hour)
x (2 hours) requires 11.5 terabytes of storage.
That's a lot today.
• Later, we will discuss ways to compress digital
images and videos.
49
A Bit About Visual Perception
• In most cases, the intended receiver of the
result of image/video processing or
communications algorithms is the human eye.
• A fair amount is known about the eye. It is
definitely a digital computation device:
- the neurons (rods, cones) sample and quantize
- the retinal ganglion and cortical cells linearly filter
50
The Eye - Structure
[Figure: densities of cones, rods, and ganglion cells (cells per degree) vs. retinal eccentricity (degrees); peak foveal density is roughly 178,000-238,000 cones/mm^2 over a region about 1.5 mm across.]
• Notice that image sampling at the retina is highly nonuniform!
51
An example of
“foveated” art
Madame Henriot
Pierre-Auguste Renoir
52
Eye Movement
 The eyes move constantly, to place/keep the fovea
on places of interest.
 There are five major types of eye movement:
- saccadic (attentional)
- pursuit (smooth tracking)
- vestibular (head movement compensating)
- microsaccadic (tiny; image persistency)
- vergence (stereoscopic)
To demonstrate microsaccades, first fixate the center of the white dot for 10 sec, then fixate the small black dot. Small displacements of the afterimage are then obvious -- the slow drifting movements as well as the corrective microsaccades.
53
Saccades and Fixations
Highly contextual
Eyes tracked using the “scleral coil” …
Less contextual
54
Visual Attention
 Eye movements are largely about visual attention.
 Attention is where conscious thought is directed
 Usually (not always) towards the point of visual
fixation
 Related to the task the individual is engaged in
 Not easy to focus attention while engaged in
complex tasks … try this: Attention Video.
 What about this poor fellow … Door Video.
55
Visual Eyetracking
 Inexpensive ET’s are becoming fast and
accurate. We have several.
 Soon could be packaged with monitors,
video communications devices, etc.
 Basic technology: IR radiation reflected
from highly reflective retina and cornea.
56
Visual Eyetrackers
• Typical resolution: <0.5° (visual angle) at
60 Hz.
• Ergonomically friendly. Requires 30 sec
calibration - some new systems forego this.
57
Visual Eyetracking
Desktop model
Tracked eye
Wearable model
58
Dual Purkinje Eyetracker
• Dual Purkinje: Accurate to one minute of
arc at 400 Hz - higher cost, less convenience.
• Measures positional difference between the
1st Purkinje reflection (front of cornea) and
the 4th Purkinje reflection (rear of the crystalline lens).
59
Dual Purkinje Eyetracker
SRI Generation V Dual Purkinje Eyetracker
60
Precision Recorded Eye
Movements
61
Visual Limits
• When designing image/video algorithms, it
is good to know the limitations of the
visual system.
- spatial and temporal bandwidths
- resolving power
- color perception
- visual illusions – the eye is easily fooled!
62
Contrast Sensitivity Function
• Back in the 1960’s Campbell and Robson1
conducted psychophysical studies to
determine the human frequency response.
• Known as contrast sensitivity function
1 F.W. Campbell and J.G. Robson, "Application of Fourier analysis to the visibility of gratings," Journal of Physiology, 1968. HERE
63
Michelson Contrast
• Given any small image patch, the Michelson Contrast of that patch is
C = (Lmax - Lmin) / (Lmax + Lmin)
• Lmax and Lmin are the max and min luminances (brightnesses) over the patch
64
Spatial Sine Wave Gratings
• Sine wave grating (0 < C < 1):  C sin(Ux + Vy) + 1,  Contrast = C
• (U, V) = spatial frequency in (x, y) directions
• Orientation = tan^-1(V/U)
• Radial (propagating) frequency = sqrt(U^2 + V^2)
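As a hedged sketch (not from the notes), here is one way to synthesize a sine-wave grating and check its Michelson contrast; the 2π/N scaling, which makes U and V integer cycles/image, is an assumption:

```python
import numpy as np

def sine_grating(N, U, V, C=0.5):
    """Grating C*sin(2*pi*(U*x + V*y)/N) + 1 with contrast C, on an N x N grid."""
    y, x = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return C * np.sin(2 * np.pi * (U * x + V * y) / N) + 1.0

def michelson_contrast(patch):
    """Michelson contrast (Lmax - Lmin) / (Lmax + Lmin) of an image patch."""
    Lmax, Lmin = patch.max(), patch.min()
    return (Lmax - Lmin) / (Lmax + Lmin)

g = sine_grating(256, U=8, V=0, C=0.5)
print(michelson_contrast(g))   # prints 0.5
```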
65
Campbell & Robson Experiments
• Campbell & Robson showed human subjects sine
wave gratings of different frequencies and
contrasts and recorded their visibility.
• They argued that the human visual system does
Fourier analysis on retinal images
66
Viewing Angle
• Contrast sensitivity was (and is) recorded
as a function of viewing angle.
[Figure: viewing geometry relating the image plane, the lens center, and the foveal region of the retina through the viewing angle.]
67
A Campbell-Robson Grating
• Contrast increases downward. Frequency
increases rightward. Helps visualize loss of
visibility as function of frequency/contrast.
68
Compare the C-R Grating with the CSF
[Plot: normalized sensitivity vs. spatial frequency (cycles/degree) on logarithmic axes, shown alongside the contrast/spatial-frequency layout of the C-R grating.]
69
Contrast Sensitivity Function
• The typical human contrast sensitivity
function (CSF) is band-pass
• Peak response around 4 cycles per degree
(cy/deg), dropping off either side.
70
Why is the CSF Important for
Image Processing?
• Firstly, because of how we display pixels on today's digital displays - individual pixels aren't distinguishable (at a distance).
• Secondly, it enables considerable image
compression.
71
Image Sampling and the CSF
• Digital images are spatially sampled
continuous light fields.
• How many samples are needed depends on
the sampling theorem, on the CSF, and on
viewing distance.
• With sufficient samples, and at adequate distance, the image appears continuous.
72
What About Color?
• Color is an important aspect of images.
• A color image is a vector-valued signal. At each pixel,
the image has three values: Red, Green, and Blue.
• Usually expressed as three images: the Red, Green and
Blue images: RGB representation.
• Although color is important, in this class we will
usually just process the intensity image I = R + G + B.
• Many color algorithms process R, G, B components
separately like gray-scale images then add the results.
73
Color is Important!
• Any color may be represented as a mixture of Red (R), Green
(G) and Blue (B). RGB codes color video as three separate
signals: R, G, and B.
• This is the representation captured by most color optical
sensors.
The Boating Party - Renoir
• Color is important although perhaps not necessary for
survival.
74
Color Sensing
Raw RGB
• The sensor itself is not specific to color.
• Instead a color filter array (CFA) is superimposed
over the sensor array.
• The most common is the Bayer array.
• Twice as many Green-tuned filters as Red-tuned or
Blue-tuned – green constitutes more of the real-world
visible spectrum, and the eyes are more sensitive to
green wavelengths.
75
Color Sensing: Bayer CFA
• Each sensor pixel is thus Red or Green or Blue, but
there are no RGB pixels (yet).
• These are often distinguished as “Raw RGB” vs
RGB.
76
Demosaicking
(Color Interpolation)
• What is desired is a Red and Green and Blue
value at every pixel of an RGB image.
• This is generally done by interpolating the known R, G, and B values to fill the places where they are unknown.
• A plethora of methods exist, and no standard, but
usually as simple as replicating or averaging the
nearest relevant values.
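A very rough sketch of neighbor-averaging demosaicking (an assumption-laden illustration, not a production method or the notes' algorithm); the "RGGB" layout and the smoothing kernel are choices made here:

```python
import numpy as np
from scipy.ndimage import convolve

def simple_demosaic(raw):
    """Fill in R, G, B at every pixel of an RGGB Bayer mosaic by normalized averaging.

    raw[0,0]=R, raw[0,1]=G, raw[1,0]=G, raw[1,1]=B, repeating over the array."""
    H, W = raw.shape
    masks = np.zeros((3, H, W))
    masks[0, 0::2, 0::2] = 1                                   # R sample locations
    masks[1, 0::2, 1::2] = 1; masks[1, 1::2, 0::2] = 1         # G sample locations
    masks[2, 1::2, 1::2] = 1                                   # B sample locations
    kernel = np.array([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]])
    out = np.zeros((H, W, 3))
    for c in range(3):
        out[..., c] = (convolve(raw * masks[c], kernel)
                       / np.maximum(convolve(masks[c], kernel), 1e-9))
    return out
```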
77
Simple Demosaicking
• Simply average neighboring known values to "fill in."
[Figure: interpolating the Green, Blue, and Red planes.]
78
RGB Image
Result: RGB Color Planes
RGB Color Space
• A very large number of colors can be represented.
79
[Figure: an RGB color image, its R, G, and B components, and the Intensity image.]
80
Why RGB?
• Can represent any color with RGB. There is the usual notion of "primaries" (like painting, but the "primaries" are different).
• However, can create color spaces using other
“wavelength primaries” or without using colors at all.
81
Tristimulus Theory
• The cones in the center of the retina (fovea) of the
human eye can be classed according to their
wavelength bandwidths.
M cones
L cones
S cones
• Roughly Red (L = long), Green (M = medium) and
Blue (S = short) sensitive cone cells.
82
Tristimulus Theory
• Given the known cone sensitivities it was long thought that
the eye-brain system separately processed Red, Green and
Blue channels.
• However, RGB is not bandwidth efficient (each color uses
the same BW).
• The three RGB channels contain highly redundant
information. The brain (and modern video processing systems)
exploit these redundancies in many ways.
• One way is through “color opponency”.
83
Photopic and Scotopic Vision
Photopic (normal daylight)
wavelength sensitivity
Scotopic (normal nighttime)
wavelength sensitivity
• However, the rods do not distinguish wavelengths into color
channels
84
Color Opponent Theory
• The visual system processes luminance (brightness) and color
separately. Practical opponent color space model approximates
this. One simple one is:
Y = luminance = aR + bG + cB    (a + b + c = 1)
U = K1[Blue (B) - luminance (Y)] = K1(B - Y)
V = K2[Red (R) - luminance (Y)] = K2(R - Y)
• This exploits redundancies since the difference signals will
have much smaller entropies (cluster more tightly around the
origin) and hence are more compressible.
85
Analog Color Video
• YUV is an older analog color space. It defines brightness in terms of wavelength
sensitivity. Here is a common definition:
Luminance:
Y = 0.299R + 0.587G + 0.114B
Chrominance:
U = -0.147R - 0.289G + 0.436B ≈ 0.492(B-Y)
V = 0.615R - 0.515G - 0.1B ≈ 0.877(R-Y)
• Ideas:
– Structural info is largely carried by luminance.
– Cones have highest sensitivity to G, then R, then B wavelengths - for
evolutionary reasons.
– Redundancy between luminance and chrominance is exploited by differencing
them.
– Notice that U = V = 0 when R = G = B.
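A small sketch (not the notes' code) applying the analog YUV coefficients above to an RGB array; the matrix layout is just the three equations stacked:

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Convert an (..., 3) RGB array (values in [0, 1]) to YUV with the coefficients above."""
    M = np.array([[ 0.299,  0.587,  0.114],
                  [-0.147, -0.289,  0.436],
                  [ 0.615, -0.515, -0.100]])
    return np.asarray(rgb) @ M.T

# A gray pixel (R = G = B) has U = V = 0, as noted above.
print(rgb_to_yuv(np.array([0.5, 0.5, 0.5])))
```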
86
RGB → YCrCb Examples
RGB
Y
U
V
87
Digital Color Images
• YCrCb is the modern color space used for digital images and videos.
• Similar but simpler definition:
Y = 0.299R + 0.587G + 0.114B
Cr = R - Y
Cb = B - Y
• Used in modern image and video codecs like JPEG and H.264.
• Often the terms "YUV" and "YCrCb" are used interchangeably.
• Why use YUV / YCrCb? Reduced bandwidth. Chrominance info can be sent in a fraction of the bandwidth of luminance info.
• In addition to color information being "lower bandwidth," the chrominance components are entropy reduced.
88
Color Constancy
• A property of visual perception that ensures that the
perceived color of objects remains relatively constant
under varying illumination conditions.
• For example, an apple looks red (or green) both in the white light of mid-day and in the redder light of evening.
• This helps with the recognition of objects.
89
Color Display
XO-1 = “$100 Laptop”
(One Laptop Per Child)
Basic idea: each approximately square,
rectangular, or ovoid pixel is composed of
three neighboring R, G, and B “sub-pixels.”
90
Visual Illusions
• Visual illusions are excellent probes into how we see.
• They reveal much about the eye, how vision adapts, and
finding where it “goes wrong” often confirms or explains
models of vision.
• They are also great reminders that "what we see is not reality."
• We will be seeing visual illusions throughout the course.
91
Color Illusions
92
Remember the Internet
Sensation?
Is the middle dress gold and
white, or is it blue and black?
93
Here’s the
same dress
under other
lighting
conditions
Happens because of ‘color
constancy’ – the vision system
tries to see a color the same way
under different lighting.
94
Digital Image Representation
• Once an image is digitized it is an array of
voltages or magnetic potentials.
• Algorithms access a representation that is
a matrix of numbers – usually integers, but
possibly float or complex.
95
Image Notation
• Denote an image matrix
I = [I(i, j); 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1 ]
where
(i, j) = (row, column)
I(i, j) = image value at (i, j)
      | I(0, 0)     I(0, 1)    · · ·  I(0, M-1)   |
  I = | I(1, 0)     I(1, 1)    · · ·  I(1, M-1)   |
      |    ·           ·       · · ·      ·       |
      | I(N-1, 0)   I(N-1, 1)  · · ·  I(N-1, M-1) |
96
Common Image Formats
• JPEG (Joint Photographic Experts Group) images are compressed with loss – see Module 6. All digital cameras today have the option to save images in JPEG format. File extension: image.jpg
• TIFF (Tagged Image File Format) images can be lossless (LZW
compressed) or compressed with loss. Widely used in the printing
industry and supported by many image processing programs. File
extension: image.tif
• GIF (Graphic Interchange Format) is an old but still-common format, limited to 256 colors. Uses lossless (LZW) compression, though the 256-color palette reduction can lose information. File extension: image.gif
• PNG (Portable Network Graphics) is the successor to GIF. Supports
true color (16 million colors). Somewhat new - not yet widely
supported. File extension: image.png
• BMP (bit mapped) format is used internally by Microsoft Windows. Not compressed. Widely accepted. File extension: image.bmp
97
Perceptrons
98
The Perceptron
• The first “neural network”
• It is a binary classifier (outputs ‘0’ or ‘1’) –
i.e., it makes decisions.
• Decisions are computed in two steps:
– A linear combination of input values
– A nonlinear thresholding or ‘activation’ function
99
Perceptron: Linear Step
• Given an image
I = {I(i, j); 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1}
• Define any subset i ⊂ I which we will vectorize:
i = {i(p); 1 ≤ p ≤ P}
• This could be the entire image, or a block, or a segmented region, for example.
100
Perceptron: Linear Step
• Define a set of weights
w = {w(p); 1 ≤ p ≤ P}
then the linear step is simply (inner product)
w^T i = Σ_{p=1}^{P} w(p) i(p)
• We could have done this with 2D weights W:
W^T I = Σ_{m=0}^{M-1} Σ_{p=0}^{P-1} W(m, p) I(m, p)
but the 1D version takes less space, and the original ordering is implicit in the 1D vector.
101
Perceptron: Nonlinear Step
• Given a bias value b, then the nonlinear binary thresholding function
f(i) = sign(w^T i - b) = { +1 if w^T i ≥ b ;  -1 if w^T i < b }
is an example of an activation function. The activation function sign(·) is called the signum function.
• A model of a single neuron which is only activated by a strong enough input.
102
Perceptron: Training
• Need a set of Q training sample images with labels or targets:
T = {(i_1, t_1), (i_2, t_2), ..., (i_Q, t_Q)}
• The training (sub) images i_q are labeled with correct decisions/labels t_q.
• Given weights w there is no training error if and only if
t_q (w^T i_q - b) > 0
for all q = 1, 2, …, Q.
103
Perceptron: Optimization
• Goal: Find weights that minimize the loss function
L(w, b) = - Σ_{q=1}^{Q} t_q (w^T i_q - b)
• It turns out that this can be zero only if the training set is linearly separable:
[Figure: two point classes {d_q = 0} and {d_q = 1}: one arrangement is linearly separable (by a line or hyperplane), the other is linearly inseparable.]
104
Gradient of Cost
• The gradient is the vector of partial derivatives
∇L(w, b) = [ ∂L/∂w_1, ∂L/∂w_2, ..., ∂L/∂w_P ]^T
• We will also use ∂L(w, b)/∂b.
• Basic idea: optimize the perceptron by
iteratively following the gradient down to a
(possible) minimum.
105
Gradient Descent
• The optimal weights w* must satisfy ∇L(w, b) = 0, and the Hessian matrix
H = ∇²L(w, b) = [ ∂²L(w, b) / ∂w_i ∂w_j ;  1 ≤ i, j ≤ P ]
at w* must be positive definite:
d^T H(w*, b) d > 0   for any vector d ∈ R^P
106
Gradient Descent
• Gradient descent: a simple algorithm that iteratively moves the current solution down the gradient:
w^(n+1) = w^(n) - γ ∇L(w^(n), b)
b^(n+1) = b^(n) - γ ∂L(w^(n), b)/∂b
or (exercise!)
w^(n+1) = w^(n) + γ t_q i_q
b^(n+1) = b^(n) - γ t_q,    q = 1, 2, …, Q
given an initial guess (say w^(0) = 0 or random).
• Here γ > 0 is the learning rate. If too big it might not converge; if too small, it might converge very slowly.
107
Problems with Perceptron
• First, there is no lower bound to the loss function! This makes numerical solution hard.
• Rosenblatt's Solution: Only iterate on those training images whose solution condition is violated at each step:
w^(n+1) = w^(n) + γ t_q' i_q'
b^(n+1) = b^(n) - γ t_q'
for those q' = 1, 2, …, Q such that t_q' (w^T i_q' - b) ≤ 0.
• This does converge, but only if the training set is separable. Which is rare, especially on hard problems!
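A hedged sketch of Rosenblatt's update rule above (the function name, learning rate, and stopping rule are illustrative assumptions, not the notes' implementation):

```python
import numpy as np

def train_perceptron(X, t, gamma=0.1, epochs=100, seed=0):
    """Perceptron training: only samples violating t_q (w^T i_q - b) > 0 update w and b.

    X: (Q, P) array of vectorized training images i_q; t: (Q,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    Q, P = X.shape
    w, b = np.zeros(P), 0.0
    for _ in range(epochs):
        for q in rng.permutation(Q):              # random order each epoch
            if t[q] * (w @ X[q] - b) <= 0:        # misclassified (or on the boundary)
                w += gamma * t[q] * X[q]
                b -= gamma * t[q]
    return w, b                                    # decision: sign(w @ i - b)
```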
108
Can Converge Anywhere
in “the Margin”
• The solution converged to is not necessarily unique!
• It can lie on any line (hyperplane) separating the linearly
separable classes.
109
Training Iteration
• Iterate by repeatedly applying every image in the training set (one "epoch"):
f^(n)(i_q) = sign( w^(n)T i_q - b^(n) )
• Update the weights w on each training sample.
• Standard Gradient Descent: Iterate through i1, i2, …, iQ in
indexed order each epoch.
• Stochastic Gradient Descent: Randomly re-order the training
set {iq} after every epoch.
• Iterate a fixed number of epochs, or alternately until an error ||f^(n)(i_q) - t_q|| is small enough (for some norm such as MSE).
110
Perceptron Diagram
• Given a (vectorized) image or image piece:
i = {i(p); 1 < p < P}
a bias b, and a trained set of weights
w = {w(p); 1 < p < P}
[Figure: perceptron diagram. The inputs i(1), ..., i(P) (passive input-layer nodes) are weighted by w(1), ..., w(P), summed together with the bias -b, and passed through the activation function to produce the output y (output layer).]
111
Perceptron Diagram
• Convenient to redraw this way (notice the modified unit bias)
[Figure: redrawn perceptron. The input layer i(1), ..., i(P) now includes a constant (unit) bias input; the output layer is a sum followed by the activation function. This is very common notation for general neural networks.]
• The –b can be absorbed into the other weights by
normalization (more later).
• Sometimes the bias is drawn as another (unit) input.
112
Simplified Perceptron Diagram
• If the activation function is understood (and we will be
considering other ones!), can just draw this:
[Figure: simplified perceptron diagram. The input layer i(1), ..., i(P) includes a constant bias; the output layer is a sum (including the bias) and activation function, producing output y. This is very common notation for vanilla neural networks.]
113
Image Data
• Let’s be clear: perceptrons were not used for image
processing!
• Certainly no feeding of pixels into perceptrons to do
anything useful.
• There weren’t even many digital images.
• Also, the computation would have been far beyond
formidable back then.
• HOWEVER, at some point we will be feeding pixels into
learners, so it is good to cast things that way.
114
Promise of Perceptrons?
In 1958 pioneer Frank Rosenblatt said: "the perceptron is the embryo of an electronic computer that will …. be able to walk, talk, see, write, reproduce itself, and be conscious of its existence."
• There were/are significant
problems with perceptrons.
• But today, it is starting to
look like he was right!
115
Failure of Perceptrons
• In the 1960s it was found that perceptrons did not have much
ability to represent even simple functions or data patterns.
• Marvin Minsky and Seymour Papert showed it could not even
represent the XOR function.
• Never really used for image analysis (no “digital” images back
then, far too complex).
• Interest languished for a decade or more; thus ended the "first wave" of neural networks.
116
Comments on Perceptrons
• We will discuss concepts like validation sets and test sets as we
study more complex networks.
• Perceptrons were the first neural networks. They were the
first feed-forward neural networks (the only kind we’ll study
here).
• With modifications, they are the basis of today's modern deep convolutional networks ("ConvNets" or CNNs).
• Things changed when researchers began to layer the networks.
117
Comments
• With this broad overview of topics related
to image processing in hand, and a start on
neural networks, we can proceed …
onward to Module 2…..
118
Module 2
The Basics: Binary Images &
Point Operations





Binary Images
Binary Morphology
Point Operations
Histogram Equalization
Multilayer Perceptrons
QUICK INDEX
119
Binary Images
• A digital image is an array of numbers:
sampled image intensities:
A 10 x 10 image
• Each gray level is
quantized: assigned
one of a finite set of
numbers [0,…,K-1].
• K = 2^B possible gray levels: each represented by B bits.
columns
rows
• Binary images
have B = 1.
120
Binary Images
A 10 x 10 binary image
• In binary images the (logical) values '0' and '1' often
indicate the absence/presence of an image property in
an associated gray-level image:
- High vs. low intensity (brightness)
- Presence vs. absence of an object
- Presence vs. absence of a property
Example: Presence
of fingerprint ridges
121
Gray-level Thresholding
• Often gray-level images are converted to
binary images.
• Advantages:
- B-fold reduction in required storage
- Simple abstraction of information
- Fast processing - logical operators
122
Binary Image Information
• Artists have long understood that binary
images contain much information: form,
structure, shape, etc.
“Don Quixote” by Pablo Picasso
123
Simple Thresholding
• The simplest image processing operation.
• An extreme form of quantization.
• Requires the definition of a threshold T
(that falls in the gray-scale range).
• Every pixel intensity is compared to T, and
a binary decision rendered.
124
Simple Thresholding
• Suppose gray-level image I has
K gray-levels: 0, 1,...., K-1
• Select a threshold 0 < T < K-1.
• Compare every gray-level in I to T.
• Define a new binary image J as follows:
J(i, j) = '0' if I(i, j) ≥ T
J(i, j) = '1' if I(i, j) < T
[Figure: the image I is compared against the threshold T to produce the binary image J.]
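A one-line sketch of the rule above (assuming I is a NumPy array and using the slide's convention that '1' marks dark pixels):

```python
import numpy as np

def simple_threshold(I, T):
    """J = '1' where I < T (dark pixels), '0' where I >= T."""
    return (np.asarray(I) < T).astype(np.uint8)
```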
125
Threshold Selection
• The quality of the binary image J from
thresholding I depends heavily on the threshold T.
• Different thresholds may give different valuable
abstractions of the image – or not!
• How does one decide if thresholding is possible?
How does one decide on a threshold T ?
126
Gray-Level Image Histogram
• The histogram HI of image I is a graph of
the gray-level frequency of occurrence.
• HI is a one-dimensional function with
domain 0, ... , K-1.
• HI(k) = n if I contains exactly n
occurrences of gray level k; k = 0, ... K-1.
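As a small sketch (assuming I holds integer gray levels 0, ..., K-1), the histogram can be computed by counting occurrences:

```python
import numpy as np

def gray_histogram(I, K=256):
    """H_I(k) = number of occurrences of gray level k in I, for k = 0, ..., K-1."""
    return np.bincount(np.asarray(I, dtype=int).ravel(), minlength=K)[:K]
```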
127
Histogram Appearance
• The appearance of a histogram suggests
much about the image:
[Figure: two histograms H_I(k) over gray levels k = 0, ..., K-1: one concentrated at low gray levels (predominantly dark image), one concentrated at high gray levels (predominantly light image).]
• Could be histograms of underexposed and
overexposed images, respectively.
128
Histogram Appearance
• This histogram may show better use of the
gray-scale range:
[Figure: a histogram H_I(k) spread across the full gray-level range 0, ..., K-1.]
HISTOGRAM DEMO
129
Bimodal Histogram
• Thresholding usually works best when there
are dark objects on a light background.
• Or when there are light objects on a dark
background.
• Images of this type tend to have histograms
with distinct peaks or modes.
• If the peaks are well-separated, threshold
selection is easier.
130
Bimodal Histogram
[Figure: two bimodal histograms H_I(k) over gray levels 0, ..., K-1: one with poorly separated peaks, one with well separated peaks.]
Where to set the threshold in these two cases?
131
Threshold Selection from Histogram
• Placing threshold T between modes may
yield acceptable results.
• Exactly where in between can be difficult
to determine.
[Figure: threshold selection: a bimodal histogram H_I(k) with the threshold T placed in the valley between the two modes.]
132
Multi-Modal Histogram
• The histogram may have multiple modes.
Varying T will give very different results.
[Figure: a multi-modal histogram H_I(k) over gray levels 0, ..., K-1, with several candidate thresholds T.]
133
Flat Histogram
• The histogram may be "flat," making
threshold selection difficult:
[Figure: a flat histogram H_I(k) over gray levels 0, ..., K-1.]
Thresholding DEMO
134
Discussion of Histogram Types
• We'll use the histogram for gray-level processing.
Some general observations:
- Bimodal histograms often imply objects and
background of different average brightness.
- Easier to threshold.
- The ideal result is a simple binary image showing object/background separation, e.g., printed type, blood cells in solution, machine parts.
• The most-used method is Otsu's (here), based on maximizing the separability of the object and background brightness classes. But it errs about as often as other methods.
135
Histogram Types
• Multi-modal histograms often occur in images of
multiple objects of different average brightness.
• “Flat” or level histograms imply more complex
images having detail, non-uniform backgrounds, etc.
• Thresholding rarely gives perfect results.
• Usually, region correction must be applied.
136
Illusions of Shape, Size
and Length
137
Kanizsa Triangle
What tabletop is bigger?
Watch the video!
Which lines are longer?
138
Coming or going?
139
Binary Morphology
• A powerful but simple class of binary
image operators.
• General framework called mathematical
morphology
morphology = shape
• These affect the shapes of objects and
regions.
• All processing done on a local basis.
140
• Morphological operators:
- Expand (dilate) objects
- Shrink (erode) objects
- Smooth object boundaries and
eliminate small regions or holes
- Fill gaps and eliminate 'peninsulas'
• All is accomplished using local logical
operations
141
Structuring Elements or
“Windows”
• A structuring element or window defines
a geometric relationship between a pixel
and its neighbors. Some examples:
142
Windows
• Conceptually, a window is passed over the image,
and centered over each pixel along the way.
• Usually done row-by-row, column-by-column.
• A window is also called a structuring element.
• When centered over a pixel, a logical operation on
the pixels in the window gives a binary output.
• Usually a window is approximately circular so that object/image rotation won't affect processing.
143
[Figure: a structuring element moving over an image.]
144
Formal Definition of Window
• A window is a way of collecting a set of
neighbors of a pixel according to a geometric rule.
• Some typical (1-D row, column) windows:
COL(3)
ROW(3)
COL(5)
ROW(5)
1-D windows: ROW(2M+1) and COL(2M+1)
• Windows almost always cover an odd number of
pixels 2M+1: pairs of neighbors, plus the center
pixel. Then filtering operations are symmetric.
145
• Some 2-D windows:
SQUARE(9)
SQUARE(25)
CROSS(5)
CIRC(13)
CROSS(9)
2-D windows:
SQUARE(2P+1), CROSS(2P+1), CIRC(2P+1)
146
Window Notation
• Formally, a window B is a set of coordinate
shifts Bi = (pi, qi) centered around (0, 0):
B = {B1, ..., B2P+1} = {(p1, q1), ..., (p2P+1, q2P+1)}
Examples - 1-D windows
B = ROW(2P+1) = {(0, -P), ..., (0, P)}
B = COL(2P+1) = {(-P, 0), ..., (P, 0)}
For example, B = ROW(3) = {(0, -1), (0, 0), (0, 1)}
147
2-D Window Notation
B = SQUARE (9) = {(-1, -1) , (-1, 0), (-1, 1),
(0, -1) , (0, 0), (0, 1),
(1, -1) , (1, 0), (1, 1)}
B = CROSS(2P+1) = ROW(2P+1)  COL(2P+1)
For example, B = CROSS(5) = {
(-1, 0),
(0, -1), (0, 0), (0, 1),
(1, 0)
}
148
The Windowed Set
• Given an image I and a window B, define the
windowed set at (i, j) by:
BI(i, j) = {I(i-p, j-q); (p, q) B}
the pixels covered by B when centered at (i, j).
• This formal definition of a simple concept will
enable us to make simple and flexible definitions
of binary filters.
149
• B = ROW(3):
BI(i, j) = {I(i, j-1) , I(i, j), I(i, j+1)}
• B = COL(3):
BI(i, j) = {I(i-1, j) , I(i, j), I(i+1, j)}
• B = SQUARE(9):
BI(i, j) = {I(i-1, j-1) , I(i-1, j), I(i-1, j+1),
I(i, j-1) ,
I(i, j),
I(i, j+1),
I(i+1, j-1) , I(i+1, j), I(i+1, j+1)}
• B = CROSS(5):
BI(i, j) = {
I(i-1, j),
I(i, j-1), I(i, j), I(i, j+1),
I(i+1, j)
}
150
General Binary Filter
• Denote a binary operation G on the windowed set BI(i, j) by
J(i, j) = G{BI(i, j)} = G{I(i-p, j-q); (p, q) ∈ B}
• Perform this at every pixel in the image, giving filtered image
J = G[I, B] = [J(i, j); 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1]
151
Edge-of-Image Processing
• What if a window overlaps "empty space" ?
Our convention: fill the
"empty" window slots with
the nearest image pixel.
This is called replication.
152
Dilation and Erosion Filters
• Given a window B and a binary image I:
J = DILATE(I, B)
if
J(i, j) = OR{BI(i, j)} = OR{I(i-p, j-q); (p, q) ∈ B}
• Given a window B and a binary image I:
J = ERODE(I, B)
if
J(i, j) = AND{BI(i, j)} = AND{I(i-p, j-q); (p, q) ∈ B}
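A direct, unoptimized sketch of these definitions (the border-replication helper and window list are assumptions made for illustration; real code would vectorize this):

```python
import numpy as np

CROSS5 = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]   # window B = CROSS(5) as (p, q) shifts

def _windowed(I, i, j, B):
    """Windowed set B_I(i, j) = {I(i-p, j-q)}, with replication at the image borders."""
    N, M = I.shape
    return [I[min(max(i - p, 0), N - 1), min(max(j - q, 0), M - 1)] for (p, q) in B]

def dilate(I, B=CROSS5):
    """J(i, j) = OR of the windowed set: grows logical '1' regions."""
    return np.array([[max(_windowed(I, i, j, B)) for j in range(I.shape[1])]
                     for i in range(I.shape[0])], dtype=I.dtype)

def erode(I, B=CROSS5):
    """J(i, j) = AND of the windowed set: shrinks logical '1' regions."""
    return np.array([[min(_windowed(I, i, j, B)) for j in range(I.shape[1])]
                     for i in range(I.shape[0])], dtype=I.dtype)
```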
153
Erosion
Dilation
• DILATION
increases the size of
logical ‘1’ (usually
black) objects.
Examples of local
DILATION &
EROSION
computations.
• EROSION
decreases the
size of logical ‘1’
objects.
[Figure: local DILATION (OR) and EROSION (AND) computations on an image I with window B, producing J; the legend marks logical '0', logical '1', and pixels changed by DILATION or EROSION.]
154
Interpreting Dilation & Erosion
Global interpretation of
DILATION:
It is useful to think of the
structuring element as
rolling along all of the
boundaries of all BLACK
objects in the image.
The center point of the structuring element traces out a set of paths that form the boundaries of the dilated image.
Global interpretation of
EROSION:
It is useful to think of the
structuring element as
rolling inside of the
boundaries of all BLACK
objects in the image.
EROSION
&
DILATION
DEMO
The center point of the structuring element traces out a set of paths that form the boundaries of the eroded image.
155
Qualitative Properties of
Dilation & Erosion
Dilation removes holes of
too-small size and gaps or
bays of too-narrow width:
Erosion removes objects of
too-small size and peninsulas
of too-narrow width:
DILATE
ERODE
DILATE
ERODE
156
Majority or “Median” Filter
• Given a window B and a binary image I:
J = MAJORITY(I, B)
if
J(i, j) = MAJ{BI(i, j)} = MAJ{I(i-p, j-q); (p, q) ∈ B}
• Has attributes of dilation and erosion, but
doesn’t change the sizes of objects much.
157
Majority/Median Filter
A
C
MAJ
B
I
B =
J
The majority filter removed the small object A and the small hole B, but did not change the boundary (size) of the larger region C.
Majority Filter DEMO
158
Qualitative Properties of Majority
• Median removes both objects and holes of
too-small size, as well as both gaps (bays)
and peninsulas of too-narrow width.
MAJORITY
MAJORITY
159
3-D Majority Filter Example
• The following example is a 3-D Laser
Scanning Confocal Microscope (LSCM)
image (binarized) of a pollen grain.
Magnification  200x
Examples of 3-D windows: CUBE(125) and CROSS3-D(13)
160
LSCM image of pollen grain
161
Pollen grain image filtered with CUBE(125)
binary majority filter
This could be 3D printed …
162
And we did! One of the first
3D printing jobs ever, back in
the late 1980s.
It was then called “selective
laser sintering”
163
OPEN and CLOSE
• Define new morphological operations by
performing the basic ones in sequence.
• Given an image I and window B, define
OPEN(I, B) = DILATE [ERODE(I, B), B]
CLOSE(I, B) = ERODE [DILATE(I, B), B]
164
OPEN and CLOSE
• In other words,
OPEN = erosion (by B) followed by dilation (by B)
CLOSE = dilation (by B) followed by erosion (by B)
• OPEN and CLOSE are very similar to MEDIAN:
- OPEN removes too-small objects/fingers (better than
MEDIAN), but not holes, gaps, or bays.
- CLOSE removes too-small holes/gaps (better than
MEDIAN) but not objects or peninsulas.
- OPEN and CLOSE generally do not affect object size.
- Thus OPEN and CLOSE are highly biased smoothers.
165
OPEN and CLOSE vs. Majority
OPEN
MAJORITY
CLOSE
OPEN and CLOSE DEMO
166
OPEN-CLOSE and CLOSE-OPEN
• Effective smoothers obtained by sequencing OPEN and
CLOSE:
OPEN-CLOS(I, B) = OPEN [CLOSE (I, B), B]
CLOS-OPEN(I, B) = CLOSE [OPEN (I, B), B]
• These operations are similar (but not identical). They are
only slightly biased towards 1 or 0.
167
Application Example
Simple Task: Measuring Cell Area
(i) Find general cell region by thresholding
(ii) Apply region correction (clos-open)
(iii) Display cell boundary for operator verification
(iv) Compute image cell area by counting pixels
(v) Compute true cell area via perspective projection
168
Measuring Cell Area
Conventional Optical Microscope
169
Cell Area Measurement
Example #1
(a)
(b)
(c)
(d)
Cellular mass
Thresholded
Region corrected
Computed
boundary overlaid
170
Cell Area Measurement
Example #2
(a)
(b)
(c)
(d)
Cellular mass
Thresholded
Region corrected
Computed
boundary overlaid
171
Comments
• Many things can be accomplished with
binarized images – since shape is often
well-preserved.
• However, gray-scales are also important –
next, we will deal with gray scales – but
not shape.
172
Gray-Level
Point Operations
173
Brightness Illusions
174
Mach Bands
Count the black dots
Afterimages
175
176
Squares
A and B
are
identical!
(Ed
Adelson)
177
Let’s play … do you want the white or black pieces?
178
Simple Histogram Operations
• Recall: the gray-level histogram HI of an image I
is a graph of the frequency of occurrence of each
gray level in I.
• HI is a one-dimensional function with domain 0,
... , K-1:
• HI(k) = n if gray-level k occurs (exactly) n
times in I, for each k = 0, ..., K-1.
179
[Figure: a histogram H_I(k) over gray levels k = 0, ..., K-1.]
• HI contains no spatial information - only the
relative frequency of intensities.
• Much useful info is obtainable from HI, such as average brightness:
L_AVE(I) = (1/NM) Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) = (1/NM) Σ_{k=0}^{K-1} k H_I(k)
• Image quality is affected (enhanced, modified) by
altering HI.
180
Average Brightness
• Examining the histogram can reveal
possible errors in the imaging process:
underexposed
Low LAVE
overexposed
High LAVE
• By operating on the histogram, such errors
can be ameliorated.
181
Point Operations
• Point operation: a function f on single pixels in I:
J(i, j) = f[I(i, j)], 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1
• The same function f applied at every (i, j).
• Does not use neighbors of I(i, j).
• They don’t modify spatial relationships.
• They do change the histogram, and therefore the
appearance of the image.
182
Linear Point Operations
• The simplest class of point operations. They offset
and scale the image intensities.
• Suppose -(K-1) ≤ L ≤ K-1. An additive image
offset is defined by
J(i, j) = I(i, j) + L
• Suppose P > 0. Image scaling is defined by
J(i, j) = P·I(i, j)
183
Image Offset
• If L > 0, then J is a brightened version of I.
If L < 0, a dimmed version of I.
• The histogram is shifted by amount L:
HJ(k) = HI(k-L)
Original
L>0
(DEMO)
L<0
Shifted by L
184
Image Scaling
• J(i, j) = P·I(i, j)
• If P > 1, the intensity range is widened.
• If P < 1, the intensity range is narrowed.
• Multiplying by P stretches or compresses the image histogram by a factor P:
HJ(k) = HI(k/P)          (continuous)
HJ(k) = HI[INT(k/P)]     (discrete)
185
Image Scaling
(DEMO)
[Figure: a histogram of width B-A over [A, B] is scaled to width P(B-A) over [PA, PB].]
• An image with a compressed gray level
range generally has reduced visibility – a
washed out appearance (and vice-versa).
186
Linear Point Operations:
Offset & Scaling
• Given reals L and P, a linear point
operation on I is a function
J(i, j) = P·I(i, j) + L
comprising both offset and scaling.
• If P < 0, the histogram is reversed, creating
a negative image. Usually P = -1, L = K-1:
J(i, j) = (K-1) - I(i, j)
(Digital Negative DEMO)
187
Full-Scale Contrast Stretch
• The most common linear point operation.
Suppose I has a compressed histogram:
[Figure: a histogram compressed into [A, B] within the full range 0 to K-1.]
• Let A and B be the min and max gray levels
in I. Define
J(i, j) = P·I(i, j) + L
such that PA+L = 0 and PB + L = (K-1).
188
Full-Scale Contrast Stretch
• Solving these 2 equations in 2 unknowns yields:
P = (K-1)/(B-A)   and   L = -A(K-1)/(B-A)
or
J(i, j) = (K-1)·[I(i, j) - A] / (B - A)
• The result is an image J with a full-range histogram:
FSCS (DEMO)
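A direct sketch of the stretch formula (assuming K = 256 and that I is not constant):

```python
import numpy as np

def full_scale_contrast_stretch(I, K=256):
    """J = (K-1) * (I - A) / (B - A), where A and B are the min and max gray levels of I."""
    I = np.asarray(I, dtype=float)
    A, B = I.min(), I.max()           # assumes B > A
    return np.round((K - 1) * (I - A) / (B - A)).astype(np.uint8)
```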
189
Nonlinear Point Operations
• Now consider nonlinear point functions f
J(i, j) = f[I(i, j)].
• A very broad class of functions!
• Commonly used:
J(i, j) = |I(i, j)|                  (magnitude)
J(i, j) = [I(i, j)]^2                (square-law)
J(i, j) = [I(i, j)]^(1/2)            (square root)
J(i, j) = log[1 + I(i, j)]           (logarithm)
J(i, j) = exp[I(i, j)] = e^I(i,j)    (exponential)
• Most of these are special-purpose, for example…
190
Logarithmic Range
Compression
• Small groupings of very bright pixels may
dominate the perception of an image at the
expense of other rich information that is less
bright and less visible.
• Astronomical images of faint nebulae and
galaxies with dominating stars are an
excellent example.
191
The Rosette Nebula
192
Logarithmic Range
Compression
• Logarithmic transformation
J(i, j) = log[1+I(i, j)]
nonlinearly compresses and equalizes the
gray-scales.
• Bright intensities are compressed much
more heavily - thus faint details emerge.
193
Logarithmic Range
Compression
• A full-scale contrast stretch then utilizes
the full gray-scale range:
[Figure: a typical histogram over [0, K-1], the histogram after logarithmic transformation, and the final stretched-contrast histogram.]
194
Contrast Stretched Rosette
195
Rosette in color
196
Gamma Correction
• Monitors that display images and videos
often have a nonlinear response.
• Commonly an exponential nonlinearity:
Display(i, j) = [I(i, j)]^γ
• Gamma correction is (digital) pre-processing to correct the nonlinearity:
J(i, j) = [I(i, j)]^(1/γ)
• Then
Display(i, j) = [J(i, j)]^γ = I(i, j)
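A minimal sketch of the correction step (assuming intensities normalized to [0, 1], as the next slide notes):

```python
import numpy as np

def gamma_correct(I, gamma=2.2):
    """J = I**(1/gamma), so a display applying Display = J**gamma reproduces I."""
    return np.clip(np.asarray(I, dtype=float), 0.0, 1.0) ** (1.0 / gamma)
```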
197
Gamma Correction
• For a CRT (e.g., analog NTSC TV), typically γ = 2.2.
• This is accomplished by mapping all
luminances (or chrominances) to a [0, 1] range.
• Hence black (0) and white (1) are unaffected.
• Plasma and LCD displays have linear characteristics, hence do not need gamma correction. But many devices that feed them still gamma correct, hence a reverse nonlinearity is often needed.
198
Gamma Correction
199
Histogram Distribution
• An image with a flat histogram makes rich use of the
available gray-scale range. This might be an image with
- Smooth changes in intensity across many gray levels
- Lots of texture covering many gray levels
• We can obtain an image with an approximately flat
histogram using nonlinear point operations.
200
Normalized Histogram
• Define the normalized histogram:
p_I(k) = (1/MN) H_I(k) ;  k = 0, ..., K-1
• These values sum to one:
Σ_{k=0}^{K-1} p_I(k) = 1
• Note that pI(k) is the probability that gray-level k will occur
(at any given coordinate).
201
Cumulative Histogram
• The cumulative histogram is
P_I(r) = Σ_{k=0}^{r} p_I(k) ;  r = 0, ..., K-1
which is non-decreasing; also, P_I(K-1) = 1.
• Probabilistic interpretation: at any (i, j):
P_I(r) = Pr{I(i, j) ≤ r}
p_I(r) = P_I(r) - P_I(r-1) ;  r = 0, ..., K-1
202
Continuous Histograms
• Suppose p(x) and P(x) are continuous: can regard
as probability density (pdf) and cumulative
distribution (cdf).
• Then p(x) = dP(x)/dx.
• We’ll describe histogram flattening for the
continuous case, then extend to discrete case.
203
Continuous Equalization
• Transform (continuous) I, p(x), P(x) into
image K with flat or equalized histogram.
• The following image will have a flattened
histogram with range [0, 1]:
J = P(I)
(J(i, j) = P[I(i, j)] for all (i, j))
204
Continuous Flattening
• Reason: the cumulative histogram Q of J:
Q(x) = Pr{J ≤ x}   (at any pixel (i, j))
     = Pr{P(I) ≤ x} = Pr{I ≤ P^(-1)(x)}
     = P[P^(-1)(x)] = x
hence q(x) = dQ(x)/dx = 1 for 0 < x < 1
• Finally, K = FSCS(J).
205
Discrete Histogram Flattening
• To approximately flatten the histogram of
the digital image I:
• Define the cumulative histogram image
J = PI(I)
so that
J(i, j) = PI[I(i, j)].
• This is the cumulative histogram evaluated
at the gray level of the pixel (i, j).
206
Discrete Histogram Flattening
• Note that
0 ≤ J(i, j) ≤ 1
• The elements of J are approximately
linearly distributed between 0 and 1.
• Finally, let K = FSCS(J) yielding the
histogram-flattened image.
207
Histogram Flattening Example
• Given a 4x4 image I with gray-level range
{0, ..., 15} (K-1 = 15):
I = | 1  2  8  4 |
    | 1  5  1  5 |
    | 3  3  8  3 |
    | 4  2  2 11 |
• The histogram:
k     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
H(k)  0  3  3  3  2  2  0  0  2  0  0  1  0  0  0  0
208
• The normalized histogram:
k     0    1     2     3     4     5    6  7    8    9 10   11   12 13 14 15
p(k)  0  3/16  3/16  3/16  2/16  2/16   0  0  2/16   0  0  1/16   0  0  0  0
• The intermediate image J is computed, followed by the "flattened" image K (after rounding/FSCS):
J = |  3/16   6/16  15/16  11/16 |        K = |  0   3  14   9 |
    |  3/16  13/16   3/16  13/16 |            |  0  12   0  12 |
    |  9/16   9/16  15/16   9/16 |            |  7   7  14   7 |
    | 11/16   6/16   6/16  16/16 |            |  9   3   3  15 |
209
• The new, flattened histogram looks like this:
k     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
H(k)  3  0  0  3  0  0  0  3  0  2  0  0  2  0  2  1
Histogram Flattening
(DEMO)
• The heights H(k) cannot be reduced - only stacked.
• Digital histogram flattening doesn't really "flatten" - it just spreads out the histogram so that it is more nearly flat.
• The spaces that appear are characteristic of
"flattened" histograms - especially when the
original histogram is highly compressed.
210
Multilayer Perceptrons
211
Multilayer Perceptrons (MLPs)
• Neural networks began to fulfill their promise with MLPs.
Often called “vanilla” or “artificial neural networks.”
• Operate by layering perceptrons in a feed-forward manner.
• Can learn to represent data that is not linearly separable, hence
also complex patterns, including those in images.
• However, not by feeding pixels directly into MLPs (yet).
• Uses more powerful and diverse activation functions.
• Can be efficiently trained via a method called backpropagation.
212
Multilayer Perceptrons
• Add a layer between the input layer and the output layer. It will
also consist of:
– summation
– activation function
• The new “hidden layer” (invisible to input and output) can:
– potentially receive all inputs
– potentially feed all output layer nodes
• We will also use different activation functions than the signum
function having better properties, like differentiability.
• We can easily have more outputs.
213
Activation Functions
• The signum function f(x) = sign(x) is discontinuous. Other
functions are possible that can limit or rectify neural responses.
• They are not binary, which affords greater freedom (for
regression, instead of just classification).
• The two most popular activation functions used with ANNs
(before the “deep” era) were the tanh and logistic functions
which are sigmoid (s-shaped) functions.
f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
f(x) = logistic(x) = 1 / (1 + e^(-x))
• Both limit and are differentiable.
214
Multilayer Perceptron Diagram
• This example has two outputs and is fully connected. Every node
in a layer feeds every node in the next layer.
[Figure: a fully connected MLP with input layer i(1), ..., i(P), hidden-layer weights w1, a hidden layer, output-layer weights w2, and two outputs y.]
• Each large node is of the form f( Σ_{r=1}^{R} w_q(r) i_q(r) ), where i_q are the R inputs to a node in the qth layer with weights w_q and activation function f.
215
Still Too Much Computation
• … to operate on pixels directly (back then).
• Huge computation:
– P^2 weights in w2
– Only 2P in w1, but there can be as many outputs as pixels.
• A 1024x1024 image (P = 2^20) implies 2^40 weights in w2!*
• Clearly untenable. The number of nodes, or connections (or both) must somehow be greatly limited.
• Using small images obviously helps. A 128x128 image (P = 2^14) implies P^2 = 2^28 weights in w2. But that's not the answer either…
*Note: 2^40 ≈ 1 Trillion, while 2^28 ≈ 250 Million
216
Features
• Instead, highly informative features would be extracted
from images. Trained on many images’ features.
• Vastly fewer than the number of pixels: often just single
digits, or dozens.
• Could be simple image statistics, Fourier features,
wavelet/bandpass filter data, regional colors, “busyness”
and so on. VAST varieties have been used.
• We’ll talk about MANY later: SIFT, SURF, LBP, etc etc.
• These days they are typically called “handcrafted” as
opposed to “learned.” These days, “handcrafted” is
somewhat of a pejorative in the ML community!
217
Example Feature-Driven MLP
Day vs Night Detector
• Task: determine whether an image was taken in daytime or night
time.
• The following intuitive features might be extracted from each
training image:
– Average luminance (brightness) Lave of the image
– Standard deviation of luminance Ldev of the image
– Color saturation Csat of the image (how widely distributed is color)
– Maximum luminance Lmax of the image
• Are these the “right” features? They make sense, but who knows?
• Normalization: Usually each feature is normalized to [0, 1] (e.g.)
so no feature has an outsized effect (unless that’s desired).
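A sketch of computing these four features (the particular luminance and saturation formulas here are assumptions made for illustration, not the notes' definitions):

```python
import numpy as np

def day_night_features(rgb):
    """Return [Lave, Ldev, Csat, Lmax] for an (H, W, 3) RGB image with values in [0, 1]."""
    lum = rgb.mean(axis=2)                               # simple luminance proxy
    L_ave, L_dev, L_max = lum.mean(), lum.std(), lum.max()
    C_sat = (rgb.max(axis=2) - rgb.min(axis=2)).mean()   # crude measure of color spread
    return np.array([L_ave, L_dev, C_sat, L_max])        # already lies in [0, 1]
```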
218
Day vs Night Detector Network
• A small fully connected network. Every node in a layer feeds
every node in the next layer. Output can be continuous or
thresholded.
[Figure: a small fully connected network. The input i* = (Lave, Ldev, Csat, Lmax) feeds a 5-node layer through weights w1 (size [4, 5]), which feeds a 3-node layer through weights w2 (size [5, 3]), which feeds the single output y ("probability of daytime") through weights w3 (size [3, 1]).]
*Still calling the input vector i
219
[Figure: plots of the two activation functions: tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)), which ranges over (-1, 1), and logistic(x) = 1/(1 + e^(-x)), which ranges over (0, 1).]
220
Training by Backpropagation
• Perhaps the greatest advance in the history of neural networks –
except for the Perceptron, of course.
• “Backprop” is short for “backward propagation of errors.”
• It means the weights w1, …, wL of all layers are adjusted to
minimize the training error w.r.t. the known training labels
(day/night, or whatever).
• It is convenient if the activation function f is differentiable.
• If f(x) = tanh(x), then f'(x) = 1 - tanh^2(x)
• If f(x) = logistic(x), then f'(x) = e^(-x) logistic^2(x)
221
Training by Backpropagation
• Given training labels T = {t_p; 1 ≤ p ≤ P} of a neural network with outputs Z = {z_p; 1 ≤ p ≤ P}, form the MSE loss function*
E = (1/2) Σ_{p=1}^{P} (z_p - t_p)^2
• Then for an arbitrary neuron anywhere in the network, indexed k, with output
y_k = f( Σ_{j=1}^{J} w_k(j) y_j )
• The goal is to minimize the loss function E over all weights w_k = {w_k(j); 1 ≤ j ≤ J} by gradient descent.
*The 1/2 is just to cancel a later term
222
Double Chain Rule
• Differentiate E w.r.t. each weight:
∂E/∂w_k(j) = (∂E/∂y_k) (∂y_k/∂x_k) (∂x_k/∂w_k(j))
• Note that
y_k = f(x_k),    x_k = Σ_{i=1}^{J} w_k(i) y_i
∂x_k/∂w_k(j) = y_j,    ∂y_k/∂x_k = f'(x_k)
• If neuron k is in the output layer (y_k = z_k), then since E = (1/2) Σ_{p=1}^{P} (z_p - t_p)^2,
∂E/∂y_k = ∂E/∂z_k = z_k - t_k
223
Hidden Layer Neuron
• Let neuron k be at an arbitrary location. Then E is a function of all neurons V = {a, b, c, …, z} receiving input from k (in the next layer):
E(y_k) = E(x_a, x_b, ..., x_z)
• Take the total derivative:
∂E/∂y_k = Σ_{v∈V} (∂E/∂x_v)(∂x_v/∂y_k) = Σ_{v∈V} (∂E/∂y_v)(∂y_v/∂x_v) w_k(v)
• A recursion! The derivative w.r.t. y_k can be computed from the derivatives w.r.t. the outputs {y_v} of the next layer.
224
Putting it Together
• So finally
∂E/∂w_k(j) = δ_k y_j
where
δ_k = (∂E/∂y_k)(∂y_k/∂x_k) = { f'(x_k) (y_k - t_k)              if k is an output neuron
                              { f'(x_k) Σ_{v∈V} w_k(v) δ_v      otherwise
• Backprop proceeds by gradient descent. For a learning rate γ,*
Δw_k(j) = -γ ∂E/∂w_k(j) = -γ δ_k y_j
• We will see other loss functions and activation functions later.
225
*Picking γ can be a trial-and-error process!
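A tiny, hedged sketch of backpropagation for a one-hidden-layer MLP with logistic activations and the MSE loss above (layer sizes, initialization, and learning rate are illustrative assumptions, not the notes' implementation):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mlp(X, T, hidden=5, gamma=0.1, epochs=1000, seed=0):
    """Train on (Q, P) features X with targets T in [0, 1] by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1] + 1, hidden))   # +1 row for the unit bias
    W2 = rng.normal(scale=0.5, size=(hidden + 1, 1))
    for _ in range(epochs):
        for q in rng.permutation(len(X)):
            x = np.append(X[q], 1.0)                    # input plus unit bias
            h = np.append(logistic(x @ W1), 1.0)        # hidden outputs plus unit bias
            z = logistic(h @ W2)[0]                     # network output
            d_out = (z - T[q]) * z * (1.0 - z)          # delta = (z - t) f'(x); f' = y(1 - y)
            d_hid = (W2[:-1, 0] * d_out) * h[:-1] * (1.0 - h[:-1])   # recursion to hidden layer
            W2 -= gamma * d_out * h[:, None]            # delta w = -gamma * delta * y
            W1 -= gamma * np.outer(x, d_hid)
    return W1, W2
```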
Training, Validation, and Testing
• MLPs/ANNs are tools to construct algorithms that learn from data
(training) and make predictions (testing) from the learned model.
• There is a Universal Approximation Theorem (Cybenko 1989, Hornik 1991) that states that even a feed-forward MLP with a single hidden layer can approximate any continuous function defined on a compact* subset of R^n arbitrarily closely, provided that the activation function is a bounded, continuous function.
• This suggests the potential of MLPs to be well-trained to make numerical
predictions, but it is no guarantee that an MLP is a good predictor!
• MLPs must be trained on adequately sizable and representative data, and
must be validated.
*Closed and bounded
226
Training, Validation, and Testing
• Basic process: given training samples
T=
 i , t  ,  i , t
1
1
2
2
 ,...,  i Q , t Q 
• Feed sequentially to the MLP, optimizing by backprop using gradient descent (GD). Repeat epochs until a stopping criterion is reached (based on loss), possibly randomly reordering the samples (stochastic GD).
• Once learned, apply the model on a separate validation set
V = {(j_1, v_1), (j_2, v_2), ..., (j_P, v_P)}
on which the network parameters can be tuned (e.g. node density,
#layers, etc) and a stopping point decided (e.g., beyond which the
loss starts to rise from overfitting (too little data, usually).
• Finally, a separate test set is used to measure the performance of the model. All
these can be drawn from a same large dataset but must be disjoint.
• Lastly, the input features are usually normalized to (say) [0, 1] so that the
227
feature range will not affect the results too much.
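A minimal sketch of this split and normalization in NumPy (the 80/10/10 proportions, feature counts, and array names are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 13))      # e.g., 13 texture features per ROI
labels = rng.integers(0, 2, size=1000)          # binary labels (benign/malignant, day/night, ...)

# Normalize each feature to [0, 1] so the feature ranges do not dominate training.
fmin, fmax = features.min(axis=0), features.max(axis=0)
features = (features - fmin) / (fmax - fmin)

# Shuffle once, then carve out disjoint train / validation / test sets.
order = rng.permutation(len(features))
n_train, n_val = 800, 100
train_idx = order[:n_train]
val_idx   = order[n_train:n_train + n_val]
test_idx  = order[n_train + n_val:]

X_train, y_train = features[train_idx], labels[train_idx]
X_val,   y_val   = features[val_idx],   labels[val_idx]
X_test,  y_test  = features[test_idx],  labels[test_idx]
print(len(X_train), len(X_val), len(X_test))    # 800 100 100
```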
Case Study:
Classifying Microcalcifications
228
Classifying Microcalcifications
• Based on a paper by Chan et al. (1997).
• Idea: use texture features on "regions of interest" (ROIs) of mammograms to train an MLP to classify them as benign or malignant.
• Great example of using a small number of
features from a small number of images to train
a MLP classifier.
229
Microcalcifications
Microcalcifications in malignant breast tissue
230
Features
• 13 texture features (simple computed statistics) of ROIs:
  – sample correlation
  – sample entropy
  – sample energy
  – sample inertia
  – inverse difference moment
  – sum average
  – sum entropy
  – difference entropy
  – difference average
  – sum variance
  – difference variance
  – information measure of correlation 1
  – information measure of correlation 2
231
Training an MLP
The MLP
232
Training, Validation, and Testing
• A total of 86 mammograms from 54 cases (26 benign, 28 malignant) were used. All were recommended for surgical biopsy.
• In each training phase, the MLP was trained on 85 and tested on one. Repeated 86 times. Called "leave one out" training.
• MLP output was thresholded to classify.
• 26 of 26 malignant cases identified (100% sensitivity)
• 11 of 28 benign cases identified (39% specificity)
233
Area under ROC
Convergence
Many epochs needed before converging.
234
General Comments
• Studies like this were interesting, but required too much
computation for larger data.
• Training is a huge burden.
• HENCE: MLPs / ANNs remained a topical interest that dwindled through the 1990s.
• However, another method succeeded quite well and
remains popular today.
235
Comments
• We will next look at spatial frequency
analysis and processing of images …
onward to Module 3!
236
Module 3
Fourier Transform





Sinusoidal Image
Discrete Fourier Transform
Meaning of Image Frequencies
Sampling Theorem
Radial Basis Functions and Support
Vector Machines
QUICK INDEX
237
Sinusoidal Images
• An image with the simplest frequency content
is a sinusoidal image.
• A discrete sine image I has elements

  I(i, j) = sin[2π(ui/N + vj/M)]   for 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1

and a discrete cosine image has elements

  I(i, j) = cos[2π(ui/N + vj/M)]

where u, v are integer frequencies in the i- and j-directions (cycles/image).
238
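A minimal NumPy sketch that builds such discrete sinusoidal images (the sizes and frequencies are my own example values):

```python
import numpy as np

N, M = 128, 128          # image dimensions
u, v = 4, 2              # integer frequencies in cycles/image

i, j = np.meshgrid(np.arange(N), np.arange(M), indexing="ij")
cosine_image = np.cos(2 * np.pi * (u * i / N + v * j / M))
sine_image   = np.sin(2 * np.pi * (u * i / N + v * j / M))
print(cosine_image.shape, cosine_image.min(), cosine_image.max())
```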
Spatial Frequencies
239
Radial Frequency
• The radial frequency (how fast the image oscillates in its direction of propagation) is

  Ω = √(u^2 + v^2)

• The angle of the wave (relative to the i-axis) is

  θ = tan^-1(v/u)
(Discrete Sinusoid DEMO)
240
Digital Sinusoidal Example
• Let N = 16, v = 0: I(i) = cos(2πui/16): a cosine wave oriented in the i-direction with frequency u. One row:

  (plots of one row, i = 0, …, 16, for u = 1 or u = 15; u = 2 or u = 14; u = 4 or u = 12; and u = 8)

• Note that I(i) = cos(2πui/16) = cos[2π(16-u)i/16].
• Thus the highest frequency wave occurs at u = N/2 (N is even here). This will be important later.
241
Complex Exponential Image
• We’ll use complex exponential functions to define
the Discrete Fourier Transform.
• Define the 2-D complex exponential:

  exp[-2π√-1 (ui/N + vj/M)]   for 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1

where √-1 is the pure imaginary number.
• The complex exponential allows convenient
representation and manipulation of frequencies.
242
Properties of Complex
Exponential
• We will use the abbreviation

  W_N = exp(-2π√-1 / N)

(N = image dimension).
• Hence

  exp[-2π√-1 (ui/N + vj/M)] = W_N^ui W_M^vj
243
Complex Exponential Image
• Euler's identity:

  W_N = cos(2π/N) - √-1 sin(2π/N)

and

  W_N^ui = cos(2πui/N) - √-1 sin(2πui/N)
• The powers of WN index the frequencies of the
component sinusoids.
244
Simple Properties
  cos[2π(ui/N + vj/M)] = (1/2)[W_N^ui W_M^vj + W_N^-ui W_M^-vj]

  sin[2π(ui/N + vj/M)] = (√-1/2)[W_N^ui W_M^vj - W_N^-ui W_M^-vj]

  |W_N^ui| = √[cos^2(2πui/N) + sin^2(2πui/N)] = 1

  ∠W_N^ui = tan^-1[-sin(2πui/N)/cos(2πui/N)] = -2πui/N
245
Comments
• Using W_N^ui W_M^vj to represent a frequency component oscillating at u (cy/im) and v (cy/im) in the i- and j-directions simplifies things considerably.
• It is useful to think of W_N^ui W_M^vj as a representation of a direction and frequency of oscillation.
246
Complex Exponential
• The complex exponential

  W_N^ui = exp(-2π√-1 ui/N)

is a frequency representation indexed by the exponent ui.
• Minimum physical frequencies: u = kN, k integer:

  W_N^0i = W_N^kNi = 1   for integer i

• Maximum physical frequencies: u = (k+1/2)N (period 2):

  W_N^(kN+N/2)i = W_N^(N/2)i = (-1)^i   (N even)
247
DISCRETE FOURIER
TRANSFORM
• Any N x M image I is uniquely expressed as the weighted sum of a finite number of complex exponential images:

  I(i, j) = (1/NM) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} Ĩ(u, v) W_N^-ui W_M^-vj   (IDFT)

• The weights Ĩ(u, v) are unique.
• The above is the Inverse Discrete Fourier Transform or IDFT.
248
Sum of Waves Concept
• The representation of images by sinusoids can
be thought of in terms of interference:
249
Forward DFT
• The forward transform:

  Ĩ(u, v) = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) W_N^ui W_M^vj   (DFT)

• Essentially the same form as the IDFT. I and Ĩ can be uniquely obtained from one another.
• Remember that (i, j) are space indices, while (u, v) are spatial frequency indices.
250
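With this sign convention the DFT above matches numpy.fft.fft2 directly. A minimal round-trip sketch (the image values here are random, just for illustration):

```python
import numpy as np

N, M = 64, 64
I = np.random.default_rng(0).random((N, M))

I_dft = np.fft.fft2(I)               # forward DFT: Itilde(u, v)
I_back = np.fft.ifft2(I_dft)         # inverse DFT (includes the 1/NM factor)

print(np.allclose(I, I_back.real))   # True: the DFT pair is exact
print(I_dft.shape)                   # same N x M dimensions as the image
```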
DFT Matrix
• The DFT has the same dimensions (N x M) as the image I:

  Ĩ = {Ĩ(u, v); 0 ≤ u ≤ N-1, 0 ≤ v ≤ M-1}

• It is a linear transformation:

  DFT{a_1 I_1 + a_2 I_2 + … + a_L I_L} = a_1 Ĩ_1 + a_2 Ĩ_2 + … + a_L Ĩ_L
251
DFT Matrix Properties
• The DFT is generally complex:

  Ĩ = Ĩ_Real + √-1 Ĩ_Imag

where

  Ĩ_Real(u, v) = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) cos[2π(ui/N + vj/M)]

  Ĩ_Imag(u, v) = -Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) sin[2π(ui/N + vj/M)]
252
DFT Phasor
• The complex DFT has magnitude and phase:

  |Ĩ| = {|Ĩ(u, v)|; 0 ≤ u ≤ N-1, 0 ≤ v ≤ M-1}
  ∠Ĩ = {∠Ĩ(u, v); 0 ≤ u ≤ N-1, 0 ≤ v ≤ M-1}

where

  |Ĩ(u, v)| = √[Ĩ_Real^2(u, v) + Ĩ_Imag^2(u, v)]
  ∠Ĩ(u, v) = tan^-1[Ĩ_Imag(u, v)/Ĩ_Real(u, v)]

hence

  Ĩ(u, v) = |Ĩ(u, v)| exp[√-1 ∠Ĩ(u, v)]
253
The Importance of Phase
• As in 1-D the magnitude of the DFT is
displayed most often.
• The DFT phase usually appears unrevealing
• Yet the phase is at least as important
Example of Importance of Phase
254
Symmetry of the DFT
• The DFT is conjugate symmetric:

  Ĩ(N-u, M-v) = Ĩ*(u, v) ;  0 ≤ u ≤ N-1, 0 ≤ v ≤ M-1

since

  Ĩ(N-u, M-v) = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) W_N^(N-u)i W_M^(M-v)j
              = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) W_N^Ni W_N^-ui W_M^Mj W_M^-vj
              = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) [W_N^ui W_M^vj]*
              = Ĩ*(u, v)
255
More Symmetry Properties
• The symmetry of the DFT matrix implies that it is
redundant.
• We also have

  Ĩ_Real(N-u, M-v) = Ĩ_Real(u, v)
  Ĩ_Imag(N-u, M-v) = -Ĩ_Imag(u, v)

and

  |Ĩ(N-u, M-v)| = |Ĩ(u, v)|
  ∠Ĩ(N-u, M-v) = -∠Ĩ(u, v)

for 0 ≤ u ≤ N-1, 0 ≤ v ≤ M-1.
256
Displaying the DFT
• The DFT of an image is usually displayed as images of
magnitude and of phase.
• The magnitude and phase values are given gray-scale
values / intensities.
• The phase is usually visually meaningless.
• The magnitude matrix is usually logarithmically transformed (followed by a FSCS) prior to display:

  log[1 + |Ĩ(u, v)|]
257
(example: an image I, its DFT magnitude |Ĩ|, and the log-transformed magnitude log(1 + |Ĩ|); the bright DC term lies at the origin)
258
• Note that the coefficients of the highest physical frequencies are located near the center of the DFT matrix: near (u, v) = (N/2, M/2).

(diagram: in the uncentered DFT, low frequencies cluster near the four corners (0, 0), (0, M-1), (N-1, 0), (N-1, M-1), while high frequencies lie near the center)
259
Periodicity of the DFT
• The DFT matrix is finite (N x M):

  Ĩ = {Ĩ(u, v); 0 ≤ u ≤ N-1, 0 ≤ v ≤ M-1}

• Yet if the indices are allowed to range outside of 0 ≤ u ≤ N-1, 0 ≤ v ≤ M-1, then the DFT is periodic with periods N and M:

  Ĩ(u+nN, v+mM) = Ĩ(u, v)

for any integers n and m.
260
Proof of DFT Periodicity
  Ĩ(u+nN, v+mM) = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) W_N^(u+nN)i W_M^(v+mM)j
                = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) W_N^ui W_M^vj W_N^nNi W_M^mMj
                = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) W_N^ui W_M^vj = Ĩ(u, v)

• This is called the periodic extension of the DFT.
261
(0,0)
Periodic extension of DFT
262
Periodic Extension of Image
• The IDFT equation

  I(i, j) = (1/NM) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} Ĩ(u, v) W_N^-ui W_M^-vj

implies the periodic extension of the image as well:

  I(i+nN, j+mM) = I(i, j)
263
Proof of Image Periodicity
  I(i+nN, j+mM) = (1/NM) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} Ĩ(u, v) W_N^-u(i+nN) W_M^-v(j+mM)
                = (1/NM) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} Ĩ(u, v) W_N^-ui W_M^-vj W_N^-unN W_M^-vmM
                = (1/NM) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} Ĩ(u, v) W_N^-ui W_M^-vj = I(i, j)

• When the DFT is used, it implies the periodicity of the image. This is important when the DFT is used – e.g. for convolution.
264
Periodic extension of image
265
Centering the DFT
• Usually, the DFT is displayed with DC coordinate
(u, v) = (0, 0) at the center.
• Then low frequency info (which dominates most
images) will cluster at the center of the display.
• Centering is accomplished by taking the DFT of the
alternating image:
  (-1)^(i+j) I(i, j)
• This is for display only!
266
Centering the DFT
• Note that

  (-1)^(i+j) = (-1)^i (-1)^j = W_N^-iN/2 W_M^-jM/2

so

  DFT{(-1)^(i+j) I(i, j)} = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) (-1)^(i+j) W_N^ui W_M^vj
                          = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) W_N^ui W_M^vj W_N^-(N/2)i W_M^-(M/2)j
                          = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) W_N^(u-N/2)i W_M^(v-M/2)j
                          = Ĩ(u - N/2, v - M/2)
267
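In practice the centered display is usually obtained with np.fft.fftshift; the (-1)^(i+j) trick gives the same centered magnitude. A minimal sketch:

```python
import numpy as np

N, M = 64, 64
rng = np.random.default_rng(0)
I = rng.random((N, M))

# Method 1: modulate by (-1)^(i+j) before taking the DFT.
i, j = np.meshgrid(np.arange(N), np.arange(M), indexing="ij")
centered1 = np.fft.fft2(I * (-1.0) ** (i + j))

# Method 2: take the DFT, then swap quadrants with fftshift.
centered2 = np.fft.fftshift(np.fft.fft2(I))

print(np.allclose(np.abs(centered1), np.abs(centered2)))   # True

# Log-transformed magnitude for display (DC now at the center).
display = np.log1p(np.abs(centered2))
```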
Shifted (centered) DFT
from periodic extension
268
Centered DFT
(diagram: the centered DFT, with the DC term at the center, low frequencies near the center, and high frequencies near the corners (±N/2, ±M/2))

• DFT Example DEMO
269
Computation of the DFT
• Fast DFT algorithms collectively referred to
as the Fast Fourier Transform (FFT).
• We won’t study these – take a DSP class.
• Available in any math software library.
• Forward and inverse DFTs essentially
identical.
270
THE MEANING OF
IMAGE FREQUENCIES
• Easy to lose the meaning of the DFT and
frequency content in the math.
• We may regard the DFT magnitude as an image
of frequency content.
• Bright regions in the DFT magnitude "image"
correspond to frequencies having large
magnitudes in the actual image.
• DFT Examples: DEMO
271
IMAGE GRANULARITY
• Large DFT coefficients near the origin suggest
smooth image regions.
• Images are positive, so DFTs usually have a
large peak at (u, v) = (0, 0).
• The distribution of DFT coefficients is related
to the granularity / “busy-ness” of the image.
272
MASKING DFT
GRANULARITY
• Define toroidal zero-one masks (white = 1)
low-frequency mask
mid-frequency mask
high-frequency mask
• Masking (multiplying) a DFT with these will
produce IDFT images with only low-, middle-, or
high frequencies: DEMO
273
Image Directionality
• Large DFT coefficients along certain
orientations correspond to highly
directional image patterns.
• The distribution of DFT coefficients as a
function of angle relative to the axes is
related to the directionality of the image.
274
MASKING DFT
DIRECTIONALITY
• Define oriented, angular zero-one masks:
• The frequency origin is at the center
of each mask: DEMO
275
THE DISCRETE-SPACE
FOURIER TRANSFORM (DSFT)
• The DFT I of I is NOT the Fourier Transform of I!
The Discrete-Space Fourier Transform:


I D (ω, λ) =   I(i, j) e

i=- j=-
I(i, j) =
1
(2π)
π π
2
- -1 ωi+λj
-π -π ID (ω, λ) e
-1 ωi+λj
(DSFT)
dω dλ (IDSFT)
276
Discrete-Space Fourier Transform
(DSFT)
• 2pPeriodic in frequency along both axes.
• No implied spatial periodicity.
• Continuous in spatial frequency.
• Transform is asymmetric (sum / integral).
277
Relating DSFT to DFT
• The DFT is obtained by sampling the DSFT:

  Ĩ(u, v) = Ĩ_D(ω, λ)|_{ω = 2πu/N, λ = 2πv/M} ;  u = 0, …, N-1, v = 0, …, M-1

• Although the DFT samples the DSFT, it is a complete description of the image I.
278
Sampling
• Let’s study the relationships between the
DFT/DSFT and the Fourier transform of the
original, unsampled image.
• Digital image I is a sampled version of a continuous distribution of image intensities I_C(x, y) incident upon a sensor.
279
Continuous Fourier Transform
• The continuous image I_C(x, y) has a Continuous Fourier Transform (CFT) Ĩ_C(Ω, Λ), where (x, y) are space coordinates and (Ω, Λ) are space frequencies:

  Ĩ_C(Ω, Λ) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} I_C(x, y) e^(-√-1 (xΩ+yΛ)) dx dy   (CFT)

  I_C(x, y) = [1/(2π)^2] ∫_{-∞}^{∞} ∫_{-∞}^{∞} Ĩ_C(Ω, Λ) e^(√-1 (xΩ+yΛ)) dΩ dΛ   (ICFT)
280
Continuous Fourier Transform
(CFT)
• Not periodic in frequency or space.
• Continuous in spatial frequency.
• Transform is symmetric (integral/integral).
281
Relating the CFT and DSFT/DFT
• Assume Ĩ_C(Ω, Λ) is bandlimited, or zero outside a certain range of frequencies:

  Ĩ_C(Ω, Λ) = 0   for |Ω| > Ω_0, |Λ| > Λ_0

(diagram: the CFT support is contained in the rectangle |Ω| ≤ Ω_0, |Λ| ≤ Λ_0)
282
On Bandlimitedness
• Any real-world image is effectively bandlimited (its CFT becomes vanishingly small for large Ω, Λ).
• If it were not so, the image would contain infinite energy:

  ∫∫ |I_C(x, y)|^2 dx dy = [1/(2π)^2] ∫∫ |Ĩ_C(Ω, Λ)|^2 dΩ dΛ
283
Image Sampling
• I(i, j) samples the continuous image with spacings X, Y in the x-, y-directions:

  I(i, j) = I_C(iX, jY) ;  i = 0, …, N-1, j = 0, …, M-1

• The DSFT and CFT are related by:

  Ĩ_D(ω, λ) = (1/XY) Σ_{n=-∞}^{∞} Σ_{m=-∞}^{∞} Ĩ_C(Ω - n/X, Λ - m/Y)|_{(Ω, Λ) = (ω/2πX, λ/2πY)}

            = (1/XY) Σ_{n=-∞}^{∞} Σ_{m=-∞}^{∞} Ĩ_C[(ω - 2πn)/2πX, (λ - 2πm)/2πY]
284
Relating DFT to CFT
• Have:

  Ĩ(u, v) = Ĩ_D(ω, λ)|_{ω = 2πu/N, λ = 2πv/M}

          = (1/XY) Σ_{n=-∞}^{∞} Σ_{m=-∞}^{∞} Ĩ_C[(1/X)(u/N - n), (1/Y)(v/M - m)]

• A sum of shifted versions of the sampled CFT. It is periodic in the u- and v-directions with periods (1/X) and (1/Y), respectively.
285
(diagram: the replicated copies of the CFT tile the (u, v) frequency plane with spacings 1/X and 1/Y)
286
Unit-Period Case
• We can always set X = Y = 1:

  Ĩ(u, v) = Σ_{n=-∞}^{∞} Σ_{m=-∞}^{∞} Ĩ_C(u/N - n, v/M - m)
287
Relating the Transforms
          SPACE                      SPATIAL FREQUENCY
  CFT     Not periodic, not sampled  Not periodic, not sampled
  DSFT    Not periodic, sampled      Periodic, not sampled
  DFT     Periodic, sampled          Periodic, sampled
288
Sampling Theorem
• If Ω_0 > 1/(2X) or Λ_0 > 1/(2Y), the replicas of the CFT will overlap (sum up), distorting them. This is called aliasing.
• The digital image can be severely distorted.
• To avoid aliasing, the sampling frequencies (1/X) and (1/Y) must be at least twice the highest frequencies Ω_0 and Λ_0 in the continuous image.
289
Comments on Sampling
• A mathematical reason why images must
be sampled sufficiently densely. If violated,
image distortion can be visually severe.
• If the Sampling Theorem is satisfied, then the DFT consists of periodic (sampled) replicas of the CFT.
290
Aliased Chirp Image
• A chirp image
I(x, y) = A 1+cos(f (x, y))  =A 1+cos(ax 2 + by2 ) 
has instantaneous spatial frequencies
(Winst , L inst ) =  φ x (x, y),φ y (x, y)  = 2  ax, by 
which increase linearly away from the origin.
291
Aliased Image
Sand Dune Image
Centered DFT Showing Aliasing
292
Some Important Closed-Form DFTS
• Only a few closed form DFTs.
• Let I(i, j) = c for 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1
• Then

  Ĩ(u, v) = cNM δ(u, v)

where

  δ(u, v) = {1; u = v = 0;  0; else} = the unit impulse
293
2-D Unit Pulse Image
• Let I(i, j) = c·δ(i, j)
• Then

  Ĩ(u, v) = Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} c δ(i, j) W_N^ui W_M^vj = c W_N^0 W_M^0 = c   (constant DFT)
• This is extremely important!
294
Cosine Wave Image
• Let

  I(i, j) = d·cos[2π(bi/N + cj/M)] = (d/2)[W_N^bi W_M^cj + W_N^-bi W_M^-cj]

• Then

  Ĩ(u, v) = (d/2) Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} [W_N^bi W_M^cj + W_N^-bi W_M^-cj] W_N^ui W_M^vj
          = (d/2) Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} [W_N^(u+b)i W_M^(v+c)j + W_N^(u-b)i W_M^(v-c)j]
          = (d/2) NM [δ(u-b, v-c) + δ(u+b, v+c)]
295
Sine Wave Image
• Likewise, if

  I(i, j) = d·sin[2π(bi/N + cj/M)]

then

  Ĩ(u, v) = (d/2) NM √-1 [δ(u-b, v-c) - δ(u+b, v+c)]

• Sinusoids are concentrated single frequencies.
296
DFT AS SAMPLED CFT
• Now some CFT pairs that are either
difficult to express or are too lengthy to do
by hand as DFT (or even as DSFT) pairs.
• If sampled adequately, these are good
approximations to the DSFT/DFT (with
appropriate scaling).
297
Rectangle Function
• Let

  I_C(x, y) = c·rect(x/A)·rect(y/B) = {c; |x| ≤ A/2, |y| ≤ B/2;  0; else}

• Then

  Ĩ_C(Ω, Λ) = c·A·B·sinc(AΩ)·sinc(BΛ)

where

  rect(x) = {1; |x| ≤ 1/2;  0; else}   and   sinc(x) = sin(πx)/(πx)
298
Sinc Function
• Let

  I_C(x, y) = c·sinc(ax)·sinc(by)

then

  Ĩ_C(Ω, Λ) = c·rect(Ω/2a)·rect(Λ/2b) = {c; |Ω| ≤ a, |Λ| ≤ b;  0; else}
299
Gaussian Function
• Let

  I_C(x, y) = exp[-(x^2 + y^2)/σ^2]

then

  Ĩ_C(Ω, Λ) = exp[-2π^2 σ^2 (Ω^2 + Λ^2)]
• The Fourier transform of a Gaussian is also Gaussian
– an unusual property.
• DEMO
300
Radial Basis Function
Networks
301
RBF Nets
• Shallow (3-layer, including the input) networks with a
specific type of activation function.
• Also uses a concept of a center vector.
• Given an input I = {i(p); p = 1, …, P} each neuron
indexed k of the hidden layer has a center vector ck.
• The center vectors are initialized randomly or by
clustering (e.g., using k-means).
302
RBF Nets
K
• The network output is then: f(i ) =  w(k)ρ  i -c k
k 1

• Usually f is gaussian and the norm is Euclidean distance so
ρ  i -c k
 = exp  β i -c
2
k

• Usually a normalized version is used:
K
f(i) =
 w(k)ρ  i-c 
k
k 1
K
 ρ  i-c 
k 1
k
• RBF networks are also universal approximators, and perform well
with a shallow architecture.
303
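A minimal NumPy sketch of the normalized RBF forward pass (the centers, weights, and β here are arbitrary illustrative values, not a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
K, P = 5, 3                              # number of centers, input dimension
centers = rng.standard_normal((K, P))    # center vectors c_k (e.g., from k-means)
weights = rng.standard_normal(K)         # output weights w(k)
beta = 1.0

def rbf_output(i):
    # rho(||i - c_k||) = exp(-beta * ||i - c_k||^2) for each center
    rho = np.exp(-beta * np.sum((centers - i) ** 2, axis=1))
    # normalized RBF output
    return np.dot(weights, rho) / np.sum(rho)

print(rbf_output(rng.standard_normal(P)))
```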
Support Vector Machines
304
Recall the Perceptron
• It's a basic binary classifier:

  (diagram: the input vector i = {i(1), …, i(P)} is weighted by w(1), …, w(P), summed, and thresholded to give the output y)

  y = sgn(w^T i - b)
305
Maximum Margin Solution
• Assume linearly separable, normalized training data.
• Then there will be a pair of parallel lines / planes / hyperplanes having maximum separation / distance d.
• The distance d is the margin, with width 2/||w|| given by the formula for the distance from a point to a line / plane / hyperplane.
• The margin boundaries are w^T i - b = +1 and w^T i - b = -1; the separating hyperplane is w^T i - b = 0. The training points lying on the boundaries are the support vectors.
• Want the separation as large as possible, so ||w|| small.
306
Empty Margin
• Can't have points fall within the margin, so when
    y_p = +1 we need w^T i_p - b ≥ +1
    y_p = -1 we need w^T i_p - b ≤ -1
or
    y_p(w^T i_p - b) ≥ 1   for p = 1, …, P.
• The optimization is then
    minimize ||w|| subject to y_p(w^T i_p - b) ≥ 1 for p = 1, …, P.
  Note: This is completely defined by just the support vectors!
• These are hard constraints, so this is called a hard margin classifier.
307
Hinge Function
• For linearly separable classes the hard classifier is the SVM!
• If nonlinearly separable, define the hinge loss function:

  max(0, 1 - y_p[w^T i_p - b])   for p = 1, …, P

where y_p is the pth known label and w^T i_p - b is the pth output.
• The hinge loss is zero when i_p is on the correct side of the margin; otherwise it grows with the distance from the margin. Minimize the soft margin problem:

  (1/P) Σ_{p=1}^{P} max(0, 1 - y_p[w^T i_p - b]) + λ||w||^2
308
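A minimal sketch of minimizing this soft-margin objective by subgradient descent in NumPy (the synthetic data, λ, step size, and iteration count are illustrative choices of mine, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 200, 2
# Two roughly separable classes with labels y in {-1, +1}.
X = np.vstack([rng.normal(-2, 1, (P // 2, D)), rng.normal(+2, 1, (P // 2, D))])
y = np.hstack([-np.ones(P // 2), np.ones(P // 2)])

w, b, lam, step = np.zeros(D), 0.0, 0.01, 0.05
for _ in range(500):
    margins = y * (X @ w - b)
    viol = margins < 1                       # samples inside or beyond the margin
    # Subgradient of (1/P) sum hinge + lam * ||w||^2
    grad_w = -(y[viol, None] * X[viol]).sum(axis=0) / P + 2 * lam * w
    grad_b = y[viol].sum() / P
    w -= step * grad_w
    b -= step * grad_b

accuracy = np.mean(np.sign(X @ w - b) == y)
print(w, b, accuracy)
```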
Kernel Trick
•
Brilliant idea due to Vladimir Vapnik: use a kernel to cast
the data into a higher dimensional feature space, where the
data classes may nicely separate!
Nonlinear
mapping
to a highdimensional
space.
309
Kernel Trick
• HOW: The details are beyond our class, but for a start, the
minimization of the linear SVM functional can be solved by
transforming it into a quadratic optimization problem.
• This has the great advantage of avoiding local minima of
the original functional.
• Solving the quadratic problem requires transforming it into a
dual problem which is expressed in terms of dot products
between input samples ip and iq, p, q = 1, …, P.
• The kernel trick is to replace these dot products by kernel functions of the data: i_p · i_q → k(i_p, i_q).
• Most commonly k is gaussian: k(i_p, i_q) = exp(-λ||i_p - i_q||^2).
310
SVM Summary
• The “deep network” of its day. Extremely popular and used in
numerous image analysis problems. We’ll see some later.
• Can handle very high-dimensional problems with ease – such
as image analysis!
• Computationally easy since a shallow network architecture!
• Overtraining much less of an issue than deep networks.
• Still widely used when there isn’t enough data to train a deep
neural network or when the problem isn’t too difficult.
311
Comments
• We now have a basic understanding of frequency-domain concepts.
• Let’s put them to use in linear filtering
applications… onward to Module 4.
312
Module 4
Linear Image Filtering





Wraparound and Linear Convolution
Linear Image Filters
Linear Image Denoising
Linear Image Restoration (Deconvolution)
Filter Banks
QUICK INDEX
313
WRAPAROUND CONVOLUTION
• Modifying the DFT of an image changes its
appearance. For example, multiplying a
DFT by a zero-one mask predictably
modifies image appearance:
314
Multiplying DFTs
• What if two arbitrary DFTs are (pointwise)
multiplied or one (pointwise) divides the other?
J = I1  I 2 or
J = I1  I 2
• The answer has profound consequences in image
processing.
• Division is a special case which need special
handling if I 2contains near-zero or zero values.
315
Multiplying DFTs
• Consider the product

  J̃ = Ĩ_1 Ĩ_2

• This has inverse DFT

  J(i, j) = (1/NM) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} J̃(u, v) W_N^-ui W_M^-vj

          = (1/NM) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} Ĩ_1(u, v) Ĩ_2(u, v) W_N^-ui W_M^-vj

          = (1/NM) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} [Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} I_1(m, n) W_N^um W_M^vn] [Σ_{p=0}^{N-1} Σ_{q=0}^{M-1} I_2(p, q) W_N^up W_M^vq] W_N^-ui W_M^-vj
316
317
  = (1/NM) Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} I_1(m, n) Σ_{p=0}^{N-1} Σ_{q=0}^{M-1} I_2(p, q) Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} W_N^u(p+m-i) W_M^v(q+n-j)

  = (1/NM) Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} I_1(m, n) Σ_{p=0}^{N-1} Σ_{q=0}^{M-1} I_2(p, q) · NM · δ(p+m-i, q+n-j)

  = Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} I_1(m, n) I_2(⟨i-m⟩_N, ⟨j-n⟩_M)

  = Σ_{p=0}^{N-1} Σ_{q=0}^{M-1} I_1(⟨i-p⟩_N, ⟨j-q⟩_M) I_2(p, q)

  = I_1(i, j) ⊛ I_2(i, j)

where Ĩ_1 Ĩ_2 ↔ the wraparound convolution of I_1 with I_2.
Note: ⟨p⟩_N = p mod N
318
Wraparound Convolution
• The summation

  J(i, j) = I_1(i, j) ⊛ I_2(i, j) = Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} I_1(m, n) I_2(⟨i-m⟩_N, ⟨j-n⟩_M)

is also called cyclic convolution and circular convolution.
• Like linear convolution, it is an inner product between one sequence and a (doubly) reversed, shifted version of the other – except with indices taken modulo-N, M.
319
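A quick NumPy check that pointwise-multiplied DFTs correspond to wraparound (circular) convolution; the small array sizes are just for speed:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 8
I1, I2 = rng.random((N, M)), rng.random((N, M))

# Wraparound convolution via the DFT: J = IDFT{ DFT(I1) * DFT(I2) }
J_fft = np.fft.ifft2(np.fft.fft2(I1) * np.fft.fft2(I2)).real

# Direct wraparound convolution with indices taken modulo N, M.
J_direct = np.zeros((N, M))
for i in range(N):
    for j in range(M):
        for m in range(N):
            for n in range(M):
                J_direct[i, j] += I1[m, n] * I2[(i - m) % N, (j - n) % M]

print(np.allclose(J_fft, J_direct))   # True
```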
Depicting Wraparound Convolution
• Consider hypothetical images I_1 and I_2 (both indexed from (0,0)), for which we wish to compute the cyclic convolution at (i, j) in the spatial domain (without DFTs).
320
• Without wraparound:

  (diagram: I_2 doubly-reversed and shifted to position (i, j) over the domain (0,0) to (N-1, M-1))

• Modulo arithmetic defines the product for all 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1.
321
(diagram: overlay of the periodic extension of the shifted, reversed I_2 on the domain (0,0) to (N-1, M-1); the summation occurs over 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1)

• If one 2D function is filtering the other, then it filters together the left/right and top/bottom sides of the image!
322
LINEAR CONVOLUTION
• Wraparound convolution is a consequence of the DFT, which is a sampled DSFT.
• If two DSFTs are multiplied together:

  J̃_D(ω, λ) = Ĩ_D1(ω, λ) Ĩ_D2(ω, λ)

then useful linear convolution results:

  J(i, j) = I_1(i, j) * I_2(i, j)

• Wraparound convolution is an artifact of sampling the DSFT – which causes spatial periodicity.
323
About Linear Convolution
• Most of circuit theory, optics, and analog filter
theory is based on linear convolution.
• And … (linear) digital filter theory also requires
the concept of digital linear convolution.
• Fortunately, wraparound convolution can be
used to compute linear convolution.
324
Linear Convolution by Zero Padding
• Adapting wraparound convolution to do
linear convolution is conceptually simple.
• Accomplished by padding the two image
arrays with zero values.
• Typically, both image arrays are doubled in
size:
325
(diagram: images I_1 and I_2, each zero padded to size 2N x 2M)
• Wraparound eliminated, since the "moving" image is
weighted by zero values outside the image domain.
• Can be seen by looking at the overlaps when computing
the convolution at a point (i, j):
326
Wraparound Cancelling Visualized
Linear convolution by zero padding
• Remember, the summations take place only within the blue shaded square (0 ≤ i ≤ 2N-1, 0 ≤ j ≤ 2M-1).
327
DFT Computation of Linear
Convolution
• Let J′, I′_1, I′_2 be zero-padded 2N x 2M versions of J = I_1 * I_2. Then if

  J′ = I′_1 ⊛ I′_2 = IFFT_2Nx2M{ FFT_2Nx2M(I′_1) · FFT_2Nx2M(I′_2) }

then the N x M image with elements

  J(i, j) = J′(i, j) ;   N/2 + 1 ≤ i ≤ 3N/2,  M/2 + 1 ≤ j ≤ 3M/2

contains the linear convolution result.
328
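A minimal NumPy sketch of linear convolution computed by zero padding and the FFT, checked against a direct spatial-domain sum (small random test images, for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 8
I1, I2 = rng.random((N, M)), rng.random((N, M))

# Zero pad both images to 2N x 2M, multiply their DFTs, and invert.
P1 = np.zeros((2 * N, 2 * M)); P1[:N, :M] = I1
P2 = np.zeros((2 * N, 2 * M)); P2[:N, :M] = I2
J_fft = np.fft.ifft2(np.fft.fft2(P1) * np.fft.fft2(P2)).real

# Direct linear convolution over the full output support.
J_direct = np.zeros((2 * N, 2 * M))
for i in range(2 * N - 1):
    for j in range(2 * M - 1):
        for m in range(max(0, i - N + 1), min(N, i + 1)):
            for n in range(max(0, j - M + 1), min(M, j + 1)):
                J_direct[i, j] += I1[m, n] * I2[i - m, j - n]

print(np.allclose(J_fft, J_direct))   # True: zero padding removes the wraparound
```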
On DFT-Based Linear Convolution
• By multiplying zero-padded DFTs, then taking the IFFT, one obtains

  J̃′ = Ĩ′_1 Ĩ′_2

• The linear convolution is larger than N x M (in fact 2N x 2M), but the interesting part is contained in the N x M image J.
• To convolve an N x M image with a small filter (say P x Q), where P, Q < N, M: pad the filter with zeros to size N x M.
• If P, Q << N, M, it may be faster to perform the linear convolution in the space domain.
329
Direct Linear Convolution
• Assume I_1 and I_2 are not periodically extended (not using the DFT!), and assume that

  I_1(i, j) = I_2(i, j) = 0

whenever i < 0 or j < 0 or i > N-1 or j > M-1.
• In this case

  J(i, j) = I_1(i, j) * I_2(i, j) = Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} I_1(m, n) I_2(i-m, j-n)
330
LINEAR IMAGE FILTERING
• A process that transforms a signal or image I by linear convolution is a type of linear system. Examples:
  – optical image I → series of lenses with MTF H → transformed image J = I*H  (MTF = modulation transfer function)
  – electrical current I → lumped electrical circuit with IR H → output current J = I*H  (IR = impulse response)
  – digital image I → digital image filter H → output image J = I*H  (of interest to us)
331
Goals of Linear Image Filtering
• Process sampled, quantized images to
transform them into
- images of better quality (by some criteria)
- images with certain features enhanced
- images with certain features de-emphasized
or eradicated
332
(Variety of Image Distortions: "Albert" degraded by impulse noise, gaussian white noise, blur, and JPEG compression)
333
Characterizing Linear Filters
• Any linear digital image filter can be characterized in one of two equivalent ways:
  (1) The filter impulse response H = {H(i, j)}
  (2) The filter frequency response H̃ = {H̃(u, v)}
• These are a DFT pair:

  H̃ = DFT{H}   and   H = IDFT{H̃}
334
Frequency Response
• The frequency response describes how the system affects each frequency in an image that is passed through the system.
• Since

  H̃(u, v) = |H̃(u, v)| exp[√-1 ∠H̃(u, v)]

an image frequency component at (u, v) = (a, b) is amplified or attenuated by the amount |H̃(a, b)| and phase-shifted by the amount ∠H̃(a, b).
335
Frequency Response Example
• The input to a system H is a sinusoidal (cosine) image:

  I(i, j) = cos[2π(bi/N + cj/M)] = (1/2)[W_N^bi W_M^cj + W_N^-bi W_M^-cj]

• The output is

  J(i, j) = H(i, j) ⊛ I(i, j) = (1/2) Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} H(m, n) [W_N^-b(i-m) W_M^-c(j-n) + W_N^b(i-m) W_M^c(j-n)]

  = (1/2) W_N^-bi W_M^-cj Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} H(m, n) W_N^bm W_M^cn + (1/2) W_N^bi W_M^cj Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} H(m, n) W_N^-bm W_M^-cn

  = (1/2)[W_N^-bi W_M^-cj H̃(b, c) + W_N^bi W_M^cj H̃(-b, -c)] = |H̃(b, c)| cos[2π(bi/N + cj/M) + ∠H̃(b, c)]

since H̃(-b, -c) = H̃*(b, c).
336
Impulse Response
• The response of system H to the unit impulse

  δ(i, j) = {1; i = j = 0;  0; else}

• An effective way to model responses, since every input image is a weighted sum of unit pulses:

  I(i, j) = Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} I(i-m, j-n) δ(m, n) = Σ_{m=0}^{N-1} Σ_{n=0}^{M-1} I(m, n) δ(i-m, j-n)
337
Linear Filter Design
• Often a filter is to be designed according to
frequency-domain specifications.
• Models of linear distortion in the continuous
domain lead to linear digital solutions.
338
Sampled Analog Specification
• Given an analog or continuous-space spec:

  H_C(x, y) ↔ H̃_C(Ω, Λ)

• Sampled in space (X = Y = 1):

  H(i, j) = H_C(i, j) ;  -∞ < i, j < ∞    (1)

• DFT [by sampling the DSFT of (1)]:

  H̃(u, v) = Σ_{n=-∞}^{∞} Σ_{m=-∞}^{∞} H̃_C(u/N - n, v/M - m)    (2)
339
Simple Design From
Continuous Prototypes
• Two simplest methods of designing linear
discrete-space image filters from continuous
prototypes:
(1) Space-Sampled Approximation
(2) Frequency-Sampled Approximation
• Derive from formulae (1), (2) on previous slide.
340
Space-Sampled Approximation
• Truncate (1):

  H_trunc(i, j) = H_C(i, j)   for 0 ≤ |i| ≤ (N/2)-1, 0 ≤ |j| ≤ (M/2)-1.

• A truncation of the analog spec – Gibbs phenomena will occur at jump discontinuities of H̃_C(Ω, Λ).
• The frequency response is

  H̃_trunc(u, v) ↔ H_trunc(i, j)   (DFT pair)
341
Frequency Sampled Approximation
• Use the m = n = 0 term in (2) – assuming negligible aliasing:

  H̃_fs(u, v) = H̃_C(u/N, v/M)

for 0 ≤ |u| ≤ (N/2)-1, 0 ≤ |v| ≤ (M/2)-1.
• The DSFT is NOT specified between samples.
• The CFT is centered and non-periodic.
• The discrete impulse response is then

  H_fs(i, j) ↔ H̃_fs(u, v)   (DFT pair)
342
Low-Pass, Band-Pass, and
High-Pass Filters
• The terms low-pass, band-pass, and high-pass are
qualitative descriptions of a system's frequency
response.
• "Low-pass" - attenuates all but the "lower" frequencies.
• "Band-pass" - attenuates all but an intermediate range
of "middle" frequencies.
• "High-pass" - attenuates all but the "higher"
frequencies.
• We have seen examples of these: the zero-one
frequency masking results.
343
Generic Uses of Filter Types
• Low-pass filters are typically used to
- smooth noise
- blur image details to emphasize gross features
• High-pass filters are typically used to
- enhance image details and contrast
- remove image blur
• Bandpass filters are usually special-purpose
344
Example Low-Pass Filter
• The gaussian filter with frequency response

  H̃_C(Ω, Λ) = exp[-2(πσ)^2 (Ω^2 + Λ^2)]

hence

  H̃(u, v) = exp{-2(πσ)^2 [(u/N)^2 + (v/M)^2]}

which quickly falls at larger frequencies.
• The gaussian is an important low-pass filter.
345
Gaussian Filter Profile
(plots of one matrix row, v = 0, of H̃(u, v) for N = 32 with σ = 1 and σ = 1.5)
DEMO
346
Example Band-Pass Filter
• Can define a BP filter as the difference of two LPFs
identical except for a scaling factor.
• A common choice in image processing is the
difference-of-gaussians (DOG) filter:
  H̃_C(Ω, Λ) = exp[-2(πσ)^2 (Ω^2 + Λ^2)] - exp[-2(Kπσ)^2 (Ω^2 + Λ^2)]

hence

  H̃(u, v) = exp{-2(πσ)^2 [(u/N)^2 + (v/M)^2]} - exp{-2(Kπσ)^2 [(u/N)^2 + (v/M)^2]}
• Typically, K ≈ 1.5.
347
DOG Filter Profile
(plot of one row of the DOG frequency response, N = 32)
• DOG filters are very useful for image analysis –
and in human visual modelling.
• DEMO – Take K=1.5, s < 5
348
Example High-Pass Filter
• The Laplacian filter is also important:

  H̃_C(Ω, Λ) = A(Ω^2 + Λ^2)

hence

  H̃(u, v) = A[(u/N)^2 + (v/M)^2]

although this is a severely truncated approximation! Best used in combination with another filter (later).
• An approximation to the Fourier transform of the continuous Laplacian:

  ∇^2 = ∂^2/∂x^2 + ∂^2/∂y^2
349
Laplacian Profile
(plot of one row of the Laplacian frequency response, A = 4.5, N = 32)
• DEMO
350
LINEAR IMAGE DENOISING
• Linear image denoising is a process to (try to)
smooth noise without destroying the image
information.
• The noise is usually modeled as additive or
multiplicative.
• We consider additive noise now.
• Multiplicative noise is better handled by homomorphic filtering, which uses a nonlinearity.
351
Additive White Noise Model
• Model additive white noise as an image N with
highly chaotic, unpredictable elements.
• Can be thermal circuit noise, channel noise,
sensor noise, etc.
• Noise may affect the continuous image before sampling:

  J_C(x, y) = I_C(x, y) + N_C(x, y)
  (observed)   (original)   (white noise)
352
Zero-Mean White Noise
• The white noise is zero-mean if the limit of the
average of P arbitrary noise image realizations
vanishes as P → ∞:
  (1/P) Σ_{p=1}^{P} N_C,p(x, y) → 0   for all (x, y) as P → ∞
• On average, the noise falls around the value zero.*
*Strictly speaking, the noise is also "mean-ergodic."
353
Spectrum of White Noise
• The noise spectrum is

  Ñ_C(Ω, Λ) ↔ N_C(x, y)   (CFT pair)

• If the noise is white, then, on average, the magnitude spectrum will be flat (flat spectrum = "white"):

  (1/P) Σ_{p=1}^{P} |Ñ_C,p(Ω, Λ)| → η   for all (Ω, Λ) as P → ∞

• Note: η^2 is called the noise power.
354
White Noise Model
• White noise is an approximate model of additive broadband noise:

  J_C(x, y) = I_C(x, y) + N_C(x, y)
  (observed)   (original)   (broadband noise)

(depiction: a signal profile I_C(x) plus a noise profile N_C(x), and the corresponding spectra |Ĩ_C(Ω)| and |Ñ_C(Ω)| – just a depiction; magnitudes aren't added)
355
Linear Denoising
• Objective: Remove as much of the high-frequency
noise as possible while preserving as much of the
image spectrum as possible.
• Generally accomplished by a LPF of fairly wide
bandwidth (images are fairly wideband):
(sketch: a wide-band low-pass frequency response H̃_C(Ω))
356
Digital White Noise
• We make a similar model for digital zero-mean additive
white noise:
J = I + N

observed
original
noise
• On average, the elements of N will be zero.
• The DFT of the noisy image is the sum of the DFTs of the
original image and the noise image:

J = I + N
observed
original
noise
• On average the noise DFT will contain a broad band of
frequencies.
357
White Noise Maker Demo
Denoising - Average Filter
• To smooth an image: replace each pixel in a noisy image by the average of its M x M neighbors:

  (convolution template: an M x M window, e.g. 3 x 3, with every weight equal to 1/M^2)
358
Average Filter Rationale
• Averaging elements reduces the noise mean
towards zero.
• The window size is usually an intermediate value to
balance the tradeoff between noise smoothing and
image smoothing.
• Typical average filter window sizes: L x L = 3 x 3,
5 x 5 ,..., 15 x 15 (lots of smoothing), e.g. for a 512
x 512 image.
359
Average Filter Rationale
• Linear filtering the image (with zero-padding assumed hereafter)

  K = H*J = H*I + H*N
  K̃ = H̃ J̃ = H̃ Ĩ + H̃ Ñ

will affect the image / noise spectra in the same way:

  (plots of |H̃(u)| for a smaller window L and a larger window L)

DEMO
360
Denoising – Ideal Low-Pass Filter
• Also possible to use an ideal low-pass filter by designing in the DFT domain:

  H̃(u, v) = {1; if √(u^2 + v^2) ≤ U_cutoff;  0; otherwise}

• Possibly useful if it's possible to estimate the highest important radial frequency U_cutoff in the original image.
361
Ideal LPF
(centered DFT diagram: a disk of radius U_cutoff about the origin is passed; everything else is zeroed)
DEMO
362
Denoising - Gaussian Filter
• The isotropic Gaussian filter is an effective smoother:

  H̃(u, v) = exp[-2π^2 σ^2 (u^2 + v^2)/N^2]

• It gives more weight to "closer" neighbors.
• DFT design: Set the half-peak bandwidth to U_cutoff by solving for σ:

  exp[-2π^2 σ^2 U_cutoff^2 / N^2] = 1/2
  ⟹  σ = (N/(πU_cutoff)) √(log 2 / 2) ≈ 0.19 (N/U_cutoff)

DEMO
363
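A minimal NumPy sketch of this design: build the Gaussian H̃(u, v) on a centered frequency grid with σ chosen from a target half-peak cutoff, and apply it to a noisy image in the DFT domain (the test image and U_cutoff are arbitrary illustrative choices):

```python
import numpy as np

N = 256
rng = np.random.default_rng(0)
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
clean = np.cos(2 * np.pi * (4 * i / N + 6 * j / N))        # a smooth test image
noisy = clean + 0.5 * rng.standard_normal((N, N))          # additive white noise

U_cutoff = 32.0
sigma = (N / (np.pi * U_cutoff)) * np.sqrt(np.log(2) / 2)  # half-peak at U_cutoff

# Frequency grid in cycles/image, then the Gaussian LPF H(u, v).
u = np.fft.fftfreq(N) * N
U, V = np.meshgrid(u, u, indexing="ij")
H = np.exp(-2 * (np.pi * sigma) ** 2 * (U ** 2 + V ** 2) / N ** 2)

denoised = np.fft.ifft2(np.fft.fft2(noisy) * H).real
print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))  # the MSE drops
```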
Summary of Smoothing Filters
• Average filter: Noise leakage through frequency ripple (spatial discontinuity).
• Ideal LPF: Ringing from spatial ripple (frequency discontinuity).
• Gaussian: No discontinuities. No leakage, no ringing.

(sketches of each filter in space, H(i), and in frequency, H̃(u))
364
Minimum Uncertainty
• Amongst all real functions and in any dimension, the no-ripple Gaussian functions uniquely minimize the uncertainty principle:

  [∫ |x f(x, y)|^2 dx · ∫ |u f̃(u, v)|^2 du] / [∫ |f(x, y)|^2 dx · ∫ |f̃(u, v)|^2 du] ≥ 1/(4π)^2

• Similar for y, v.
• They have minimal simultaneous space-frequency durations.
365
LINEAR IMAGE DEBLURRING
• Often an image that is obtained digitally has already
been corrupted by a linear process.
• This may be due to motion blur, blurring due to
defocusing, etc.
• We can model such an observed image as the result of a linear convolution:

  J_C(x, y) = G_C(x, y) * I_C(x, y)
  (observed)   (linear distortion)   (original)

so

  J̃_C(Ω, Λ) = G̃_C(Ω, Λ) Ĩ_C(Ω, Λ)
  (observed)   (linear distortion)   (original)
366
Digital Blur Function
• The sampled image will then be of the form (assuming a sufficient sampling rate)

  J = G ⊛ I   hence   J̃ = G̃ Ĩ

• The distortion G is almost always low-pass (blurring).
• Our goal is to use digital filtering to reduce blur – a VERY hard problem!  (famous example)
367
Deblur - Inverse Filter
• Often it is possible to make an estimate of the
distortion G.
• This may be possible by examining the physics of
the situation.
• For example, motion blur (relative camera
movement) is usually along one direction. If this
can be determined, then a filter can be designed.
• The MTF of a camera can often be determined –
and hence, a digital deblur filter designed.
368
Deconvolution
• Reversing the linear blur G is deconvolution. It is done using the inverse filter of the distortion:

  G̃_inverse(u, v) = 1/G̃(u, v)

provided that G̃(u, v) ≠ 0 for any (u, v).
• Then the restored image is:

  K̃ = G̃_inverse G̃ Ĩ = Ĩ !
369
Blur Estimation
• An estimate of the blur G might be obtainable.
• The inverse of a low-pass blur is high-pass:

  (plots: a Gaussian distortion |G̃(u)| and its inverse filter, which grows very large at high frequencies)

• At high frequencies the designer must be careful!
• Note: The inverse takes value 1.0 at (u, v) = (0, 0).
DEMO
370
Deblur - Missing Frequencies
• Unfortunately, things are not always so "ideal" in the
real world.
• Sometimes the blur frequency response takes zero
value(s).
• If G̃(u_0, v_0) = 0 for some (u_0, v_0), then G̃_inverse(u_0, v_0) = ∞, which is meaningless.
371
Zeroed Frequencies
• The reality: any frequencies that are zeroed by a linear
distortion are unrecoverable in practice (at least by
linear means) - lost forever!
• The best that can be done is to reverse the distortion at
the non-zero values.
• Sometimes much of the frequency plane is lost. Some
optical systems remove a large angular spread of
frequencies:
unrecoverable
"zeroed" frequencies
(0, 0)
Frequency Domain
372
Pseudo-Inverse Filter
• The pseudo-inverse filter is defined

  G̃_p-inverse(u, v) = { 1/G̃(u, v) ;  if G̃(u, v) ≠ 0
                        0          ;  if G̃(u, v) = 0 }

• Thus no attempt is made to recover lost frequencies.
• The pseudo-inverse is set to zero in the known region of missing frequencies – a conservative approach.
• In this way spurious (noise) frequencies will be eradicated. DEMO (Noisy case: use σ_noise = 1, σ_blur < 1)
373
Deblur in the Presence of Noise
• A worse case is when the image I is distorted both by
linear blur G and additive noise N:
  J = G ⊛ I + N

• This may occur, e.g., if an image is linearly distorted then sent over a noisy channel.
• The DFT:

  J̃ = G̃ Ĩ + Ñ
374
Filtering a Blurred, Noisy Image
• Filtering with a linear filter H will produce the result

  K = H ⊛ J = H ⊛ G ⊛ I + H ⊛ N

or

  K̃ = H̃ J̃ = H̃ G̃ Ĩ + H̃ Ñ

• The problem is that neither a low-pass filter (which smooths the noise but won't correct the blur) nor a high-pass filter (the inverse filter, which amplifies the noise) will work.
375
Failure of Inverse Filter
• If the inverse filter were used, then

  K = G_inverse ⊛ J = I + G_inverse ⊛ N

or

  K̃ = G̃_inverse J̃ = Ĩ + G̃_inverse Ñ

• In this case the blur is corrected, but the restored image has horribly amplified high-frequency noise added to it.
376
Wiener Filter
• The Wiener filter (after Norbert Wiener) or minimum-mean-square-error (MMSE) filter is a "best" linear approach.
• The Wiener filter for blur G and white noise N is

  G̃_Wiener(u, v) = G̃*(u, v) / [|G̃(u, v)|^2 + η^2]

• Often the noise factor η is unknown or unobtainable. The designer will usually experiment with heuristic values for η.
• In fact, better visual results may often be obtained by adjusting the value of η used in the Wiener filter.
377
Wiener Filter Rationale
• We won't derive the Wiener filter here. But:
• If η = 0 (no noise), the Wiener filter reduces to the inverse filter:

  G̃_Wiener(u, v) = G̃*(u, v)/|G̃(u, v)|^2 = 1/G̃(u, v)

which is highly desirable.
378
Wiener Filter Rationale
• If |G̃(u, v)| = 1 for all (u, v) (no blur), the Wiener filter reduces to:

  G̃_Wiener(u, v) = 1/(1 + η^2)

which does nothing except scale the variance so that the MSE is minimized.
• So, the Wiener filter is not useful unless there is blur.
379
Pseudo-Wiener Filter
• Obviously, if there are frequencies zeroed by the linear distortion G then it is best to define a pseudo-Wiener filter:

  G̃_Wiener(u, v) = { G̃*(u, v) / [|G̃(u, v)|^2 + η^2] ;  if G̃(u, v) ≠ 0
                     0                                ;  if G̃(u, v) = 0 }

• Noise in the "missing region" of frequencies will be eradicated.
• DEMO (σ_blur < 4, σ_noise < 10)
380
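A minimal NumPy sketch of pseudo-inverse vs. Wiener deconvolution in the DFT domain, for a Gaussian blur plus white noise (all parameter values are illustrative only):

```python
import numpy as np

N = 256
rng = np.random.default_rng(0)
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
I = np.cos(2 * np.pi * (3 * i / N + 5 * j / N)) + 0.5 * np.cos(2 * np.pi * 9 * i / N)

# Gaussian blur defined in the DFT domain, plus additive white noise.
u = np.fft.fftfreq(N) * N
U, V = np.meshgrid(u, u, indexing="ij")
G = np.exp(-2 * (np.pi * 3.0) ** 2 * (U ** 2 + V ** 2) / N ** 2)   # blur frequency response
J = np.fft.ifft2(G * np.fft.fft2(I)).real + 0.01 * rng.standard_normal((N, N))

J_dft = np.fft.fft2(J)
eps = 1e-3                                 # treat very small |G| as "zeroed" frequencies
mask = np.abs(G) > eps

# Pseudo-inverse filter: 1/G where G is not (near) zero, else 0.
G_pinv = np.zeros_like(G)
G_pinv[mask] = 1.0 / G[mask]
restored_pinv = np.fft.ifft2(G_pinv * J_dft).real

# Pseudo-Wiener filter with a heuristic noise factor eta.
eta = 0.05
G_wien = np.zeros_like(G)
G_wien[mask] = np.conj(G[mask]) / (np.abs(G[mask]) ** 2 + eta ** 2)
restored_wien = np.fft.ifft2(G_wien * J_dft).real

# The Wiener result balances deblurring against noise amplification;
# the plain pseudo-inverse can amplify the noise badly.
for name, est in [("blurred", J), ("pseudo-inverse", restored_pinv), ("Wiener", restored_wien)]:
    print(name, np.mean((est - I) ** 2))
```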
APPLICATION EXAMPLE:
OPTICAL SERIAL SECTIONING
• Optical systems often blur images:
visible light
blurred image
scene
optical system
one possible solution:
blurred image
digitize
"inverse blur"
computer program
deblurred image
381
Optical Sectioning Microscopy
incremental
vertical translation
of microscope
(step motor)
region of
best focus
• A very narrow-depth of field microscope. One
image taken at each focusing plane, giving a
sequence of 2-D images - or 3-D image of optical
density.
382
3-D Image of Optical Density
(figures: a 3-D image of optical density and the magnitude of its 3-D DFT)
383
3-D Optical System Analysis
• In this system three effects occur:
  (1) A linear low-pass distortion G.
  (2) A large biconic region of frequencies aligned along the optical axis is zeroed.
  (3) Approximately additive white noise.
• Items (1) and (2) are shown using principles of geometric optics. Item (3) is shown empirically.
384
3-D Biconic Spread of Lost Frequencies
(3D frequency domain: the region of zeroed 3-D frequencies is a double cone about the origin (0, 0, 0))
• Note that DC (u, v, w) = (0, 0, 0) is zeroed also.
Hence the background level (AOD) is lost.
• There is no linear filtering way to recover this
biconic region of frequencies.
385
3-D Restoration
• So: the 3-D images are blurred, have a large 3-D
region of missing frequencies, and are corrupted by
low-level white noise added.
• The processed results show the efficacy of
- pseudo-inverse filtering
- pseudo-Wiener filtering
applied to two optical sectioned 3-D images:
- a pollen grain
- a pancreas Islet of Langerhans (collection of cells)
• EXAMPLES
386
Filter Banks
387
Generic Filter Banks
(diagram: the image I passes through a bandpass "analysis" filter bank H_0, …, H_{P-1}, each followed by a down-sampler D; the subband outputs are processed, then up-sampled by U, passed through the bandpass "synthesis" filter bank G_0, …, G_{P-1}, and summed to give the reconstructed image Î)
•
By design of filters Hp the image can be recovered closely or exactly (a
“discrete wavelet transform” or DWT) … if nothing is done in-between.
•
There may or may not be down- and up-samplers.
388
Two Views of Filter Banks
• There are many ways of looking at filter banks.
• We will look at two:
(1) So-called perfect reconstruction filter banks. Typically maximally
sampled for efficiency.
(2) Filter banks, w/o perfect reconstruction and perhaps w/o sampling.
Frequency analyzer filter banks.
• The view (1) is important when complete signal integrity is
required, e.g., compression, denoising, etc
• The view (2) is useful/intuitive when doing image analysis.
• Filter banks can fall in either or both categories
389
Perfect
Reconstruction
Filter Banks
390
Up- and Down-Sampling
1D Down-sampler:  J(i) = I(iD)   (throw away D-1 of every D samples)
1D Up-sampler:  L(i) = {I(i/U); i = kU;  0; else}   (insert U-1 zeros after every sample)

• Usually D = U.
• NOT inverses of each other.
• Down-sampling throws away information.
• Up-sampling does not add information.
391
Sampling
• Critical sampling (D = P). Same number of output points as the input. Possibly no info lost.
• Oversampling (D < P). More points than the input image. May provide resilience.
• Undersampling (D > P). Fewer points than the input image. Info lost.
• Unsampled (D = 1). Highly redundant.

(diagram: the image I feeds P analysis filters H_0, …, H_{P-1}, each followed by ↓D)
392
Analysis Filters
• Typically have separable impulse responses:
Hp(i, j) = Hp(i)Hp(j)
so a filtered image is:
J(i, j) = Hp(i, j)*I(i, j)
= Hp(i)*[Hp(j)*I(i, j)]
a 1-D convolution along columns then rows (or vice-versa).
393
Analysis Filter Types
• Idea: divide the frequency band along each axis:
(sketch: the frequency axis 0 … N-1 divided into P = 4 bands, p = 0, 1, 2, 3, by the responses H̃_p(u))
• Analysis/process image information (compress, feature
extract, noise remove, etc) in each band.
394
Two Band Case
(sketch: P = 2 bands, a "low" band H̃_0(u) and a "high" band H̃_1(u), covering 0 … N-1)
• The two-band case is easy to analyze.
• We can use it to efficiently build filter banks.
395
Two-band Decomposition
(diagram: the image I is filtered along rows by H_0 ("low" band) and H_1 ("high" band), each ↓2, processed, then ↑2, filtered by the synthesis filters G_0 and G_1, and summed to give the reconstructed image Î)

• Naturally, subsampling causes aliasing.
• However, by proper choice of the filters H_p and G_p the aliasing can be canceled.
• In fact they can be chosen so that Î = I.
• The filters H_i and G_i must have length N = a multiple of 2.
• Repeat along columns.
396
Dyadic Sampling
1D Down-sampler (↓2):  J(i) = I(2i)   (throw away every other sample)

  J̃(u) = (1/2)[Ĩ(u) + Ĩ(u + N/2)]   for 0 ≤ u ≤ (N/2)-1   (length N/2)

1D Up-sampler (↑2):  L(i) = {I(i/2); i = 2k, k an integer;  0; else}   (insert a zero after every sample)

  L̃(u) = Ĩ(⟨u⟩_N)   for 0 ≤ u ≤ 2N-1   (length 2N)
397
Perfect Reconstruction Filters (1D)
• If Î = I then H_0, H_1 and G_0, G_1 are proper wavelet filters.
• They obey the perfect reconstruction property

  H̃_0(u)G̃_0(u) + H̃_1(u)G̃_1(u) = 2   for all 0 ≤ u ≤ N-1

and

  H̃_0(u+N/2)G̃_0(u) + H̃_1(u+N/2)G̃_1(u) = 0   for all 0 ≤ u ≤ N-1
• Perfect reconstruction is important for many applications:
when all the image information is needed.
398
PR Wavelet Filters (1D)
 (u) where H (N/2) = 0
• Given the LP analysis filter H0(i)  H
0
0
(zero at highest frequency)
• Define the LP synthesis filter G0(i) = H0(N-1-i) hence
 (u) = H
 (N - u)
G
0
0
• Define the HP analysis filter H1(i) = (-1)iH0(N-1-i) hence
 (u) = H
 (N/2 - u)
H
1
0
• And the HP synthesis filter G1(i) = (-1)iH0(i) hence
 (u) = H
 (u - N/2)
G
1
0
• The synthesis filters are reversed versions of the analysis
filters (hence mirror filters). The HP filters are frequencyshifted (quadrature) versions of the LP filters.
399
Perfect Reconstruction Condition
• In this case the perfect reconstruction condition reduces to:

  |H̃_0(u)|^2 + |H̃_1(u)|^2 = 2   for all 0 ≤ u ≤ N-1

or

  |H̃_0(u)|^2 + |H̃_0(N/2-u)|^2 = 2   for all 0 ≤ u ≤ N-1
400
Comments
• Can show that the PR condition on H0 implies that the DWT
expansion has an orthogonal basis.
• Finite-length discrete PR filters exist: most notably the (low-pass) Daubechies filters, which are maximally smooth at ω = π.
• However, the wavelet filters in an orthogonal DWT cannot be even-symmetric about their spatial center (i.e., not linear or zero phase).
• This advantage is regained using bi-orthogonal wavelets (later).
401
PR Wavelet Filters
• There are an infinite number of wavelet filters.
• We are interested in finite-length, PR, discrete wavelets.
• Can start with any low-pass filter with

  H̃_0(0) = √2   and   H̃_0(N/2) = 0

• Example (Haar 2-tap filter):

  h_0(i) = { 1/√2 ; i = 0
             1/√2 ; i = 1
             0    ; else }

  h_1(i) = { 1/√2  ; i = 0
             -1/√2 ; i = 1
             0     ; else }
402
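A minimal NumPy sketch of one level of a 1-D Haar analysis/synthesis pair, confirming perfect reconstruction on a random signal (this uses the simple sum/difference form of the Haar filters rather than a general filter-bank implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(16)                      # length must be even

s = 1.0 / np.sqrt(2.0)

# Analysis: low band (averages) and high band (differences), each downsampled by 2.
low  = s * (x[0::2] + x[1::2])
high = s * (x[0::2] - x[1::2])

# Synthesis: upsample and combine with the synthesis filters.
x_rec = np.empty_like(x)
x_rec[0::2] = s * (low + high)
x_rec[1::2] = s * (low - high)

print(np.allclose(x, x_rec))            # True: perfect reconstruction
```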
Subband Filters
• Daubechies 4-tap (D4) filter:

  h_0(i) = (1/(4√2)) × { 1+√3 ; i = 0
                         3+√3 ; i = 1
                         3-√3 ; i = 2
                         1-√3 ; i = 3
                         0    ; else }

(plots of the DSFTs H̃_0(ω), H̃_1(ω), G̃_0(ω), G̃_1(ω))

Ingrid Daubechies
Daubechies orthonormal wavelets are popular owing to their nice function approximation properties (which make them good for compression, denoising, interpolation, etc.)
403
Daubechies 4 Wavelet (D4) Filters
Analysis LP
Synthesis LP
Analysis HP
Synthesis HP
(impulse responses)
404
Daubechies 8 Wavelet (D8) Filters
Analysis LP
Synthesis LP
Analysis HP
Synthesis HP
(impulse responses)
405
Daubechies 16 Wavelet (D16) Filters
Analysis LP
Synthesis LP
Analysis HP
Synthesis HP
(impulse responses)
406
Daubechies 32 Wavelet (D32) Filters
Analysis LP
Synthesis LP
Analysis HP
Synthesis HP
(impulse responses)
407
Daubechies D40 Wavelet (D40) Filters
Analysis LP
Synthesis LP
Analysis HP
Synthesis HP
(impulse responses)
408
2-D Wavelet Decompositions
• 2-D analysis filters (implemented separably) decompose
images into high- and low-frequency bands.
• The information in each band can be analyzed separately.
(0, 0)
Low, Low
High, Low
Low, High
High, High
409
Multi-Band 1D Discrete Wavelet Transform
• Filter outputs are the wavelet coefficients.
• Subsampling each filter output yields exactly NM non-redundant DWT coefficients. The image can be exactly reconstructed from them.
• Sub-sampling is heavier at lower frequencies.
• Multiple bands (> 2) are created by iterated filtering on the low-frequency bands.
• Apply to both the rows and columns for 2D.

(diagram: the N x M image I is split by H_0 / H_1 and ↓2 into low and high bands; the low band is split again by H_0 / H_1 and ↓2, and so on)
410
1D Wavelet Packet Transform
(diagram: the wavelet packet transform splits both the low and high bands at every level – each band is filtered by H_0 / H_1 and ↓2, forming a full binary tree)
411
1-D Hierarchies
Discrete Wavelet Transform
(octave bandwidths)
Wavelet Packet Decomposition
(linear bandwidths)
412
Linear vs Octave Bands
• A bandpass filter has upper and lower bandlimits u_high and u_low.
• Linear bandwidth is Blinear = uhigh - ulow (cy/image)
• Octave bandwidth (octaves) is Boctave = log2(uhigh) – log2(ulow)
• A one-octave filter has uhigh = 2ulow.
• The DWT hierarchy gives a filter bank with constant Boctave
413
2-D Wavelet Decompositions
• Frequency division usually done iteratively on the lowfrequency bands.
• Two examples and the DWPT:
(0, 0)
(0, 0)
(0, 0)
LLHL
HL
LLLH LLHH
LH
HH
"Pyramid" Wavelet
Transform
"Tree-Structured"
Wavelet Transform
Wavelet Packet
Transform
414
Pyramid Wavelet Decomposition
Image
Two-Level Wavelet Decomposition
415
1D Inverse DWT
(diagram: each subband is ↑2, filtered by the synthesis filters G_0 / G_1, and summed; the sums cascade back up the tree to produce the reconstructed signal Î)
416
1D Inverse DWPT
(diagram: the inverse wavelet packet transform mirrors the full binary analysis tree – every pair of subbands is ↑2, filtered by G_0 / G_1, and summed, level by level, to produce Î)
417
Frequency
Analyzer
Filter Banks
418
Frequency Analyzer Construction
• Ignore sub-sampling / perfect reconstruction (can still do both).
• Start with a lowpass filter H_0(n) where

  H̃_0(0) ≈ 1   and   H̃_0(N/2) ≈ 0

• Form cosine-modulated versions:

  G_p(n) = H_0(n) cos(2πu_p n/N)

• Then

  G̃_p(u) = (1/2)[H̃_0(u - u_p) + H̃_0(u + u_p)]

are bandpass filters.
419
Frequency Analyzers
• There are many ways to choose the
bandpass filters.
• Things to decide include
  – Filter shapes
  – Filter center frequencies
  – Filter bandwidths
  – Filter separation
  – Filter orientations
420
Filter Shapes
• Infinite possibilities.
• For image analysis the minimum-uncertainty gaussian shape is often desirable.
• A 2-D bandpass filter with (shifted) gaussian shape:

  G̃_p(u, v) = (1/2){ exp[-2(πσ)^2 ((u - u_p)^2 + (v - v_p)^2)] + exp[-2(πσ)^2 ((u + u_p)^2 + (v + v_p)^2)] }

and thus cosine-modulated in space:

  G_p(i, j) = (1/(2πσ^2)) exp[-(i^2 + j^2)/(2σ^2)] cos[2π(u_p i/N + v_p j/M)]

• This is called a Gabor filter. We'll see many of these later!
421
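A minimal NumPy sketch of a spatial-domain Gabor filter of this form (the kernel size, σ, and center frequencies are arbitrary example values):

```python
import numpy as np

def gabor(size, sigma, u_p, v_p, N, M):
    """Gaussian envelope modulated by a cosine at center frequency (u_p, v_p) cycles/image."""
    half = size // 2
    i, j = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1), indexing="ij")
    envelope = np.exp(-(i ** 2 + j ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    carrier = np.cos(2 * np.pi * (u_p * i / N + v_p * j / M))
    return envelope * carrier

g = gabor(size=31, sigma=4.0, u_p=16, v_p=0, N=128, M=128)
print(g.shape, g.sum())     # an oriented band-pass kernel; its mean is near (not exactly) zero
```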
Filter Center Frequencies
and Bandwidths
• Option 1:
– Constant (linear) filter separation
– Constant linear bandwidths
• Option 2:
– Constant logarithmic (octave) filter separation
– Constant logarithmic (octave) bandwidths
422
Octave Filter Bank
• Given a 1st bandpass filter G_1(i) = H(i)cos(u_1 i) with center frequency u_1 = (u_1,hi + u_1,lo)/2.
• For p ≥ 1, define G_{p+1}(i) = H(i/2)cos(2u_p i). Its center frequency is 2u_p, and its lower / upper bandlimits are u_{p+1,lo} = 2u_{p,lo} and u_{p+1,hi} = 2u_{p,hi}.
• The (octave) bandwidths are the same:

  log2(2u_{p,hi}/2u_{p,lo}) = log2(u_{p,hi}/u_{p,lo})

(axis sketch: u_{p,lo}, u_p, u_{p,hi} and 2u_{p,lo}, u_{p+1} = 2u_p, 2u_{p,hi})
423
Filter Separation
• The basic idea is to cover frequency space.
• With filter responses that are large enough – e.g., at half-peak
amplitude* or greater.
• Adjacent filters can be defined to intersect at half-peak. This
ensures high SNR. (Note special handling of “baseband” filter)
*Half-peak bandwidth is a common image processing bandwidth measure (as opposed to half-power / 3dB BW).
424
Half-Peak Intersection Example
• Consider a Gabor bandpass filter:

  G̃_p(u) = exp[-2(πσ)^2 (u - u_p)^2]

and the adjacent filter 1 octave away:

  G̃_{p+1}(u) = exp[-2(πσ/2)^2 (u - 2u_p)^2]

• Exercise: Show that these filters intersect at half-peak if

  u_p = 3 √[ln 2 / (2(πσ)^2)]

• Also show that these filters have 1 octave (half-peak) BW.
425
Filter Orientations
• The tesselation of the 2-D frequency plane involves sets of filters
having equal radial c.f.s and equal orientations.
• The filters typically have equal radial and orientation BWs.
• Adjacent filters can again intersect at half-peak.
426
Constant Octave Gabor Filterbank
•
•
•
•
•
Real-valued filters
1 octave filters
Half-peak intersection
9 orientations
4 scales
•
•
•
•
•
•
Complex-valued filters
1 octave filters
Half-peak intersection
8 orientations
5 scales
“Extra” filters at corners
427
Convolutional Neural Networks
or
ConvNets
or
CNNs
428
ConvNets
• Actually an old idea from the 1990s generally attributed to Yann
Lecun.
• The simple idea is to limit the number of inputs from prior
layers that each neuron receives.
• This creates layers that are only partially connected instead of
fully connected.
429
Revisit the Multilayer Perceptron
• Recall the basic MLP from earlier with a single hidden layer:

  (diagram: input i = {i(1), …, i(P)} → input layer → weights w_1 → hidden layer computing f(Σ_{r=1}^{R} w_q(r) i_q(r)) → weights w_2 → output layer → outputs y)

• But let's feed it image pixels, instead of "handcrafted" features.
• For a small 1024 x 1024 (P = 2^20 pixels) image, the number of weights w_2 in the hidden layer is PQ, where Q is the number of nodes in the hidden layer. If P = Q, that's 2^40 ≈ 1.1 x 10^12 ≈ 1.1 trillion weights to optimize via backprop!!!
430
Convolutional Layer Idea
• Suppose we only feed a few image pixels to each node:

  (diagram: the zero-padded input i(1), …, i(P) feeds identical hidden nodes, each connected to only a few neighboring inputs)

• Notice the endpoint zero padding.
• The nodes are identical.
• Same number of nodes as inputs.
• The weight set used for the hidden layer is called the filter.
• In fact, suppose we only feed 3 pixel values to each hidden node. Then if P = Q there are only 3P weights to optimize.
• For the same image that's 3 x 2^20 ≈ 3.1 million … lots better, but still many!
• However, in a convolution layer, each group of weights is constrained to be identical. So in this example there are only 3 different weights (+1 bias = 4).
431
Stride Length
• Don't have to have as many filters as inputs. Can skip by a fixed stride:

  (diagram: the same zero-padded input, but hidden nodes are placed only at every other input position)

• Effectively subsampling the full output. Without subsampling, stride = 1. In the above, stride = 2. The stride can be larger to reduce computation.
• As before, each node includes a nonlinear activation function.
432
2-D Convolutional “Receptive Field”
• Every node in the hidden layer receives inputs from a local neighborhood.
• These are weighted and summed.
• The weight set (filter) is identical for each node.
• If stride = 1, this is identical to 2D convolution with the filter.
• The input could be image pixel values or retinal sensors – similar to the feed-forward connections from neurons in cortex to other neurons. CNNs are often called "biologically inspired."
433
Stride of 2 (Most Common)
• In a 2-D image, if stride = S, then the computation is divided by S^2.
• Usually best to retain full resolution in the early layers (we will soon stack these!) to extract as much low-level feature information as possible.
434
Simpler Diagram
(diagram: an N x M image is convolved with an m x n filter; each local sum passes through an activation, producing the output fed to the next layer)

• The output is called a feature map. It is fed to the next layer.
• When stacked, the early layers extract low-level features.
• Zero padding the input ensures the feature map will also be N x M (stride = 1).
435
Feature Maps
• A convolutional layer produced by a fixed filter only contains one feature type – say oriented in some direction, at some scale.
• The network may "decide" to learn a specific 2D bandpass filter (and in fact, they do!).
• One feature map is not enough!
• Instead, generate multiple (Q) feature maps, each generated with a different learned filter.
• This multiplies the computation, and also multiplies the data volume, but a lot more info is obtained!

(diagram: Q filters applied to the N x M image, with sums & activations, yield Q feature maps; here Q = 6)
436
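A minimal NumPy sketch of a single convolutional layer: zero padding, one shared filter per feature map, a stride, and a ReLU activation (the filter sizes, stride, and Q are my own example values; real CNN layers also include a bias per filter):

```python
import numpy as np

def conv_layer(image, filters, stride=1):
    """image: (N, M); filters: (Q, k, k). Returns Q feature maps with ReLU activation."""
    Q, k, _ = filters.shape
    pad = k // 2
    padded = np.pad(image, pad)                       # zero padding keeps size at stride 1
    N, M = image.shape
    out_n = (N - 1) // stride + 1
    out_m = (M - 1) // stride + 1
    maps = np.zeros((Q, out_n, out_m))
    for q in range(Q):
        for a, i in enumerate(range(0, N, stride)):
            for b, j in enumerate(range(0, M, stride)):
                patch = padded[i:i + k, j:j + k]      # local receptive field
                maps[q, a, b] = np.sum(patch * filters[q])
    return np.maximum(maps, 0.0)                      # ReLU activation

rng = np.random.default_rng(0)
image = rng.random((32, 32))
filters = rng.standard_normal((6, 3, 3))              # Q = 6 learned (here random) 3x3 filters
features = conv_layer(image, filters, stride=2)
print(features.shape)                                  # (6, 16, 16)
```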
Stacking Layers
• For multiple layers, simplify the notation, omitting flow lines, filters, and sum/activation nodes.
• Example Network: we stack multiple convolutional layers (each with multiple filters and activations). We have also added some things. Let's explain.

  (diagram: an N x N (x3) image passes through stacked convolution + activation and pooling stages, producing feature volumes of sizes N x N x Q, (N/2) x (N/2) x 2Q, (N/4) x (N/4) x 4Q, (N/8) x (N/8) x 4Q, (N/16) x (N/16) x 4Q, and (N/32) x (N/32) x 4Q, followed by fully connected + activation layers of sizes 1 x 1 x (N/64)^2 Q and 1 x 1 x K, and a softmax over the K outputs)
437
Multiple Channels
• The input is color, so it has 3 color channels (such as RGB).
• Hence each filter does also: if a filter is P x P, then it is really a 3 x P x P filter.
• The diagram remains the same, but the computation is threefold.

  (same example network diagram as the previous slide)
438
Stacked Convolutions
• Stacking convolutions allows for small filter kernels (few weights).
• Repeated layers "reach" further, becoming less local.
• Similar to repeated linear filtering – except each layer has an
activation function.
[Diagram: image N x N x 3 → N x N x Q → N x N x Q (two stacked convolutional layers).]
439
Activation Functions
• Classic sigmoid functions can be used – but for modern
Convnets / CNNs, other activation functions are more
effective.
• Training multi-layered ConvNets requires backprop, which
uses gradient descent.
• Convergence problems arise, like the vanishing gradient
problem. If the gradient approaches zero, convergence slows
or halts!
440
Activation Functions
• The most popular is "ReLu," or Rectified Linear Unit:
f(x) = x for x > 0, and f(x) = 0 for x ≤ 0
with derivative
f′(x) = 1 for x > 0, and f′(x) = 0 for x ≤ 0.
• These A.F.'s can speed up training many-fold.
• Vanishing gradient reduced by not limiting the range (e.g., to [0, 1]).
• Another popular one is the Exponential Linear Unit (ELU):
f(x) = x for x > 0, and f(x) = a(e^x − 1) for x ≤ 0
with derivative
f′(x) = 1 for x > 0, and f′(x) = f(x) + a for x ≤ 0.
441
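A minimal NumPy sketch of these activation functions and their derivatives, assuming the piecewise forms written above (function names are my own):

```python
import numpy as np

def relu(x):
    # f(x) = x for x > 0, 0 otherwise
    return np.maximum(x, 0.0)

def relu_grad(x):
    # f'(x) = 1 for x > 0, 0 otherwise
    return (x > 0).astype(float)

def elu(x, a=1.0):
    # f(x) = x for x > 0, a(e^x - 1) otherwise
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def elu_grad(x, a=1.0):
    # f'(x) = 1 for x > 0, f(x) + a = a e^x otherwise
    return np.where(x > 0, 1.0, a * np.exp(x))

x = np.linspace(-3, 3, 7)
print(relu(x))
print(elu(x))
```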
More Activation Functions
442
Pooling Layers
• Another way to reduce network complexity that is often used.
• Another way to expand the influence of each filter’s receptive
field across the image.
• Appropriate after the earliest layers when the network focus is
no longer on extracting local low-level features.
• Instead, the network begins to build higher-level abstractions.
443
Max and Average Pooling
• Simple: Partition the most recent feature map(s) into non-overlapping P x P
blocks (Stride = P). Usually P = 2.
• Max-Pool: Replace each P x P block with a single value max{block}:
12 20 30  4
 8 12  4  2          2x2            20 30
34 70 37 14        Max-Pool         96 37
88 96 25 12
Tends to propagate sharp, sparse features, like edges and contours.
• Ave-Pool: Replace each P x P block with a single value ave{block}:
12 20 30  4
 8 12  4  2          2x2            13 10
34 70 37 14        Ave-Pool         72 22
88 96 25 12
Tends to propagate a similar, smooth version of the prior layer.
• Both help reduce "over-fitting" by providing an abstracted form of the representation,
and by reducing positional dependence of the representation (translation invariance).
444
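A small NumPy sketch of 2x2 max- and average-pooling, applied to the 4x4 block used in the example above (the function name is my own):

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """2x2 pooling with stride 2 on a feature map whose sides are even."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)     # group pixels into 2x2 blocks
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

A = np.array([[12, 20, 30,  4],
              [ 8, 12,  4,  2],
              [34, 70, 37, 14],
              [88, 96, 25, 12]], dtype=float)

print(pool2x2(A, "max"))   # [[20. 30.] [96. 37.]]
print(pool2x2(A, "avg"))   # [[13. 10.] [72. 22.]]
```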
Flattening and Fully Connected Layers
• In the final stages of a CNN, fully connected networks or FCNs
(the same MLPs as earlier) are usually applied.
• Since the data has been down-sampled and abstracted, it is manageable
with today's hardware.
1 1
N N
  4Q
32 32
2
1  1 K
N
Q
64
K
outputs
• The output of the previous layer is first flattened into a 1-D vector, to
format it as input to the FCN. Everything is connected to everything, so
nothing (such as spatiality) is lost.
• The image features have been converted into higher-level abstractions, on
which the FC network performs the final inferencing.
445
Softmax
• The final FC network generates K outputs y = {y(k); k = 1, …, K},
corresponding to K classes in the problem to be solved.
• The softmax (or softargmax*) function converts these into probabilities by
mapping K non-normalized outputs of the last FC network to a probability
distribution over the K predicted output classes.
• If for image i(j) (train or test) the outputs are y_k(j), k = 1, …, K, the softmax
outputs are
p_k(j) = e^{y_k(j)} / Σ_{m=1}^{K} e^{y_m(j)}   for k = 1, ..., K.
• These sum to one and may be thought of as probabilities.
*Suppose y(k0) = max{y(k); k = 1, …, K}. Then arg max{y(k); k = 1, …, K}
= {δ(k − k0); k = 1, …, K} = {0, 0, 0, …, 0, 1, 0, …, 0}.
446
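A short NumPy sketch of the softmax mapping; subtracting the maximum output first is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(y):
    """Map K raw network outputs y(k) to probabilities p(k) = e^{y(k)} / sum_m e^{y(m)}."""
    z = y - np.max(y)      # stability shift; cancels in the ratio
    e = np.exp(z)
    return e / e.sum()

y = np.array([2.0, 1.0, 0.1])
p = softmax(y)
print(p, p.sum())          # probabilities summing to 1
```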
Cross Entropy Loss Function
• Assume multi-class classification: Given a set of known
ground truth training labels T = {t_j; 1 ≤ j ≤ J} of input training
images I = {i(j); 1 ≤ j ≤ J}.
• K = 2 is the simplest case, e.g., face / no face.
• Each label takes one of K values t_j ∈ {1, …, K} (multi-class).
For example, in images these could index {dog, cat, bird …}.
• Let the output (predicted class probabilities) of the J training
images i(j) be given by pk(j), k = 1, …, K.
• Then the cross-entropy loss or log-loss between ground truth
and predictions is given by
CE = −(1/J) Σ_{j=1}^{J} Σ_{k=1}^{K} δ(j, k) log p_k(j)
where δ(j, k) = 1 if t_j = k and δ(j, k) = 0 otherwise.
• A good way to generate the probabilities p_k(j) is softmax.
447
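A small NumPy sketch of the cross-entropy loss for integer class labels (here 0-based, unlike the 1-based indexing on the slide); names are my own:

```python
import numpy as np

def cross_entropy(p, t):
    """Cross-entropy (log-loss) over J training images.
    p: J x K array of softmax outputs p_k(j); t: length-J integer labels in {0, ..., K-1}."""
    J = len(t)
    # delta(j, k) simply picks out the probability assigned to the true class of image j
    return -np.log(p[np.arange(J), t]).sum() / J

p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
t = np.array([0, 1])
print(cross_entropy(p, t))   # small loss: both images assign high probability to the true class
```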
VGG-16
• You have learned all of the elements of a very famous deep learning network or
DLN called VGG-16.
• Designed for 1000-object classification, has done well in many image analysis
contests and remains very popular.
• Many variations and deeper ones, but these are the original “hyperparameters”:
1 11000
14  14  512
1 1 4096
56  56  256
image 224  224  64
224x224(x3)
•
7  7  512
28  28  512
112  112 128
In VVG nets, the filter size is 3x3
1000
outputs
convolution
+ activation
fully connected
+ activation
softmax
pooling
448
Comments
• Later we will use filter banks and ConvNets
extensively … onward to Module 5.
449
Module 5
Image Denoising & Deep
Learning
 Median, Bilateral, Non-local Mean Filters
 Wavelet Soft Thresholding
 BM3D
 Deep Learning, Transfer Learning
 Autoencoders, Denoising Networks
QUICK INDEX
450
MEDIAN FILTER
• The median filter is a nonlinear filtering device that is related
to the binary majority filter (Module 2).
• Despite being “automatic” (no design), it is very effective for
image denoising.
• Despite its simplicity, it has an interesting theory that
justifies its use.
451
Median Filter
• Given image I and window B, the median
filtered image is:
J = MED[I, B]
• Each output is the median of the windowed set:
J(i, j) = MED{B I(i, j)}
= MED{I(i−m, j−n); (m, n) ∈ B}
452
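A brute-force NumPy sketch of the median filter with a square window B (the edge-replication border handling is my choice, not specified on the slide):

```python
import numpy as np

def median_filter(I, half=1):
    """Median filter with a square (2*half+1) x (2*half+1) window B,
    using edge replication at the borders."""
    Ipad = np.pad(I, half, mode="edge")
    J = np.zeros(I.shape, dtype=float)
    for i in range(I.shape[0]):
        for j in range(I.shape[1]):
            J[i, j] = np.median(Ipad[i:i + 2*half + 1, j:j + 2*half + 1])
    return J

I = np.array([[10, 10, 10, 10],
              [10, 255, 10, 10],     # a large-amplitude noise impulse
              [10, 10, 10, 10],
              [10, 10, 10, 10]])
print(median_filter(I))              # the impulse is removed
```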
Properties of the Median Filter
• The median filter smooths additive white noise.
• The median filter tends not to degrade edges.
• The median filter is particularly effective for
removing large-amplitude noise impulses.
DEMO
453
BILATERAL FILTER
454
BILATERAL FILTER
• Another principled approach to image denoising.
• It modifies linear filtering.
• Like the median filter, it seeks to smooth while
retaining (edge) structures.
455
Bilateral Filter Concept
• Observation: 2-D linear filtering is a method of
weighting by spatial distances [i = (i, j), m = (m, n)]:
J_G(i) = Σ_m I(m) G(‖m − i‖)
• If linear filter G is isotropic, e.g., a 2-D gaussian LPF*
G(i, j) = K_G exp[−(i² + j²)/s_G²]
then
‖m − i‖ = √[(m − i)² + (n − j)²]
is Euclidean distance.
*The value of KG is such that G has unit volume.
456
Bilateral Filter Concept
• New idea: Instead of weighting by spatial distance,
weight by luminance similarity
J_H(i) = Σ_m I(m) H(I(m) − I(i))
• NOT a linear filter operation. H could be a 1-D
gaussian weighting function*
H(i) = K_H exp(−i²/s_H²)
• Not very useful by itself.
*The value of KH is such that H has unit area.
457
Bilateral Filter Definition
• Combine spatial distance weighting with luminance
similarity weighting:
J(i) = Σ_m I(m) G(‖m − i‖) H(I(m) − I(i))
• Computation is straightforward and about twice
as expensive as just spatial linear filtering.
• However, this can be sped up by pre-computing all
values of H for every pair of luminances, and
storing them in a look-up table (LUT).
458
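A direct (slow) NumPy sketch of the bilateral filter. One common implementation detail is added here: the combined weights are normalized by their sum at each pixel, which plays the role the constants K_G and K_H play in the slide's formulas. Parameter names follow the slide (s_G, s_H); everything else is my own.

```python
import numpy as np

def bilateral_filter(I, half=2, sG=2.0, sH=20.0):
    """Bilateral filter: weight each neighbor by spatial distance (gaussian G)
    and by luminance similarity (gaussian H), then normalize by the total weight."""
    m, n = np.mgrid[-half:half + 1, -half:half + 1]
    G = np.exp(-(m**2 + n**2) / sG**2)                 # spatial weights (fixed)
    Ipad = np.pad(I.astype(float), half, mode="edge")
    J = np.zeros_like(I, dtype=float)
    for i in range(I.shape[0]):
        for j in range(I.shape[1]):
            patch = Ipad[i:i + 2*half + 1, j:j + 2*half + 1]
            H = np.exp(-(patch - I[i, j])**2 / sH**2)  # luminance-similarity weights
            w = G * H
            J[i, j] = (w * patch).sum() / w.sum()
    return J

I = np.repeat([[0, 0, 0, 100, 100, 100]], 6, axis=0) + np.random.randn(6, 6) * 5
print(np.round(bilateral_filter(I), 1))               # noise smoothed, edge preserved
```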
Example
• Noisy edge filtered by bilateral filter (Tomasi, ICCV 1998).
• Uses gaussian LPF and gaussian similarity
Before
Bilateral weighting 2
pixels to right of edge
After
459
Color Example
• Can extend to color using 3-D similarity weighting.
• If I = [R, G, B], then use
H[R(m)-R(i), G(m)-G(i), B(m)-B(i)]
where H(i, j, k) = K_H exp[−(i² + j² + k²)/s_H²]
Before                         After
460
Comments on Bilateral Filter
• Similar to median filter, emphasizes luminances close
in value to current pixel.
• No design basis, so best used with care, e.g.,
retouching or low-level noise smoothing.
• The idea of doing linear filtering that is decoupled
near edges is very powerful.
• This is also the theme of the other methods.
461
NON-LOCAL (NL) MEANS
462
NL Means Concept
• Idea: Estimate the current pixel luminance as a
weighted average of all pixel luminances
• The weights are decided by neighborhood
luminance similarity.
• The weight changes from pixel to pixel.
• In all the following, i = (i, j), m = (m, n), p = (p, q).
• The method is expensive but effective.
463
NL Means Concept
• Given a window B, compute the luminance similarity
of every windowed set BI(i) with every other
windowed set BI(m) in the image:
W(m, i) = K_W exp[−‖B I(i) − B I(m)‖² / s_W²]
where
‖B I(i) − B I(m)‖² = Σ_{p∈B} [I(i + p) − I(m + p)]²
= Σ_{(p,q)∈B} [I(i + p, j + q) − I(m + p, n + q)]²
464
[Figure: an image with the reference patch B I(i) marked, and several other patches labeled "highly similar," "moderately similar," and "highly dissimilar" to it.]
465
NL Means Definition
• Given a window B, and the weighting function W just
defined, the NL-means filtered image I is
J  i    W  m, i  I  m 
m
• Thus each pixel value is replaced by a weighted sum of
all luminances, where the weight increases with
neighborhood similarity.
466
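A brute-force NumPy sketch of NL-means on a tiny image; as in the bilateral sketch, the weights are normalized by their sum (the role of K_W). This version compares every pixel to every other pixel, which is why practical implementations restrict the search region.

```python
import numpy as np

def nl_means(I, half=1, sW=10.0):
    """Non-local means: each output pixel is a weighted average of all pixels,
    with weights driven by the similarity of their surrounding patches."""
    I = I.astype(float)
    N, M = I.shape
    Ipad = np.pad(I, half, mode="edge")
    # collect the patch around every pixel as a flat vector
    patches = np.array([Ipad[i:i + 2*half + 1, j:j + 2*half + 1].ravel()
                        for i in range(N) for j in range(M)])
    flat = I.ravel()
    J = np.empty(N * M)
    for p in range(N * M):
        d2 = ((patches - patches[p])**2).sum(axis=1)   # patch distances to pixel p's patch
        W = np.exp(-d2 / sW**2)
        J[p] = (W * flat).sum() / W.sum()              # weighted average of all luminances
    return J.reshape(N, M)

I = np.tile([0, 0, 100, 100], (4, 1)) + np.random.randn(4, 4) * 3
print(np.round(nl_means(I), 1))
```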
Example
Before
After
Difference
467
Example
Before
After
Difference 468
Comments
• NL-Means is very similar in spirit to frame averaging.
• The main drawback is the large computation / search.
• It can be modified to compare/search less of the image.
• It works best when the image contains a lot of
redundancy (periodic, large smooth regions, textures).
• It can fail at unique image patches.
469
Comparison
AWGN
Gaussian filtered
Bilateral filtered
NL-Means
470
BM3D
471
Overview
• A complicated algorithm – too much! Uses many of the ideas we
have covered.
• Uses concepts of NL Means, sparse coding, and more.
• Image broken into blocks, blocks then matched by similarity
into groups.
• Estimates are collaboratively formed for each block within each
group (collaborative filtering).
• This is done by a transformation of the group and a shrinkage
process.
• Perhaps the best denoising algorithm extant.
Dabov et al, “Image denoising by sparse 3D transform-domain collaborative filtering,”
IEEE Trans on Image Processing, Aug 2007.
472
Examples
Video
473
Deep Learning Networks
474
Deep Learning
• We have already arrived at Deep Learning! VGG-16 is a
classical Deep Learning neural net.
• Once it was regarded as “very deep,” but not so now.
• Notice in the diagram that there are no “handcrafted” features
extracted to feed the network. It is an “end-to-end” design,
meaning just pixels fed at one end, result at the other.
• Deep Learning Networks (DLNs) are just ConvNets, which
have been known about since the 1980s. So why the “hubbub”?
475
ImageNet Contest
• In the early 2000s, neural nets were still shallow (too much computation) and
still fed just a few extracted features.
• This changed when Geoffrey Hinton and his students Alex Krizhevsky and
Ilya Sutskever published a paper entitled
“ImageNet Classification with
Deep Convolutional Neural Networks”
• In 2012 they showed that an end-to-end 8-layer convnet (5 CNN and 3 FC)
significantly outperformed all computer vision image classifiers on the
standard ultimate testbed, “ImageNet.”
• ImageNet (at the time) contained 15 million human-labeled* images having
>20,000 class labels. They trained and tested on a subset of 1.2 Million
images having 1000 class labels.
• They accomplished this by adopting highly efficient Graphical Processing
Units (GPUs) originally designed for graphics/image manipulation.
*Labeled using the Amazon Mechanical Turk crowdsourcing platform
476
AlexNet
• Here is the DLN they used (from their paper):
[Diagram: AlexNet – image 224 x 224 (x3) → 55 x 55 x 96 → 27 x 27 x 256 → 13 x 13 x 384 → 13 x 13 x 384 → 13 x 13 x 256 → 4096 x 1 → 4096 x 1 → 1000 outputs.]
• Details
– 650,000 neurons and 60 million parameters
– ReLu activation to avoid "overfitting"
– Stride of 4 at input (input subsampling)
– Maxpooling of layers 1, 2 and 5
– Response normalization to [0, 1] after 1, 2, and 5
– New method called "Drop-Out" to avoid overfitting
– "Data Augmented" by also training on randomly shifted & horizontally flipped images
– 1000-way softmax at output
477
Overfitting
• A great problem is the lack of data needed to train large
networks while avoiding overfitting.
• Overfitting occurs when a network learns a model that is too
close to the training data, and therefore does not generalize
well to new data.
• Usually from too-little data (vs. # network
parameters) or training for too long.
• Some methods of handling:
– Data augmentation
– Early stopping
– Dropout
– Batch normalization
– Regularization
[Figure: blue and red are training data in two classes; the green curve is overfitted, while the black curve is "regularized" to avoid that.]
478
Handling Overfitting
• There is no simple rule of thumb regarding how much data is
needed to train a network with P parameters (although it is
sometimes said to be at least 1/10th the number of parameters).
• Data Augmentation. Increasing the amount of data artificially
(when enough real data is not available). This involves reusing
images by flipping, rotating, scaling, and adding noise to create
"new" images. Care is needed but it works!
• Early stopping. A lot of theory, but usually empirical / ad hoc
in application. Typically, simply stop the training when some
measure of convergence is observed.
• Dropout. Idea: Neighboring neurons rely on each other too
much, forming complex co-adaptations that can overfit,
especially in parameter-heavy FC layers. Method: For each
input, randomly fix P% of neurons to zero. In AlexNet, P = 50.
479
Batch Normalization
• Batch normalization. Extends the idea of input normalization which we
discussed briefly in context of MLPs.
• Suppose the input is i_0(1), i_0(2), …, i_0(P). Then let
μ_0 = (1/P) Σ_{p=1}^{P} i_0(p)
σ_0² = (1/P) Σ_{p=1}^{P} [i_0(p) − μ_0]²
• Input normalization is training and testing on
î_0(p) = [i_0(p) − μ_0] / √(σ_0² + ε)
• In batch normalization, this is instead applied
– At other (possibly all) network layer inputs
– Normalizing over all the data (batch) or over subsets (mini-batches)
• Many benefits:
– Improves convergence
– Reduces overfitting / improves generalization
– Greatly reduces vanishing gradient problem
– Often dropout not needed
480
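A minimal NumPy sketch of the normalization step itself (real batch-norm layers also learn a per-feature scale and shift, which is omitted here):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalize a batch of layer inputs to zero mean and unit variance,
    feature by feature: x_hat = (x - mu) / sqrt(sigma^2 + eps)."""
    mu = x.mean(axis=0)                 # per-feature mean over the (mini-)batch
    var = x.var(axis=0)                 # per-feature variance over the (mini-)batch
    return (x - mu) / np.sqrt(var + eps)

batch = np.random.randn(32, 8) * 5.0 + 3.0    # 32 samples, 8 features, badly scaled
xn = batch_normalize(batch)
print(xn.mean(axis=0).round(3), xn.std(axis=0).round(3))   # ~0 and ~1
```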
Global Average Pooling (GAP)
• Much of the overfitting occurs in the parameter-heavy FC layers. GAP replaces
them by simply globally averaging each final feature map.
• Then input these responses to softmax.
• Greatly reduces overfitting while also greatly reducing model size.
• VGG-like network with GAP:
[Diagram: convolution + activation and pooling stages as before, but the final feature maps pass through GAP directly to a softmax producing the outputs.]
481
Deep Image
Regression Networks
482
Classification vs Regression
• Computer vision (big early application of DLNs) models often classify
(what class of object does this image contain?)
• Image processing models often create image-sized results like denoised
images, compressed images, quality maps, and so on.
•
Many of these are “image-to-image” transformations.
•
This generally involves regression instead of classification.
•
In regression the goal is to predict numerical values (better pixels,
distances/ranges, quality etc) from an image input (RGB, luminance,
YUV, etc).
483
General Image Regression
•
Assume a large collection of “before” and “after” images. Call these
I = {I1, I2, …., IN} and J = {J1, J2, …., JN}. N could be in the millions.
•
Most often each “after” image Jn is a changed version of a “before”
image In, which are assumed to have a desirable appearance.
•
The “after” images are the result of processing (intentional or
otherwise), such as:
– Noise added
– Compression by an algorithm
– Blur occurs
– Smoke, fog, or dust appear
– Pieces of the image are lost or cropped
– Image is dark or night falls
– Color is lost or distorted
• The "after" images can be other things like distance, quality, style, but
this requires other inputs also (later).
484
Prediction
• Basic idea: Predict the original image (remove noise, blur, smoke,
compression) using a DLN.
• How: Train the DLN on many examples of imperfect images, using
original images as labels. Simplest: pixelwise labels.
• Result: DLN (hopefully) learns to predict the labels (the desired
changed images). Not much thinking, but also not so easy!
•
Since the result is of the same size as the original images, little or no
downsampling used.
• Example:
485
Loss Functions
•
Given “before” and “after” (distortion or change) images I = {I1, I2, ….,
IP} and J = {J1, J2, …., JP} as before.
• The training network produces predictions F(J_n) of I_n. Optimizing the
network requires minimizing a loss like the MSE (or L2 loss):
MSE = ‖F(J_n) − I_n‖₂² = ε_2(F(J_n), I_n) = Σ_{i=1}^{N} Σ_{j=1}^{M} [F(J_n)(i, j) − I_n(i, j)]²
or the MAE (L1 loss):
MAE = ‖F(J_n) − I_n‖₁ = ε_1(F(J_n), I_n) = Σ_{i=1}^{N} Σ_{j=1}^{M} |F(J_n)(i, j) − I_n(i, j)|
• Both are common.
– L2 is more sensitive to “outliers” (wrong labels) while L1 is robust against them.
– L2 optimization is more stable and yields unique solutions. L1 less stable, and nonunique.
– L1 optimization often is sparse, yielding weights or “codes” that are mostly zeros.
•
When the appearance of the result matters, the Structural Similarity
Index (SSIM), which is differentiable, is often used:
SSIM  F  J n  , I n 
486
Residual Nets?
•
Again assume “before” and “after” images I = {I1, I2, …., IN} and J = {J1,
J2, …., JN}. A “residual network” is a simple and intuitive idea.
• Suppose: Instead of learning predictions F(J_n) of I_n, instead learn to
predict the residuals R(J_n) = F(J_n) − I_n b/w the "after" and "before"
images. The training residuals are the labels.
• The loss would be
RES-MSE = ε_R(R(J_n)) = Σ_{i=1}^{N} Σ_{j=1}^{M} [R(J_n)(i, j) − (J_n(i, j) − I_n(i, j))]²
= Σ_{i=1}^{N} Σ_{j=1}^{M} [(F(J_n)(i, j) − I_n(i, j)) − (J_n(i, j) − I_n(i, j))]²
• Residuals simpler, lower entropy (clustered around 0), easier to predict.
• Application Phase: Ooops! The network is trained to predict residuals,
which we do not have!
487
ResNet
• There is a way to apply the idea! Let the network be trained on the same
input (J_n) and output (I_n) training data. The goal is still to produce a
prediction F(J_n) → I_n, using a loss ‖F(J_n) − I_n‖.
•
This network learns the same mapping:
Figure adapted from the famous paper “Deep
Residual Learning for Image Recognition”
by He, Zhang, Ren, and Sun
• However, the weighting layers instead actually learn the residuals
G(J_n) = F(J_n) − J_n → I_n − J_n
• The secret: A “skip connection” whereby Jn bypasses one or more layers,
then added before activation.
•
This can be done every few layers.
488
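A toy NumPy sketch of a residual block with a skip connection; fully connected layers are used here instead of convolutions to keep it short, and the layer names and sizes are my own:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Two weight layers with a skip connection: the input x bypasses the layers
    and is added back just before the final activation, so the weight layers
    only have to learn the residual."""
    h = relu(W1 @ x)          # first weight layer + activation
    g = W2 @ h                # second weight layer (the learned residual)
    return relu(g + x)        # skip connection: add the input, then activate

d = 8
x = np.random.randn(d)
W1 = np.random.randn(d, d) * 0.1
W2 = np.random.randn(d, d) * 0.1
print(residual_block(x, W1, W2))
```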
34-layer (standard and ResNet)
489
ResNet
• ResNet is an extremely powerful and popular idea!
• One of the most-cited papers ever!
• However, in practice it is used differently for even better
results.
• The residual idea can be applied throughout the network
(every few layers) to powerfully reduce overfitting and
improve convergence.
• The ResNet idea is used in other network designs that use
different connectivities, scales, and skip connections, such as
DenseNet, and GoogLeNet/Inception. These keep
evolving.
490
ResNet
• ResNet has made possible very, very deep networks
(hundreds of layers deep), which was not possible because
of the degradation problem.
• The performance of standard networks decreases with great
depth, because of vanishing gradients, poor convergence, and
worse accuracy.
• Developers now routinely use “ResNet-50,” “ResNet-101,”
and “ResNet-152,” for example.
• ResNet models have won nearly every image detection,
recognition, segmentation, classification, and regression
contest!
491
Transfer Learning
492
Transfer Learning
•
The biggest problem in deep learning is obtaining enough data. For very
deep networks, even when parameter-efficient, very large datasets are needed
which are often not available.
•
As we know they can be trained end-to-end w/o feature computation.
Only pixels need be fed, if the network is large enough and the data volume
is large enough. However, often there is not nearly enough data.
•
However, a property of DLNs trained for image processing is their
remarkable generality. This is very useful when there is not enough data.
•
These features can be remarkably generic! Can often use a trained network
(with learned feature outputs) to conduct another visual task.
•
This might involve “fine tuning” the “pre-trained” network and/or adding
additional layers (or an SVC/SVR) that are subsequently trained.
•
Called TRANSFER LEARNING.
493
Transfer Learning (Fine Tuning)
[Diagram: Source task (very large data, e.g. ImageNet) + labels → pre-train a DLN (knowledge). Target task (much less data) + new labels → fine-tune the pretrained DLN → predictions.]
494
Transfer Learning (New Layers)
[Diagram: Source task (very large data, e.g. ImageNet) + labels → pre-train a DLN (knowledge). Target task (much less data) + new labels → keep the pretrained DLN and train new layers / an SVR on top → predictions.]
495
Transfer Learning
•
Transfer learning is one of the most common ways to address most image
processing / analysis tasks.
•
Most of which have much more limited data than ImageNet.
•
Typically, a network designer will begin with an AlexNet, VGG-16, ResNet-20,
-50, -100, -150, Inception/GoogLeNet, or any other large network.
•
Some of these now have 1000+ layers!
•
In application examples we will often see this.
496
DLN Training Environments
497
DLN Training Environments
• There are many programming environments for training DLNs.
• Some of these have become quite popular.
• These include
– Tensorflow
– Caffe
– Keras
– Pytorch
– Matlab
• Currently, Facebook’s Pytorch is favored amongst DLN R&D
engineers.
498
Convolutional Autoencoders
499
Autoencoders
•
An autoencoder is a special variety of neural network (or DLN).
•
It is generally unsupervised!
•
They have the same architecture as ordinary MLPs or DLNs, but they can be
divided into two parts.
•
The first part is like an ordinary network: multiple layers of down sampling
to achieve efficient representation requiring much less data / dimensionality.
•
The second part follows with multiple layers of up sampling – generally
mirroring the front stage of layers.
•
The point of transition where efficiency is achieved is the “bottleneck.”
•
In our world, the output is an image that is intended to be close to or look like
the original. Hence the loss function measures this (MSE, SSIM, etc).
500
Convolutional Autoencoder
[Diagram: convolutional autoencoder – input image → encoder (convolution + activation, pooling) → bottleneck: code or latent representation → decoder (convolution + activation, upsampling) → reconstructed image.]
501
Convolutional Autoencoders
•
The basic idea is to learn an efficient, compact representation of the
structure of an image, from which it can be “reconstructed.”
•
It’s not just the spatial dimensions. In fact down/up sampling might be limited
or omitted. The number of filters also matter, i.e., the size of the whole
feature space at the bottleneck.
• Autoencoders can be
– Overcomplete: the number of features in the code exceeds the input data size
– Complete: the number of features in the code is the same as the input data size
– Undercomplete: the number of features in the code is less than the input data size
• All three are very useful for different tasks! However, without further
constraints the first two might learn the identity function (not so useful).
•
All can be useful with additional constraints on the code.
•
For efficiency, undercomplete autoencoders are of greater interest.
502
Up Sampling
• Methods of up sampling include
– Nearest neighbor (pixel replication)
– Max unpooling (requires recalling where corresp. max pool value came from)
– Interpolation (bilinear, bicubic, learned)
1 2        1 1 2 2
3 4   →    1 1 2 2
           3 3 4 4
           3 3 4 4
Nearest Neighbor Unpooling
503
Regularized Autoencoders
• Regularization is a way of constraining the code at the
bottleneck, for various purposes.
• This includes avoiding the identity function, making the representation
efficient (a small code), creating high-information features, etc.
• Given a network loss function
ε(F(J_n), I_n) = Σ_{i=1}^{N} Σ_{j=1}^{M} [F(J_n)(i, j) − I_n(i, j)]²
where I_n are input images and F(J_n) are corresponding output
images (predictions), then modify the loss by appending a
penalty on the bottleneck code (activations) H_n:
ε(F(J_n), I_n) + λ·R(H_n).
504
Sparse Autoencoders
• Sparse regularization at the bottleneck is very commonly used:
R(H_n) = Σ_{k=1}^{K} |H_n(k)|
which is just the L1 norm of the bottleneck activations.
• L1-norm optimization causes the code to be sparse, in the
sense that many zero activations are created, leaving a very
sparse code of a few non-zero activations.
• The concept is very closely related to “lasso” optimization in
statistics (L1-L2 optimization).
• Actually, this sparsity constraint can be applied at any layer of
any DLN to enforce sparse representations (and they are).
• An obvious application of these ideas is image compression.
505
Typical Sparse Weights Visualized
• These look a lot like the Gabor functions and derivatives of
gaussians which are good approximations to the receptive
fields of human cortical neurons.
506
Case Study: Denoising
Autoencoder (DAE)
507
Denoising Autoencoder (DAE)
• A simple task is to adapt an autoencoder to perform denoising.
• Instead of optimizing the output to be equal or similar to the input, it is
optimized to equal a noise-free version of it.
• It's about training: Given many noise-free images {I_n}, make
noisy ones, e.g., add gaussian noise Î_n = I_n + N_n, or use another kind of
noise: multiplicative, signal-dependent, simulated, etc.
• The autoencoder can be sparse, often producing better results.
508
Denoising Autoencoder
• Training Phase:
[Diagram: pristine training images + noise → noisy training images → autoencoder → output compared against the pristine training images via loss L.]
• The network doesn’t have to be an autoencoder, of course. But
if so, it is a denoising autoencoder.
509
Denoising Autoencoder
• Application Phase:
[Diagram: noisy images → autoencoder → "denoised" images (simulated idea).]
• Since denoising does not seek an efficient (small)
representation, it turns out that overcomplete autoencoders do
a better job of denoising than undercomplete ones.
510
Simple Autoencoder Denoising Example
• Small, simple example of denoising images with an autoencoder.
• It operates on gray-scale 28x28 images.
• Architecture:
– Encoder: 2 conv layers, ReLu, 2x2 max pooling, 32 3x3 filters
each layer (image: 28² = 784 pixels; code size: 7x7x32 = 1568)
– Decoder: basically reverses. Same size corresponding filters,
2x2 upsampling.
• Training:
• Images (28x28x1) of handprinted numerals 0-9, using 55,000
training samples with added noise, 10,000 test samples.
• Trained over 25 epochs using cross-entropy loss.
511
Note the nice convergence of both the training
study and the validation study. This is very good since it means
that the DAE is generalizing and not overfitting.
512
• Pretty good results for such a simple method, but simple DAEs
struggle on “harder” images. More advanced DAEs (“stacked”
autoencoders, etc) do better and the field is changing very fast!
• This simple implementation and two figures are from S. Malik’s
teaching website here (with more details).
513
Case Study: Deep Residual
Denoiser
514
Deep Residual Denoiser
• This is a very competitive method.
• Basic architecture (Denoising CNN, or DnCNN):
• It uses a similar concept to the ResNet to learn a residual
image (the noise).
515
DnCNN Denoiser
• Remember the RES-MSE which we could not use (no
residuals available)?
RES-MSE = ε_R(R(J_n)) = Σ_{i=1}^{N} Σ_{j=1}^{M} [R(J_n)(i, j) − (J_n(i, j) − I_n(i, j))]²
= Σ_{i=1}^{N} Σ_{j=1}^{M} [(F(J_n)(i, j) − I_n(i, j)) − (J_n(i, j) − I_n(i, j))]²
• This time we can have residuals!
• In are clean images, and Jn are noisy images. Train to
estimate the residuals (noise) Jn  In.
516
DnCNN Architecture
• Has D layers. In their design the filter size is 3x3. Observe
that at layer d the effective span is (2d+1)x(2d+1).
• They use D = 17 (35x35 effective largest filters) to be
comparable with leading algorithms.
• They use ReLu but no pooling / downsampling.
• Filters:
– Layer 1: 64 filters (3x3xc, c = # colors)
– Layers 2 – (D-1): 64 filters (3x3x64 w/batch normalization)
– Layer D: c filters (3x3x64)
517
Numerical Predictive Performance
• In terms of PSNR.
• Four plots:
– With/without batch normalization (BN)
– With/without residual learning (RL) idea (versus just estimating In)
– Clearly shows great benefit of both, especially combined
518
Some Examples
• Trained and tested on various datasets. See the paper.*
Original
Noisy
Color BM 3D
Color DnCNN
*Zhang et al, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on
Image Processing, Feb. 2017.
519
Some Examples
• They also tried it on JPEG blocking “noise,” and compared
against state-of-the-art networks “AR-CNN” and “TNRD” (see the
paper):
Original
AR-CNN
TNRD
DnCNN
• The topic of “de-blocking” is also a hot research area. They used
this to show the generality of their method.
520
Comments
• We now move to a big topic – image
compression… onward to Module 6..
521
Module 6
Image Compression
 Lossless Image Coding
 Lossy Image Coding
 JPEG Image Compression Standards
 Deep Compression
QUICK INDEX
522
OBJECTIVES OF IMAGE
COMPRESSION
• Create a compressed image that "looks the same" (when
decompressed) but can be stored in a fraction of the space.
• Lossless compression means the image can be exactly
reconstructed.
• Lossy compression means that visually redundant
information is removed. The decompressed image "looks"
unchanged, but mathematically has lost information.
• Image compression is important for:
- Reducing image storage space
- Reducing image transmission bandwidth
523
Significance for Storage & Transmission
This ...
[Diagram: a workstation transmitting a long sequence of HD images over a communication channel (wire, airwaves, optical fiber, etc.)]
... becomes this.
[Diagram: a workstation receiving the compressed sequence of HD images.]
Compressed images can be sent at an increased rate: More images per second !
524
Video phones and
video watches as
conceived in 1930s
525
Screens are Getting BIGGER
The “Flexpai”
Roll-up TV
526
527
Lossless Image Compression
528
Image Compression Measures
• Bits Per Pixel (BPP) is the average number of
bits required to store the gray level of each pixel
in I.
• Uncompressed image (K is gray scale range):
BPP(I) = log2(K) = B
• Usually
B = log2(256) = 8.
529
Variable-Bit Codes
• The number of bits used to code pixels may spatially vary.
• Suppose that Î is a compressed version of I.
• Let B(i, j) = # of bits used to code pixel I(i, j). Then
BPP(Î) = (1/NM) Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} B(i, j)
• If the total number of bits in Î is B_total, then
BPP(Î) = B_total / (NM)
530
Compression Ratio
• The Compression Ratio (CR) is
CR = BPP(I) / BPP(Î) ≥ 1
• Both BPP and CR are used frequently.
531
LOSSLESS IMAGE COMPRESSION
• Lossless techniques achieve compression with no loss of information.
•
The true image can be reconstructed exactly from the coded image.
•
Lossless coding doesn’t usually achieve high compression but has
definite applications:
- In combination with lossy compression, multiplying the gains
- In applications where information loss is unacceptable.
•
Lossless compression ratios usually in the range
2:1 ≤ CR ≤ 3:1
but this may vary from image to image.
532
Methods for Lossless Coding
• Basically amounts to clever arrangement of
the data.
• This can be done in many ways and in many
domains (DFT, DCT, wavelet, etc)
• The most popular methods use variable
wordlength coding.
533
Variable Wordlength Coding
(Huffman Code)
534
Variable Wordlength Coding (VWC)
• Idea: Use variable wordlengths to code gray levels.
• Assign short wordlengths to gray levels that occur
frequently (redundant gray levels).
• Assign long wordlengths to gray levels that occur
infrequently.
• On the average, the BPP will be reduced.
535
Image Histogram and VWC
• Recall the image histogram H_I(k):
[Figure: histogram H_I(k) plotted versus gray level k, for k = 0, …, K−1.]
• Let L(k) = # of bits (wordlength) used to code gray level k. Then
BPP(Î) = (1/NM) Σ_{k=0}^{K-1} L(k) H_I(k)
• This is the common measure of BPP for VWC.
536
Image Entropy
• Recall the normalized histogram values
p_I(k) = (1/NM) H_I(k) ; k = 0, ..., K−1
so p_I(k) = probability of gray level k
• The entropy of image I is then
E[I] = − Σ_{k=0}^{K-1} p_I(k) log₂ p_I(k)
537
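A short NumPy sketch of computing E[I] from the normalized histogram (names are my own):

```python
import numpy as np

def image_entropy(I, K=256):
    """Entropy E[I] = -sum_k p_I(k) log2 p_I(k), from the normalized histogram."""
    hist = np.bincount(I.ravel(), minlength=K)
    p = hist / I.size
    p = p[p > 0]                       # skip zero-probability gray levels (0 log 0 = 0)
    return -(p * np.log2(p)).sum()

flat = np.random.randint(0, 256, (64, 64))       # roughly flat histogram
const = np.full((64, 64), 128)                   # constant image
print(image_entropy(flat))    # close to 8 bits
print(image_entropy(const))   # 0 bits
```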
Meaning of Entropy
• Entropy is a measure of information with nice
properties for designing VWC algorithms.
• Entropy E[I] measures
– Complexity
– Information Content
– Randomness
538
Maximum Entropy Image
• The entropy is maximized when
p_I(k) = 1/K ; k = 0, ..., K−1
corresponding to a flat histogram. Then (K = 2^B)
E[I] = − Σ_{k=0}^{K-1} (1/K) log₂(1/K) = B Σ_{k=0}^{K-1} (1/K) = B
• Generally, image entropy increases as the
histogram is spread out.
539
Minimum Entropy Image
• The entropy is minimized when
p_I(k_n) = 1 for some k_n ; 0 ≤ k_n ≤ K−1
hence p_I(k_m) = 0 for m ≠ n.
• This is a constant image and E [I] = 0.
• In fact it is always true that
0 ≤ E [I] ≤ B
540
Significance of Entropy
•
An important theorem limits how much an image can be compressed by a
lossless VWC:
BPP(Î) ≥ E[I]
•
A VWC is assumed uniquely decodable.
•
The pI(k) are assumed known at both transmitter and receiver.
•
Thus an image with a flat histogram: E[I] = B cannot be compressed by a
VWC alone. Fortunately, we can often fix this situation by entropy
reduction.
•
Moreover, a constant image need not be sent: E[I] = 0!
•
The entropy provides a compression lower bound as a target.
541
Image Entropy Reduction
• Idea: Compute a new image D with a more compressed
histogram than I but with no loss of information.
Lossless invertible
transformation
H I (k)
high entropy image
H D (k)
reduced entropy image
• Must be able to recover I exactly from D - no loss of
information.
• Compressing the histogram (Module 3) won’t work, since
information is lost when two gray levels k1 and k2 are
mapped to the same new gray level k3.
542
Entropy Reduction by
Differencing (Simple DPCM)
• Differential pulse-code modulation (DPCM) is effective for lossless image
entropy reduction.
•
Images are mostly smooth: neighboring pixels often have similar values.
•
Define a difference image D using either 1-D or 2-D differencing:
(1-D) D(i, j) = I(i, j) - I(i, j-1)
(2-D) D(i, j) = I(i, j) - I(i-1, j) - I(i, j-1) + I(i-1, j-1)
for 0 ≤ i ≤ N-1, 1 ≤ j ≤ M-1.
•
The new histogram HD usually will be more compressed than HI, so that
E[D] < E[I]
•
This is a rule of thumb for images, not a math result.
543
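A small NumPy sketch of 1-D DPCM differencing and its exact inverse (the first column is carried unchanged, as described on the next slide):

```python
import numpy as np

def dpcm_1d(I):
    """1-D differencing: D(i, j) = I(i, j) - I(i, j-1) for j >= 1;
    the first column is kept as-is so the image can be recovered exactly."""
    D = I.astype(int).copy()
    D[:, 1:] = I[:, 1:] - I[:, :-1]
    return D

def dpcm_1d_inverse(D):
    """Reverse the differencing: I(i, j) = D(i, j) + I(i, j-1)."""
    return np.cumsum(D, axis=1)

I = np.clip(np.cumsum(np.random.randint(-3, 4, (4, 8)), axis=1) + 128, 0, 255)  # smooth-ish rows
D = dpcm_1d(I)
print(np.array_equal(dpcm_1d_inverse(D), I))     # True: lossless
print(np.ptp(I), np.ptp(D[:, 1:]))               # differences span a much smaller range
```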
Reversing DPCM
•
If (1) is used, then
I(i, j) = D(i, j) + I(i, j-1)
•
The first column of I must also be transmitted.
•
If (2) is used, then
I(i, j) = D(i, j) + I(i-1, j) + I(i, j-1) - I(i-1, j-1)
where the first row and first column of I must also be transmitted.
•
The overhead of the first row and column is small, but they can be
separately compressed.
• Hereafter we will assume an image I has either been histogram
compressed by DPCM or doesn’t need to be.
DEMO
544
Optimal Variable Wordlength Code
• Recall the theoretical lower bound using a VWC:
BPP(Î) ≥ E[I] = − Σ_{k=0}^{K-1} p_I(k) log₂ p_I(k)
• Observe: In any VWC, coding each gray-level k using
wordlength L(k) bits gives an average wordlength:
BPP(Î) = Σ_{k=0}^{K-1} p_I(k) L(k)
• Compare with the above. If L(k) = -log2pI(k) then an
optimum code has been found - lower bound attained!
545
Optimal VWC
• IF we can find such a VWC such that
L(k) = -log2pI(k) for k = 0 ,…, K-1
then the code is optimal.
• It is impossible if -log2pI(k) ≠ an integer for some k.
• So: define an optimal code Î as one that satisfies:
- BPP(Î) ≤ BPP(Ĩ) for any other code Ĩ
- BPP(Î) = E[I] if the [−log₂ p_I(k)] are integers
546
The Huffman Code
• The Huffman algorithm yields an optimum code.
• For a set of gray levels {0 ,.., K-1} it gives a set of
code words c(k); 0 ≤ k ≤ K-1 such that
BPP(Î) = Σ_{k=0}^{K-1} p_I(k) L[c(k)]
is the smallest possible.
547
Huffman Algorithm
• Form a binary tree with branches labeled by the gray-levels km
and their probabilities pI (km) :
(0) Eliminate any km where pI (km) = 0.
(1) Find 2 smallest probabilities pm = pI(km), pn = pI (kn).
(2) Replace by pmn = pm + pn to form a node; reduce list by 1
(3) Label the branch for km with (e.g.) '1' and for kn with '0'.
(4) Until list has only 1 element (root reached), return to (1).
• In step (3), values '1' and '0' are assigned to element pairs (km, kn),
elements triples, etc. as the process progresses.
548
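A compact sketch of the Huffman construction using Python's heapq. The exact 0/1 branch labels may differ from the slide's tree depending on tie-breaking, but the codeword lengths and BPP match Example 1 below; names are my own.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for gray levels with the given probabilities.
    Returns a dict {gray level: codeword string}; zero-probability levels are dropped."""
    # each heap entry: (probability, tie-breaker, {gray level: partial codeword})
    heap = [(p, k, {k: ""}) for k, p in enumerate(probs) if p > 0]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # the two smallest probabilities...
        p1, _, c1 = heapq.heappop(heap)
        merged = {k: "0" + c for k, c in c0.items()}   # ...are merged into one node,
        merged.update({k: "1" + c for k, c in c1.items()})  # labeling their branches 0 / 1
        heapq.heappush(heap, (p0 + p1, min(merged), merged))
    return heap[0][2]

# Example 1 from the slides
probs = [1/2, 1/8, 1/8, 1/8, 1/16, 1/32, 1/32, 0]
code = huffman_code(probs)
bpp = sum(probs[k] * len(c) for k, c in code.items())
print(code)
print("BPP =", bpp)    # 2.1875 bits, matching the slide
```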
Huffman Tree Example 1
• There are K = 8 values {0 ,.., 7} to be assigned
codewords:
pI(0) = 1/2
pI(1) = 1/8
pI(2) = 1/8
pI(3) = 1/8
pI(4) = 1/16
pI(5) = 1/32
pI(6) = 1/32
pI(7) = 0
• The process creates a binary tree, with values '1' and '0'
placed (e.g.) on the right and left branches at each stage:
549
Huffman Tree Example 1
k:        0     1     2     3     4      5      6
p_I(k):   1/2   1/8   1/8   1/8   1/16   1/32   1/32
c(k):     1     011   010   001   0001   00001  00000
L[c(k)]:  1     3     3     3     4      5      5
[Huffman tree: the two smallest probabilities are merged repeatedly (1/32 + 1/32 = 1/16, then 1/16 + 1/16 = 1/8, 1/8 + 1/8 = 1/4, 1/8 + 1/8 = 1/4, 1/4 + 1/4 = 1/2, 1/2 + 1/2 = 1), with '1' and '0' labeling the branches at each merge.]
BPP = Entropy = 2.1875 bits
CR = 1.37 : 1
550
Huffman Tree Example 2
• There are K = 8 values {0 ,.., 7} to be assigned
codewords:
pI(0) = 0.4
pI(1) = 0.08
pI(2) = 0.08
pI(3) = 0.2
pI(4) = 0.12
pI(5) = 0.08
pI(6) = 0.04
pI(7) = 0.00
551
Huffman Tree Example 2
k:        0     1      2      3     4     5      6
p_I(k):   0.4   0.08   0.08   0.2   0.12  0.08   0.04
c(k):     1     0111   0110   010   001   0001   0000
L[c(k)]:  1     4      4      3     3     4      4
[Huffman tree: successive merges 0.04 + 0.08 = 0.12, 0.08 + 0.08 = 0.16, 0.12 + 0.12 = 0.24, 0.16 + 0.2 = 0.36, 0.24 + 0.36 = 0.6, 0.4 + 0.6 = 1, with '1' and '0' labeling the branches at each merge.]
BPP = 2.48 bits
Entropy = 2.42 bits
CR = 1.2 : 1
552
Huffman Decoding
• The Huffman code is a uniquely decodable
code. There is only one interpretation for a
series of codewords (series of bits).
• Decoding progresses by traversing the tree.
553
Huffman Decoding Example
•
In the second example, this sequence is received:
00010110101110000010000100010110111010
•
It is sequentially examined until a codeword is identified. This continues until all
are identified:
0001 0110 1 0111 0000 010 0001 0001 0110 1 1 1 010
•
The decoded sequence is:
5 2 0 1 6 3 5 5 2 0 0 0 3
[Huffman code from Example 2, repeated for reference: c(0) = 1, c(1) = 0111, c(2) = 0110, c(3) = 010, c(4) = 001, c(5) = 0001, c(6) = 0000.]
554
Comments on Huffman Coding
• Huffman image compression usually
2:1 ≤ CR ≤ 3:1
• Huffman codes are quite noise-sensitive (much more so
than an uncompressed image).
• Error-correction coding can improve this but increases
the coding rate somewhat.
• Some Huffman codes aren’t computed from the image
statistics - instead assume “typical” frequencies of
occurrence of values (gray-level or transform values) – as
in JPEG.
555
Exercise
• Work through the two examples just given.
• But re-order the probabilities in various ways so that
they are not “descending.”
556
Lossy Image Compression
557
LOSSY IMAGE CODING
• Many approaches proposed.
• We will review a few popular approaches.
- Block Truncation Coding (BTC) – simple, fast
- Discrete Cosine Transform (DCT) Coding and
the JPEG Standard
- Deep Image Compression
558
Goals of Lossy Coding
• To optimize and balance the three C’s
– Compression achieved by coding
– Computation required by coding and decoding
– Cuality of the decompressed image
559
Broad Methodology of Lossy
Compression
• There have been many proposed lossy compression
methods.
• The successful ones broadly (and loosely) follow three steps.
(1) Transform image to another domain and/or extract
specific features. (Make the data more compressible, like DPCM)
(2) Quantize in this domain or those features. (Loss)
(3) Efficiently organize and/or entropy code the
quantized data. (Lossless)
560
Block Coding of Images
• Most lossy methods begin by partitioning the
image into sub-blocks that are individually coded.
Wavelet methods are an exception.
[Diagram: the image partitioned into a grid of M x M sub-blocks.]
561
Why Block Coding?
• Reason: images are highly nonstationary:
different areas of an image may have different
properties, e.g., more high or low frequencies,
more or less detail etc.
• Thus, local coding is more efficient. Wavelet
methods provide localization without blocks.
• Typical block sizes: 4 x 4, 8 x 8, 16 x 16.
562
Block Truncation Coding
(BTC)
563
Block Truncation Coding (BTC)
• Fast, but limited compression.
• Uses 4 x 4 blocks each containing 16 · 8 = 128 bits.
• Each 4 x 4 block is coded identically so no need to
index them.
• Consider the coding of a single block which we will
denote {I(1) ,..., I(16)}.
564
BTC Coding Algorithm
(1) Quantize block sample mean:
Ī = (1/16) Σ_{p=1}^{16} I(p)
and transmit/store it with B1 bits.
(2) Quantize block sample standard deviation:
σ_I = √[ (1/16) Σ_{p=1}^{16} (I(p) − Ī)² ]
and transmit/store it with B2 bits.
565
BTC Coding Algorithm
• Compute a 16-bit binary block:
b(p) = 1 if I(p) ≥ Ī ;  b(p) = 0 if I(p) < Ī
Example block (Ī = 98.75):        Binary block b:
121 114  56  47                   1 1 0 0
 37 200 247 255                   0 1 1 1
 16   0  12 169                   0 0 0 1
 43   5   7 251                   0 0 0 1
which requires 16 bits to transmit/store.
566
BTC Quantization
• The quality/compression of BTC-coded images depends on the
quantization of mean and standard deviation.
• If B1 = B2 = 8, 32 bits / block are transmitted:
CR = 128 / 32 = 4:1
• OK quality if B1 = 6, B2 = 4 (26 bits total):
CR = 128 / 26 ≈ 5:1
• Using B1 > B2 is OK: the eye is quite sensitive to the presence of
variation, but not to the magnitude of variation:
Mach Band Illusion
567
BTC Block Decoding
• To form the "decoded" pixels J(1), ..., J(16):
(1) Let Q = number of '1's, P = number of '0's in the binary block.
(2) If b(p) = 1 ; set J(p) = Ī + σ_I / A
    If b(p) = 0 ; set J(p) = Ī − σ_I · A
where
A = √(Q/P)
• It is possible to show that this forces J̄ ≈ Ī and σ_J ≈ σ_I
568
BTC Block Decoding Example
• In our example: Q = 7, P = 9, A = 0.8819
Ī = INT[98.75 + 0.5] = 99
σ_I = INT[92.95 + 0.5] = 93
• If b(p) = 1 ; set J(p) = 99 + INT[93/0.882 + 0.5] = 204
  If b(p) = 0 ; set J(p) = 99 − INT[93 · 0.882 + 0.5] = 17
Decoded block J:
204 204  17  17
 17 204 204 204
 17  17  17 204
 17  17  17 204
Here J̄ ≈ 98.8, σ_J ≈ 77.3
DEMO
569
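A NumPy sketch of BTC encoding and decoding, run on the 4x4 example block reconstructed above (Python's round() stands in for the slide's INT[x + 0.5]; the two agree for this example). Function names are my own.

```python
import numpy as np

def btc_encode(block):
    """BTC encode one 4x4 block: quantized mean, standard deviation, and bitmap."""
    mean = block.mean()
    std = block.std()                       # population standard deviation, as on the slide
    bits = (block >= mean).astype(int)      # 1 where the pixel is >= the block mean
    return int(round(mean)), int(round(std)), bits

def btc_decode(mean, std, bits):
    """Reconstruct a two-level block from (mean, std, bitmap) so that the decoded
    block approximately preserves the original mean and standard deviation."""
    Q = bits.sum()                          # number of 1s
    P = bits.size - Q                       # number of 0s
    A = np.sqrt(Q / P)
    hi = int(round(mean + std / A))
    lo = int(round(mean - std * A))
    return np.where(bits == 1, hi, lo)

block = np.array([[121, 114,  56,  47],
                  [ 37, 200, 247, 255],
                  [ 16,   0,  12, 169],
                  [ 43,   5,   7, 251]], dtype=float)

m, s, b = btc_encode(block)      # m = 99, s = 93, plus the 16-bit bitmap
print(m, s)
print(btc_decode(m, s, b))       # two-level block built from 204 and 17
```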
Comments on BTC
• Attainable compressions by simple BTC in
the range 4:1 – 5:1.
• Combining with entropy coding of the
quantized data, in the range 10:1.
• Popular in low-bandwidth, low-complexity
applications. Used by the Mars Pathfinder!
570
571
JPEG
572
JPEG
• The JPEG Standard remains the most widely-used method of
compressing images.
• JPEG = Joint Photographic Experts Group, an international
standardization committee.
• Standardization allows device and software manufacturers to
have an agreed-upon format for compressed images.
• In this way, everyone can “talk” and “share” images.
• The overall JPEG Standard is quite complex, but the core
algorithm is based on the Discrete Cosine Transform (DCT).
573
Discrete Cosine Transform (DCT)
• The DCT of an N x M image or sub-image:
Ĩ(u, v) = [4 C_N(u) C_M(v) / (NM)] Σ_{i=0}^{N-1} Σ_{j=0}^{M-1} I(i, j) cos[(2i+1)uπ / 2N] cos[(2j+1)vπ / 2M]
• The IDCT:
I(i, j) = Σ_{u=0}^{N-1} Σ_{v=0}^{M-1} C_N(u) C_M(v) Ĩ(u, v) cos[(2i+1)uπ / 2N] cos[(2j+1)vπ / 2M]
where
C_N(u) = 1/√2 for u = 0 ;  C_N(u) = 1 for u = 1, ..., N−1
574
DCT Basis Functions
• Displayed as 8 x 8 images:
575
DCT vs. DFT
• JPEG is based on quantizing and re-organizing block DCT
coefficients.
• The DCT is very similar to the DFT. Why use the DCT?
• First, O(N² log N²) algorithms exist for the DCT - slightly faster than the
DFT: all real, integer-only arithmetic.
• The DCT yields better-quality compressed images than the
DFT, which suffers from more serious block artifacts.
• Reason: The DFT implies the image is N-periodic, whereas the
DCT implies the once-reflected image is 2N-periodic.
576
Periodicity Implied by DFT
• A length-N (1-D) signal:
N
• Periodic extension implied by the DFT:
N
577
Periodicity Implied by DCT
• Periodic extension of reflected signal implied by the
DCT:
2N
• In fact (good exercise): The DFT of the length-2N
reflected signal yields (two periods of) the DCT of the
length-N signal.
578
Quality Advantage of DCT Over DFT
• The periodic extension of the DFT contains high
frequency discontinuities.
• These high frequencies must be well-represented
in the code, or the edge of the blocks will degrade,
creating visible blocking artifacts.
• The DCT does NOT have meaningless
discontinuities to represent - so less to encode, and
so, higher quality for a given CR.
579
580
Overview of JPEG
• The commercial industry standard - formulated by the CCITT
Joint Photographic Experts Group (JPEG).
• Uses the DCT as the central transform.
• Overall JPEG algorithm is quite complex. Three components:
JPEG baseline system (basic lossy coding)
JPEG extended features (12-bit; progressive; arithmetic)
JPEG lossless coder
581
JPEG Standard Documentation
• The written standard is gigantic. Best
resource (Pennebaker and Mitchell):
582
JPEG Baseline Flow Diagram
[Flow diagram: image I partitioned into 8x8 blocks → DCT → quantization (using quantization tables) → rearrange data (zig-zag) → DC coefficients coded by DPCM, AC coefficients coded by RLC → entropy coding (Huffman coding tables) → compressed data.]
583
JPEG Baseline Algorithm
(1) Partition image into 8 x 8 blocks, transform each
block using the DCT. Denote these blocks by Ĩ_k(u, v).
(2) Pointwise divide each block by an 8 x 8 user-defined
normalization array Q(u, v). This is stored /
transmitted as part of the code. Usually designed
using sensitivity properties of human vision.
(3) The JPEG committee performed a human study on the
visibility of DCT basis functions under a specific
viewing model.
584
JPEG Baseline Algorithm
(3) Uniformly quantize the result
I_k^Q(u, v) = INT[ Ĩ_k(u, v) / Q(u, v) + 0.5 ]
yielding an integer array with many zeros.
585
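A direct NumPy sketch of the block DCT followed by the pointwise quantization of step (3). It uses the DCT definition from the slides, whose constant scaling differs from the DCT used inside actual JPEG codecs, so the numbers will not reproduce the worked example that follows; it is illustrative only. Names are my own.

```python
import numpy as np

def dct2(block):
    """2-D DCT of an N x M block, using the formula from the slides."""
    N, M = block.shape
    i, j = np.mgrid[0:N, 0:M]
    C = lambda u: (1 / np.sqrt(2)) if u == 0 else 1.0
    out = np.zeros((N, M))
    for u in range(N):
        for v in range(M):
            basis = np.cos((2*i + 1) * u * np.pi / (2*N)) * np.cos((2*j + 1) * v * np.pi / (2*M))
            out[u, v] = 4 * C(u) * C(v) / (N * M) * (block * basis).sum()
    return out

# standard JPEG luminance normalization array (the full table shown on the slide)
Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
              [12, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68, 109, 103, 77],
              [24, 35, 55, 64, 81, 104, 113, 92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103, 99]])

block = np.random.randint(0, 256, (8, 8)).astype(float)
quantized = np.round(dct2(block) / Q).astype(int)   # step (3): pointwise divide and round
print(quantized)                                    # mostly zeros away from the upper-left corner
```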
JPEG Quantization Example
• A block DCT (via integer-only algorithm)
I k 
1260 –1 –12
–23 –17 –6
–11 –9 –2
–7 –2
0
–1 –1
1
2
0
2
–1
0
0
–3
2 –4
–5
–3
2
1
2
0
–1
–2
2
–3
0
1
0
–1
0
2
–2
0
–1
0
–1
1
2
1
–3
0
–1
0
1
1
1
–1
1
–1
0
0
1
–1
–1
0
• Typical JPEG normalization array:
Q =
16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99
586
JPEG Quantization Example
• The resulting quantized DCT array:
I Qk =
79 0 –1
–2 –1 0
–1 –1 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
• Notice all the zeros. The DC value IkQ (0, 0) is in the
upper left corner.
• This works because the DCT has good energy
compaction, leading to a sparse image representation.
587
Data Re-Arrangement
• Rearrange quantized AC values
I_k^Q(u, v) ; (u, v) ≠ (0, 0)
• This array contains mostly zeros, especially at high
frequencies. So, rearrange the array into a 1-D
vector using zig-zag ordering:
 0  1  5  6 14 15 27 28
 2  4  7 13 16 26 29 42
 3  8 12 17 25 30 41 43
 9 11 18 24 31 40 44 53
10 19 23 32 39 45 52 54
20 22 33 38 46 51 55 60
21 34 37 47 50 56 59 61
35 36 48 49 57 58 62 63
588
Data Re-Arrangement Example
• Reordered quantized block from the
previous example:
[ 79  0  -2  -1  -1  -1  0  0  -1  (55 0's) ]
• Many zeros.
589
Handling of DC Coefficients
• Apply simple DPCM to the DC values IkQ (0, 0) between
adjacent blocks to reduce the DC entropy.
• The difference between the current-block DC value and the
left-adjacent-block DC value is found:
e(k) = I Q (0, 0) - I Q (0, 0)
k
k-1
• The differences e(k) are losslessly coded by a lossless JPEG
Huffman coder (with agreed-upon table).
• The first column of DC values must be retained to allow
reconstruction.
590
Huffman Coding of DC Differences
•
A Huffman Code is generated and stored/sent indicating which category
(SSSS) the DC value falls in (See Pennebaker and Mitchell for the Huffman
codes).
• SSSS is also the # bits allocated to code the DPCM differences and sign.
591
Run-Length Coding of AC
Coefficients
• The AC vector contains many zeros.
• By using RLC considerable compression is attained.
• The AC vector is converted into 2-tuples (Skip, Value),
where
Skip = number of zeros preceding a non-zero value
Value = the following non-zero value.
• When final non-zero value encountered, send (0,0) as
end-of-block.
592
Huffman Coding of AC Values
• The AC pairs (Skip, Value) are coded using another JPEG Huffman coder.
• A Huffman code represents which category (SSSS) the AC pair falls in.
• RRRR bits are used to represent the runlength of zeros, SSSS additional bits are used
to represent the AC magnitude and sign.
• See Pennebaker and Mitchell for details!
593
JPEG Decoding
• Decoding is accomplished by reversing the
Huffman coding, RLC and DPCM coding to
recreate I_k^Q
• Then multiply by the normalization array
to create the lossy DCT
Ĩ_k^lossy = Q ⊙ I_k^Q   (pointwise)
• The decoded image block is the IDCT of
the result:
I_k^lossy = IDCT[ Ĩ_k^lossy ]
594
JPEG Decoding
• The overall compressed image is recreated by putting
together the compressed 8 x 8 pieces:
I^lossy = ⋃_k I_k^lossy
• The compressions that can be attained range over:
8 : 1 (very high quality)
16 : 1 (good quality)
32 : 1 (poor quality for most applications)
595
JPEG Examples
Original
512 x 512
Barbara
596
16 : 1
597
32 : 1
598
64 : 1
599
Original
512 x 512
Boats
600
16 : 1
601
32 : 1
602
64 : 1
DEMO
603
COLOR JPEG
• Basically, RGB image is converted to YCrCb image
(where Y = intensity, CrCb = chrominances)
• Each channel Y, Cr, Cb is separately JPEG-coded (no
cross-channel coding)
• Chrominance images are subsampled first. This is a
strong form of “pre-compression”!
• Different quantization / normalization table is used.
604
Digital Color Images
• YCrCb is the modern color space used for digital images and videos.
• Similar but simpler definition:
Y = 0.299R + 0.587G + 0.114B
Cr = R − Y
Cb = B − Y
•
Used in modern image and video codecs like JPEG and H.264.
• Why use YCrCb? Reduced bandwidth. Chrominance info can be sent in a
fraction of the bandwidth of luminance info.
•
In addition to color information being “lower bandwidth,” the chrominance
components are entropy reduced.
DEMO
605
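A NumPy sketch of the simplified color transform above, plus a 4:2:0-style chroma subsampling by 2x2 averaging (averaging is one common choice; codecs differ in the exact filter, and codec-standard YCrCb additionally scales and offsets the chroma channels). Names are my own.

```python
import numpy as np

def rgb_to_ycrcb(rgb):
    """Color-difference transform using the simplified definition on the slide:
    Y = 0.299 R + 0.587 G + 0.114 B,  Cr = R - Y,  Cb = B - Y."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    Y = 0.299 * R + 0.587 * G + 0.114 * B
    return np.stack([Y, R - Y, B - Y], axis=-1)

def subsample_420(C):
    """4:2:0-style chroma subsampling: keep one chroma sample per 2x2 block (by averaging)."""
    return C.reshape(C.shape[0] // 2, 2, C.shape[1] // 2, 2).mean(axis=(1, 3))

rgb = np.random.rand(4, 4, 3) * 255
ycrcb = rgb_to_ycrcb(rgb)
Cr_sub = subsample_420(ycrcb[..., 1])
print(ycrcb.shape, Cr_sub.shape)     # (4, 4, 3) -> chroma reduced to (2, 2)
```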
Chroma Sampling
606
Chroma Sampling
• Since color is a regional attribute of space, it can be
sampled more heavily than luminance.
• Image and video codecs almost invariably subsample
the chromatic components prior to compression.
• While not always referred in this way, this is really the
first step of compression.
• Taken together, color differencing to reduce entropy,
and color sampling both significantly improve image
compression performance.
607
Chroma Sampling Formats
• Modern chroma sampling formats:
 4:4:4 – sampling rates of Y, Cr, Cb are the same
 4:2:2 – sampling rates of Cr, Cb are ½ that of Y
 4:1:1 – sampling rates of Cr, Cb are ¼ that of Y
 4:2:0 – sampling rates of Cr, Cb are also ¼ that of Y
608
Y-Cb-Cr Sampling
• Used in JPEG and MPEG.
= Cr, Cb samples
= Y samples
4:4:4
4:2:2
4:1:1
4:2:0 (JPEG)
• Note: The chroma 4:2:0 samples are just calculated
for storage or transport. For display, 4:2:2, 4:1:1,
and 4:2:0 samples are interpolated back to 4:4:4.
609
RGB  YCrCb Examples
RGB
Y
Cr
Cb
YCbCr 4:2:0
610
RGB  YCrCb Examples
RGB
Cr
Y
YCbCr 4:2:0
Cb
611
RGB  YCrCb Examples
Cr
RGB
Y
YCbCr 4:2:0
Cb
612
RGB  YCrCb Examples
RGB
Cr
Y
YCbCr 4:2:0
Cb
613
JPEG Color Quantization Table
Q
17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
• Sparser sampling and more severe quantization of the
chrominance values is acceptable.
• Color is primarily a regional attribute and contains less detail
information.
614
Color JPEG Examples
2.6:1
23:1
15:1
46:1
144:1
615
Wavelet Image Compression
616
Wavelet Image Coding
• Idea: Compress images in the wavelet domain.
• Image blocks are not used. Hence: no blocking
artifacts!
• Forms the basis of the JPEG2000 standard.
• Also requires perfect reconstruction filter banks.
617
Compression Via DWT
[Diagram: image I → analysis filters H0, H1, …, H_{P−1}, each followed by downsampling (D) → coefficient quantization → upsampling (U) → synthesis filters G0, G1, …, G_{P−1} → summation (S) → reconstructed image Î. The filters Hp form an orthogonal or bi-orthogonal basis with reconstruction filters Gp.]
618
Wavelet Decompositions
• Recall that wavelet filters separate or decompose the
image into high- and low-frequency bands, called
subbands. Each frequency band can be coded separately.
(0, 0)
Low, Low
High, Low
Low, High
High, High
• These are called quadrature filters.
619
Wavelet Decompositions
• Typically frequency division is done iteratively on
the lowest-frequency bands:
(0, 0)
(0, 0)
LLHL
HL
LLLH LLHH
LH
HH
"Pyramid" Wavelet
Transform
"Tree-Structured"
Wavelet Transform
620
Wavelet Filter Hierarchy
• Filter outputs are wavelet
coefficients.
• Subsampling yields NM nonredundant coefficients. The
image can be exactly
reconstructed from them.
• Idea: Wavelet coefficients
quantized (like JPEG): higher
frequency coefficients
quantized more severely.
• The JPEG2000 Standard is
wavelet based.
[Diagram: three-level analysis filter bank – the image I is split by H0 (low) and H1 (high), each followed by ↓2; the low band is split again by H0/H1 and ↓2, and then once more.]
621
Bi-Orthogonal Wavelets
• Recall the perfect reconstruction condition:
H0(ω)G0(ω) + H1(ω)G1(ω) = 2 for all 0 ≤ ω ≤ π   (DSFT)
hence also  H̃0(u)G̃0(u) + H̃1(u)G̃1(u) = 2 for all 0 ≤ u ≤ N−1   (DFT)
and
H0(ω + π)G0(ω) + H1(ω + π)G1(ω) = 0 for all 0 ≤ ω ≤ π   (DSFT)
hence also  H̃0(u + N/2)G̃0(u) + H̃1(u + N/2)G̃1(u) = 0 for all 0 ≤ u ≤ N−1   (DFT)
• The filters H0, H1, G0, G1 can be either orthogonal
quadrature mirror filters (one filter H0 specifies all filters)
• Or H0 and H1 can be specified jointly which implies a biorthogonal DWT basis expansion.
622
Bi-Orthogonal Wavelets
• Bi-orthogonal wavelets are useful since they can be
made even-symmetric.
• This is important because much better results can be
obtained in image compression.
• Even though they do not have as good energy
compaction as orthogonal wavelets.
623
Bi-Orthogonal Wavelets
• Wavelet based image compression usually operates on large
image blocks (e.g., 64x64 in JPEG2000)
• It is much more efficient to realize block DWTs using cyclic
convolutions (shorter).
• As with the DCT, symmetric extended signals (without
discontinuities) give better results than periodic extended.
• Symmetric wavelets give better approximations to symmetric
extended signals.
• There are standard bi-orthogonal wavelets.
624
Daubechies 9/7 Bi-Orthogonal
Wavelets
Analysis LP (H0)       Analysis HP (H1)      Synthesis LP (G0)     Synthesis HP (G1)
H0(0) =  0.026749      H1(0) =  0            G0(0) =  0            G1(0) =  0.026749
H0(1) = -0.016864      H1(1) =  0.091272     G0(1) = -0.091272     G1(1) =  0.016864
H0(2) = -0.078223      H1(2) = -0.05754      G0(2) = -0.05754      G1(2) = -0.078223
H0(3) =  0.266864      H1(3) = -0.59127      G0(3) =  0.59127      G1(3) = -0.266864
H0(4) =  0.602949      H1(4) =  1.115087     G0(4) =  1.115087     G1(4) =  0.602949
H0(5) =  0.266864      H1(5) = -0.59127      G0(5) =  0.59127      G1(5) = -0.266864
H0(6) = -0.078223      H1(6) = -0.05754      G0(6) = -0.05754      G1(6) = -0.078223
H0(7) = -0.016864      H1(7) =  0.091272     G0(7) = -0.091272     G1(7) =  0.016864
H0(8) =  0.026749      H1(8) =  0            G0(8) =  0            G1(8) =  0.026749
These are used in the JPEG2000 standard.
625
D9/7 DWT of an Image
626
Zero-Tree Encoding Concept
• There are many proposed wavelet-based image
compression techniques. Many are complex.
• A simple concept is zero-tree encoding: Similar to
JPEG, scan from lower to higher frequencies and run-length code the zero coefficients.
(0, 0)
LLHL
HL
LLLH LLHH
LH
HH
Zig-zag scanning to find
zeroed coefficients
627
Wavelet Decoding
• Decoding requires
– Decompression of the
wavelet coefficients
– Reconstructing the
image from the
decoded coefficients:
[Diagram: the decoded coefficients Î_code are upsampled (↑2), filtered by the synthesis filters G0 and G1, and summed (S) at each of three levels to reconstruct the image.]
628
JPEG2000 Wavelet
Compression Example
Original
512 x 512
Barbara
629
16 : 1
630
32 : 1
631
64 : 1
632
128 : 1
633
Original
512 x 512
Boats
634
16 : 1
635
32 : 1
636
64 : 1
637
128 : 1
638
Comments on Wavelet Compression
• This has been a watered-down exposition of wavelet coding!
• Reason: there was never much adoption of JPEG2000.
• Why? Much more complex than “old” JPEG. A consideration
in an era of cheap memory.
• Questions of intellectual property (patents).
• Performance primarily better in the high compression regime.
639
Case Study: Deep Image
Compression
640
Deep Perceptual Image Compression
• From Balle et al, find it here.*
• An autoencoder approach with some perceptual twists.
• Main differences:
– Use perceptually-motivated “divisive normalization transform” instead of ReLu or other
usual activation function.
– Uniformly quantize the code at the bottleneck (simulated during training)
– Add entropy-minimization term to the loss function
*Balle et al, "End-to-end optimized image compression," arXiv 1611.01704v3, Mar. 2017.
641
Basic Compression Architecture
[Diagram: image → 256 x 9 x 9 convolution → DNT → 4x4 downsample → 192 x 5 x 5 convolution → DNT → 2x2 downsample → 192 x 5 x 5 convolution → DNT → 2x2 downsample → uniform quantizer Q → entropy code → code C.]
• The uniform quantizer is implemented in two ways. During testing, it simply
rounds to integer values. No weighting function.
• During training, since quantization is highly discontinuous, an additive model
is used.
• Quantization is modeled as additive uniform white noise U[-0.5, 0.5].
• Final code C accomplished by entropy coder CABAC (more complex than
Huffman)
642
Divisive Normalization Transform
(DNT)
• Basic vision science model of retinal (or cortical) neural normalization for decades.
• Basically, it is a spatial normalization of neural outputs by neighboring neural outputs.
There are many variations of this model.
• Occurs throughout sensory nervous system, as a way of adaptive gain control (AGC).
• Good model of the outputs of retinal ganglion neurons.
643
Divisive Normalization Transform
(DNT)
• Suppose that w_{k,i}(m, n) are the outputs of filter i in neural layer k (downsampled here).
• Normalize w_{k,i}(m, n) by dividing by neighboring values from the other channels j.
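A minimal sketch of a divisive normalization of this kind, assuming the common GDN-style form y_i = w_i / sqrt(β_i + Σ_j γ_ij·w_j²); the parameters beta and gamma are illustrative stand-ins (in the actual network they are learned):

```python
import numpy as np

def divisive_normalization(w, beta, gamma):
    """
    w     : array of shape (C, H, W) - responses of C filter channels
    beta  : array of shape (C,)      - per-channel offsets (learned in practice)
    gamma : array of shape (C, C)    - cross-channel weights (learned in practice)
    Each channel is divided by a combination of the squared responses of all channels.
    """
    energy = np.tensordot(gamma, w ** 2, axes=([1], [0]))   # shape (C, H, W)
    return w / np.sqrt(beta[:, None, None] + energy)

# Tiny usage example with hypothetical parameter values.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8, 8))
y = divisive_normalization(w, beta=np.ones(4), gamma=np.full((4, 4), 0.1))
print(y.shape)  # (4, 8, 8)
```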
644
Decompression and Loss
[Diagram: code C → entropy decode → 192 × 5 × 5 convolution → 4×4 upsample → IDNT → 192 × 5 × 5 convolution → 4×4 upsample → IDNT → 256 × 9 × 9 convolution → 4×4 upsample → IDNT → decoded image]
• The “inverse” DNT is an approximation (Appendix of paper)
• The loss function is defined as a weighted sum of the MSE (b/w the original
image In, and the compressed image Jn) and the entropy of the code Cn.
ε  J n ,I n   λ  E  Cn  .
645
Compression
Example
• Many others in the
paper.
646
Example: JPEG
647
Example: Perceptual Autoencoder
648
Example: JPEG 2000
649
Example: Rate-Distortion Curves
• Very promising.
• Might see a new “JPEG” based on DLNs sometime!
• Many other competing deep compressors
• A hot research topic.
650
Comments
• So far we have been processing images –
now we will try to understand them –
image analysis … onward to Module 7.
651
Module 7
Image Analysis I
• Reference Image Quality Prediction
• No-Reference Image Quality Prediction
• Deep Image Quality
• Edge Detection
QUICK INDEX
652
IMAGE QUALITY
ASSESSMENT
653
IMAGE QUALITY ASSESSMENT
• What determines the quality of an image?
• The ultimate receiver is the human eye –
subjective judgement is all that matters.
• But how can this be determined by an
algorithm? How can it be quantified?
654
One Method of Subjective Quality
Assessment …..
655
An Important Problem!
656
Overall Natural Image
Communication System
[Diagram: the natural image signal is sensed & digitized and passed through front-end digital processing (the Natural Image Transmitter), then a channel (the Image Channel), then back-end digital processing, mapping & display, producing the perceptual image signal (the Natural Image Receiver)]
657
Sources of Image Distortion
[Diagram: the same communication chain as above — distortions can be introduced at every stage, from sensing & digitizing through the channel to back-end processing and display]
658
Image Distortions
• Many distortions commonly occur – often
in combination.
– Blocking artifacts (compression)
– Ringing (compression)
– Mosaicking (block mismatches)
– False contouring (quantization)
– Blur (acquisition or compression)
– Additive noise (acquisition or channel)
– Etc.
659
Reference IQA
660
“Reference” Image QA
• An “original” high quality image is presumed to
be known.
• Very useful for assessing effectiveness of image
compression and communication algorithms.
• There exist effective algorithms
661
MSE / PSNR
• The mean-squared error (MSE) is the long-standing
traditional image QA method.
• Given original image I and observed image J:
MSE(I, J) = (1/NM) Σ_{i=0}^{N−1} Σ_{j=0}^{M−1} [I(i, j) − J(i, j)]²
• The Peak Signal-to-Noise Ratio (PSNR) is:
PSNR(I, J) = 10 log₁₀ [ L² / MSE(I, J) ]
where L is the range of allowable gray scales (255 for 8 bits).
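A small sketch of these two definitions (the image and noise values in the usage example are made up):

```python
import numpy as np

def mse(I, J):
    # Mean-squared error between two same-sized grayscale images.
    I = I.astype(np.float64)
    J = J.astype(np.float64)
    return np.mean((I - J) ** 2)

def psnr(I, J, L=255.0):
    # Peak Signal-to-Noise Ratio in dB; L is the gray-scale range (255 for 8 bits).
    m = mse(I, J)
    return np.inf if m == 0 else 10.0 * np.log10(L ** 2 / m)

# Usage: compare an 8-bit image with a noisy copy.
rng = np.random.default_rng(0)
I = rng.integers(0, 256, size=(64, 64))
J = np.clip(I + rng.normal(0, 5, size=I.shape), 0, 255)
print(mse(I, J), psnr(I, J))
```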
662
Advantages of MSE/PSNR
• Computationally simple.
• Analytically easy to work with.
• Easy to optimize algorithms w.r.t. MSE.
• Effective in high SNR situations.
663
Disadvantages of MSE/PSNR
• Very poor correlation with human visual
perception!
664
Einstein with different distortions:
(a) original image
(b) mean luminance shift, MSE = 309
(c) contrast stretch, MSE = 306
(d) impulse noise, MSE = 313
(e) Gaussian noise, MSE = 309
(f) blur, MSE = 308
(g) JPEG compression, MSE = 309
(h) spatial shift (to the left), MSE = 871
(i) spatial scaling (zoom out), MSE = 694
(j) rotation (CCW), MSE = 590
Images (b)-(g) have nearly identical MSEs but very different visual quality.
665
Human Subjectivity
• Human opinion is the ultimate gauge of image quality.
• Measuring true image quality requires asking many
subjects to view images under calibrated test conditions.
• The resulting Mean Opinion Scores (MOS) are then
correlated with QA algorithm performance.
• For decades, MSE had little competition despite 40 years of
research!
666
Structural Similarity
667
Structural Similarity-Based Models
• A modern, successful approach: Measure loss of
structure in a distorted image.
• Basic idea: Combine local measurements of
similarity of luminance, contrast, structure into a
local measure of quality.
• Perform a weighted average of the local measure
across the image.
668
An Aside: Weber’s Law
• Weber’s Law: The noticeability of a change in a
perceptual stimulus S (weight, brightness, loudness, taste,
odor, pain) depends on the percent or ratio of change.
• It may not be noticeable at all unless it exceeds some
threshold:
ΔS / S > ε,
where the threshold change is called a just noticeable difference (JND).
669
Structural Similarity Index (SSIM)
• The SSIM Index expresses the similarity of I and J at a point
SSIM_{I,J}(i, j) = L_{I,J}(i, j) · C_{I,J}(i, j) · S_{I,J}(i, j)
where
LI,J(i, j) is a measure of local luminance similarity
CI,J(i, j) is a measure of local contrast similarity
SI,J(i, j) is a measure of local structure similarity
670
Comparing Two Numbers
• The DICE index: Given two positive numbers A and B:
DICE(A, B) = 2AB / (A² + B²)
then 0 < DICE(A, B) ≤ 1, with the maximum value 1 attained when A = B.
• Avoid divide-by-zero:
DICE(A, B) = (2AB + ε) / (A² + B² + ε)
671
[Plot: DICE(A, B) as a function of the two numbers, peaking at 1 along the line A = B]
672
Luminance Similarity
• Luminance similarity
2μ I (i, j)μ J (i, j) + C1
2μ Iμ J + C1
LI,J (i, j) = 2
= 2
2
μ I (i, j) + μ J (i, j) + C1 μ I + μ 2J + C1
where
μ I (i, j) =
P
Q
 
w(p, q)I(i+p, j+q)
p=-P q=-Q
P
Q
 
w(p, q) = 1
p=-P q=-Q
w(p, q) is an isotropic, unit area weighting function
and C1 is a stabilizing constant (later)
673
[Plot: the luminance similarity L_{I,J} as a function of μ_I and μ_J]
674
Luminance Masking
• Weber's Perceptual Law (applied to local average luminance): Λ ≡ ΔL / L_AVE > τ.
• The luminance term embodies Weber masking: let C1 = 0; then in a patch
L_{I,J} ≈ 2 μ_I μ_J / (μ_I² + μ_J²)
but if a patch luminance of the test image differs only by ΔL from the corresponding reference image patch, μ_J = μ_I + ΔL,
then taking L_AVE = μ_I yields:
L_{I,J} ≈ 2(Λ + 1) / [(Λ + 1)² + 1]
• Thus SSIM's luminance term depends approximately just on the ratio Λ = ΔL / L_AVE. As long as ΔL << L_AVE, then L_{I,J} ≈ 1.
675
Contrast Similarity
• Contrast similarity
C_{I,J}(i, j) = [2 σ_I(i, j) σ_J(i, j) + C2] / [σ_I²(i, j) + σ_J²(i, j) + C2] = (2 σ_I σ_J + C2) / (σ_I² + σ_J² + C2)
where
σ_I(i, j) = { Σ_{p=−P}^{P} Σ_{q=−Q}^{Q} w(p, q) [I(i+p, j+q) − μ_I(i, j)]² }^{1/2}
and C2 is a stabilizing constant.
676
[Plot: the contrast similarity C_{I,J} as a function of σ_I and σ_J]
677
Contrast Masking
• The contrast term includes divisive normalization by local image energy (both test and reference).
• Suppose a local patch contrast changes in the test image relative to the corresponding reference patch: σ_J = σ_I + Δσ.
• Then
C_{I,J} ≈ 2(Θ + 1) / [(Θ + 1)² + 1],  where Θ = Δσ / σ_I,
which is a Weber view of contrast masking.
678
Contrast Masking
679
Structural Similarity
• Structural similarity
S_{I,J}(i, j) = [σ_{IJ}(i, j) + C3] / [σ_I(i, j) σ_J(i, j) + C3] = (σ_{IJ} + C3) / (σ_I σ_J + C3)
where
σ_{IJ}(i, j) = Σ_{p=−P}^{P} Σ_{q=−Q}^{Q} w(p, q) [I(i+p, j+q) − μ_I(i, j)] [J(i+p, j+q) − μ_J(i, j)]
and C3 is a stabilizing constant.
680
[Plot: the structure similarity S_{I,J} as a function of σ_{IJ}, σ_I, and σ_J]
681
SSIM Index Flow Diagram
[Flow diagram: the reference image f and test image f̂ are compared window-by-window, combining luminance, contrast, and structure similarities to form the SSIM map]
682
Properties of SSIM
Important properties:
(1) Symmetry: SSIM_{I,J}(i, j) = SSIM_{J,I}(i, j)
(2) Boundedness: |SSIM_{I,J}(i, j)| ≤ 1
(3) Unique Maximum: SSIM_{I,J}(i, j) = 1 if and only if the images are locally identical:
I(i+p, j+q) = J(i+p, j+q)
for −P ≤ p ≤ P, −Q ≤ q ≤ Q.
683
Stabilizing Constants
• C1, C2 and C3 are stabilizers – in case the local means
or contrasts are very small.
• For gray-scale range 0-255, C1 = (0.01·255)2, C2 =
(0.03·255)2, C3 = C2/2 work well and robustly.
684
Structural Similarity Index
• When C3 = C2/2 then the SSIM index simplifies to (Exercise)
SSIM_{I,J} = [(2 μ_I μ_J + C1)(2 σ_{IJ} + C2)] / [(μ_I² + μ_J² + C1)(σ_I² + σ_J² + C2)]
which is the most common form of the SSIM index.
• If C1 = C2 = 0, then it is the “Universal Quality Index” (UQI)
UQI_{I,J} = [(2 μ_I μ_J)(2 σ_{IJ})] / [(μ_I² + μ_J²)(σ_I² + σ_J²)]
which does not predict quality as well as SSIM and is more
unstable.
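A minimal sketch of this common SSIM form, computed with a uniform local window for simplicity (the standard implementation instead uses an 11×11 Gaussian window with σ = 1.5):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(I, J, win=11, L=255.0, K1=0.01, K2=0.03):
    # Local SSIM map using uniform windows; C3 = C2/2 is folded in as above.
    I = I.astype(np.float64)
    J = J.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2

    mu_I = uniform_filter(I, win)
    mu_J = uniform_filter(J, win)
    var_I = uniform_filter(I * I, win) - mu_I ** 2
    var_J = uniform_filter(J * J, win) - mu_J ** 2
    cov_IJ = uniform_filter(I * J, win) - mu_I * mu_J

    num = (2 * mu_I * mu_J + C1) * (2 * cov_IJ + C2)
    den = (mu_I ** 2 + mu_J ** 2 + C1) * (var_I + var_J + C2)
    return num / den

def mean_ssim(I, J):
    # Mean SSIM: average the SSIM map over the whole image.
    return ssim_map(I, J).mean()
```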
685
SSIM Map
• Displaying SSIM_{I,J}(i, j) as an image is called an SSIM Map. It is an effective way of visualizing where the images I, J differ.
• The SSIM map depicts where the quality of
one image is flawed relative to the other.
686
(a)
(b)
(c)
(d)
(a) reference image; (b) JPEG compressed;
(c) absolute difference; (d) SSIM Map.
687
(a)
(b)
(c)
(d)
(a) original image; (b) additive white Gaussian noise;
(c) absolute difference; (d) SSIM Map.
688
Mean SSIM
• If SSIMI,J(i, j) is averaged over the image, then a
single scalar metric is arrived at. The Mean SSIM is
 1  N-1 M-1
SSIM(I,J ) = 
   SSIM I,J (i, j)
 NM  i=0 j=0
• The Mean SSIM correlates extraordinarily well
with human response as measured in large human
studies as Mean Opinion Score (MOS).
689
SSIM vs. MOS
On a broad database of images distorted by JPEG, JPEG2000, white noise, Gaussian blur, and fast-fading channel noise.
The curve is the best-fitting logistic function; what is important is that the data cluster closely about the curve.
690
SSIM vs. PSNR
Scatter Plots
691
Mean SSIM Examples
Einstein altered by different distortions:
(a) reference image
(b) impulse noise: MSE = 313, SSIM = 0.730
(c) Gaussian noise: MSE = 309, SSIM = 0.576
(d) blur: MSE = 308, SSIM = 0.641
(e) JPEG compression: MSE = 309, SSIM = 0.580
DEMO
692
Multi-Scale SSIM (MS-SSIM)
Best existing
algorithm
693
Multi-Scale SSIM
• All terms are combined in a product, possibly with exponent weights (Q = coarsest scale):
MS-SSIM_{I,J} = [L_{I,J}]^{α_Q} · Π_{q=1}^{Q} [C_{I,J}^{(q)}]^{β_q} [S_{I,J}^{(q)}]^{γ_q}
• The exponents used vary, but a common choice is*
β1 = γ1 = 0.0448
β2 = γ2 = 0.2856
β3 = γ3 = 0.3001
β4 = γ4 = 0.2363
α5 = β5 = γ5 = 0.1333
*Based on a small human study of distorted image comparisons.
Note the shape of the exponent ‘curve’: bad scores are penalized more at mid-scales (frequencies), similar to the shape of the CSF.
694
European Study of Image
Quality Assessment Algorithms
• A three-university European study conducted
across three countries (Finland, Italy, and Ukraine)
gathered and analyzed more than 250,000 human
judgments (MOS) and 17 distortion types
• Known as Tampere Image Database (TID):
http://www.ponomarenko.info/tid2008.htm
695
TID Study
696
MS-SSIM / SSIM Usage
• Cable/Internet: Streaming TV providers Netflix, AT&T, Comcast,
Discovery, Starz, NBC, FOX, Showtime, Turner, PBS, etc. use SSIM to
control streaming video quality from cable head-end and the Cloud.
• Broadcast: Huge international operators like British Telecom and
Comcast continuously run MS-SSIM on many dozens of live HD
channels in real-time to control broadcast (through the air) encoding.
• Satellite: Used worldwide: the Sky companies (Sky Brazil, Sky Italy,
BskyB UK, Tata Sky India), Nine and Telstra in Australia, Oi and TV
Globo in Brazil (many more) to monitor and control picture quality on
their channel line-ups.
• Discs: Used to control the encoding of video onto DVDs and Blu-Rays
(Technicolor, others)
• Equipment: Broadcast encoders made by Cisco, Motorola, Ericsson,
Harmonic, Envivio, Intel, TI, etc etc … all rely on SSIM for product
control (best possible picture quality)
697
Streaming Video Pipeline
[Diagram: streaming video pipeline — (1) video source: raw video is encoded to H.264/HEVC; (2) video transmit over cable, satellite, 4G/5G, or WiFi; (3) video receiver. A perceptual encoder control loop inspects the encode with SSIM and adjusts the encoder.]
2015 Primetime Emmy Award
Video on SSIM
700
No-Reference IQA
701
Reference vs. No-Reference
“Reference” IQA algorithms require that an “original,” presumably high quality image be available for comparison. Hence they are really “perceptual fidelity” algorithms.
“No-reference” IQA models assume no “original image” is available to compare. Hence they are pure “perceptual quality” algorithms.
[Diagram: the reference image and the distorted image both feed a “Reference” QA algorithm; only the distorted image feeds a “No-reference” QA algorithm]
702
Blind or No-Reference IQA
• No-Reference (or “Blind”) QA – Wherein
there is no reference image, nor other
information available. The image is
assessed based on its individual appearance.
703
Does this image have predictable distortion artifacts?
704
Blind (NR) Image QA
(advanced topic)
• A very hard problem. An algorithm must answer “What is the
quality of this image” with only the image as input.
• Complicated image content, spatial distribution of distortion,
aesthetics, etc.
• There is recent significant progress on this problem, using:
– Statistical image models
– Machine learning methods
• The first of these we can cover. The other is beyond this course.
705
Is this image of good quality? Why or why not?
706
How about
these?
707
And this one?
708
Which image is of better quality?
709
Case Study: No-Reference IQA
710
BRISQUE
(Blind/Referenceless Image Spatial Quality Evaluator)
• A machine learning-based natural image statistics
model that holds with great reliability for natural
images that are not distorted.
• Given image I, the mean-subtracted contrast-normalized (MSCN) image J is:
J(i, j) = [I(i, j) − m(i, j)] / [s(i, j) + 1]
where
m(i, j) = Σ_{k=−K}^{K} Σ_{l=−L}^{L} w(k, l) I(i−k, j−l)
s(i, j) = { Σ_{k=−K}^{K} Σ_{l=−L}^{L} w(k, l) [I(i−k, j−l) − m(i−k, j−l)]² }^{1/2}
and w(k, l) is a unit-volume gaussian-like weighting function.
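A sketch of the MSCN transform; it assumes a Gaussian weighting window and uses the common variant in which the local mean m(i, j) is subtracted inside the deviation term:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(I, sigma=7.0 / 6.0):
    """
    Mean-subtracted contrast-normalized image J = (I - m) / (s + 1).
    The Gaussian window width here is an illustrative choice; the BRISQUE
    reference implementation uses a small (7x7) gaussian-like window.
    """
    I = I.astype(np.float64)
    m = gaussian_filter(I, sigma)                                   # local mean m(i, j)
    s = np.sqrt(np.abs(gaussian_filter(I * I, sigma) - m * m))      # local deviation s(i, j)
    return (I - m) / (s + 1.0)
```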
711
Natural Image Model
• For natural images, the MSCN image is
invariably very close to unit-normal gaussian
and highly decorrelated!
J(i, j) ~ (1/√(2π)) · exp(−a²/2)
• Why is this important? Because the MSCN
coefficients of distorted images are usually not
very gaussian and can have significant spatial
correlation.
712
[Figure: an image I, its MSCN transform (I − m)/(s + 1) (note the boundary effects), and the normalized histogram of MSCN values]
713
Image Scatter Plot
• Plots of adjacent pixels against each other:
[Figure: scatter plots of adjacent pixel values, for the image I and for its MSCN transform (I − m)/(s + 1)]
714
Distortion Statistics
• Common distortions
change gaussianity:
• But distorted MSCN values can be well-modeled as following a generalized gaussian distribution (GGD):
J_distorted(i, j) ~ D1 · exp( −|a / s|^ν )
where s and the shape exponent ν are fit to the data.
715
Product Model
• Products of adjacent MSCN values should have a
certain shape if they are uncorrelated:
J(i, j) · J(i−1, j+1) ~ D2 · K₀(|a|)
K₀ = modified Bessel function of the second kind
• This happens to be infinite at the origin.
• It does not fit distorted image data (adjacent products).
716
Distortions Introduce Correlation
• General model for both distorted and undistorted MSCN products: the asymmetric generalized gaussian (AGGD):
J(i, j) · J(i−1, j+1) ~ D3 · exp( −(|a| / s_L)^ν ),  a < 0
J(i, j) · J(i−1, j+1) ~ D3 · exp( −(|a| / s_R)^ν ),  a ≥ 0
• When there is no distortion, expect s_L = s_R.
• Correlation introduced by distortion causes asymmetry, since adjacent MSCN values are more likely to be the same sign.
717
Product Histograms
718
BRISQUE Features
• Given an image, compute the histogram of the MSCN coefficients. Fit with a GGD and estimate its shape and scale parameters (2 features).
• Compute products of adjacent pixels along four orientations.
• Compute histograms of all four types of products, and for each estimate the AGGD parameters (mean, shape, s_L, s_R) (16 features).
18 features
719
Multiscale
• BRISQUE works well using features computed
from just two scales.
[Diagram: BRISQUE features are extracted from the image I and from a low-pass filtered (LPF), 2× downsampled copy, giving 36 features in total]
720
[Diagram: training — the 36 BRISQUE features and the target feature (MOS) from the training data are fed to a learning machine (Support Vector Regression)]
721
Concept of Machine Learning
• In our context, plot BRISQUE features and MOS in a high-dimensional space (36 + 1).
• Find a good separation between the clusters of data that are
formed. Or simply regress on (curve fit) them. The machine
learns what features values to associate with MOS.
• Sometimes the number of clusters (number of distortion
types) is known, when training on data with distortion labels.
• SVR trick: Plot features in a much higher dimensional space
(spreading clusters out) then perform classification/regression.
722
Support Vector Regression
Nonlinear mapping
to a high-dimensional
space.
Most classifiers would have a hard
time separating these, even in a low
2D space.
723
Application
[Diagram: a new image I (distorted or not) → BRISQUE features → trained learning machine (SVR with gaussian distance weight*) → predicted human opinion (MOS)]
[Table: linear correlation coefficient over 1000 random train-test divisions of the LIVE Image Quality Database]
724
*Also called Radial Basis Function in ML literature
The Algorithm
Runs in < 1 second in simple Matlab implementation (once loaded).
This variation of BRISQUE also predicts the distortion type.
DEMO
725
Applications of QA Algorithms
• Assessing the quality of algorithm results, such as denoising,
deblur, compression, enhancement, watermarking, etc. A huge
field of applications.
• Controlling the quality of images and videos online, (Netflix,
YouTube, Facebook, Flickr, Google, etc etc), cable TV, digital
cameras, cell phones/tablets, etc etc
• Designing algorithms using QA algorithms such as SSIM or
BRISQUE instead of the old MSE. The hope is that image
processing algorithms will perform much better! This is also a
huge field.
726
Case Study: Deep Image Quality
Prediction
727
Deep Image Quality Prediction
• Oftentimes the innovation is to collect enough sufficiently representative data. This can be very hard.
• This is true in the field of image quality, where the labels are
human scores of perceived image quality.
• These are judgments rather than binary decisions (as in
ImageNet). Collecting scores on an image takes about 10s.
• Getting enough human subjects (need thousands) is especially
challenging. Laboratory studies cannot accomplish this.
728
IQA Database History
• First public subjective picture quality database: the LIVE
Image Quality Database (“LIVE IQA,” 2003).
• Live IQA: About 800 distorted images in five distortion
categories (noise, blur, JPEG, JPEG2000, packet loss)
• Rated by human subjects, yielding about 25,000 scores.
• Still heavily used. Most leading IQA models (SSIM, VIF,
FSIM) have been developed and tested on LIVE IQA.
• Many others have followed LIVE IQA using similar ideas.
• All are far too small and too synthetic (the distortions are not real).
729
Drawbacks of Legacy Databases
[Diagram: legacy databases start from ~25 pristine images and apply single synthetic distortions to produce singly distorted images]
• Limited variety of image content (< 30 pristine images).
• Don't model the real-world distortions found in billions of mobile camera/social media images.
[Figure: examples of synthetic distortions vs. authentic distortions]
730
LIVE “In the Wild” Image Quality
Challenge Database
731
Where are the Images From?
• More than 1100 images collected by hundreds of people in the US and Korea. All are authentically distorted.
• Wide variety of image content: pictures of people, objects, indoor, and outdoor scenes.
• Authentic distortion types, mixtures, and severities.
• Includes day/night and over-/under-exposed images.
732
Old Way
Controlled laboratory settings
• Fixed display device.
• Fixed display resolution.
• Fixed viewing distance.
• Controlled ambient illumination conditions.
• Skewed subject sample – typically
undergraduate and graduate students.
New Way
LIVE in the Wild Challenge
Database
733
Crowdsourcing
734
State-of-the-Art Blind IQA
Models on LIVE Challenge
LIVE IQA DB
LIVE Challenge DB
735
New vs. Old
Property (all prior benchmark databases vs. the In the Wild Mobile Picture Quality Database):
• Authentic, real-world distortions: no / yes
• Highly varied illumination artifacts: no / yes
• Diverse source capture devices: no / yes
• Mixtures of distortions: no / yes
• Reference images: yes / no
• Uncontrolled viewing conditions for the subjective study: no / yes
Paper here: Ghadiyaram et al, “Massive online crowdsourced study of subjective and objective
picture quality,” IEEE Transactions on Image Processing, January 2016.
736
Deep NR IQA Models
• Still not big enough for end-to-end design of really big networks, like ResNet-50. These types of networks require millions of training samples, or they overfit.
• Such studies are planned / in progress but are very expensive!
• However, pretty good results are attained using fine-tuning.
• In a broad study it was found that pre-trained deep
learners can achieve better results than existing shallow
perceptual predictors.
• But not near human performance – a very hard
problem, like object recognition.
737
Tested Deep NR IQA Models
• AlexNet + SVR: Pretrained on ImageNet, outputs of 6th FC layer
(4096) fed to SVR, fine-tuned on LIVE Challenge. AlexNet has 62M
parameters.
• ResNet-50 + SVR: Pretrained on ImageNet, average-pooled features
(2048) fed to SVR, fine-tuned on LIVE Challenge. ResNet-50 has
26M parameters.
• AlexNet + fine-tuning: Pretrained on ImageNet, dropout added before
last FC layer, 8 epochs of training on LIVE Challenge.
• ResNet-50 + fine-tuning: Pretrained on ImageNet, dropout added
before last FC layer, 6 epochs of training on LIVE Challenge.
• Imagewise CNN: Used as a baseline. 8 conv layers with ReLU, 3 FC layers.
738
Baseline Model
Refer to paper here.*
1 1128 111
28  28 128
56  56  64
images
112  112
patches
112  112  48
convolution
+ activation
1
output
2× 2pooling
fully connected
+ activation
*Kim, et al., Deep convolutional neural models for picture quality prediction,
IEEE Signal Processing Magazine, 2017.
739
Outcomes
• Deep models can (but not always) beat shallow perceptual models,
like BRISQUE. CORNIA and FRIQUEE are much more complex than
BRISQUE.
• ResNet-50 shows clear superiority with respect to the other models.
• Human performance is around 0.95+ … long way to go
740
EDGE DETECTION
742
EDGE DETECTION
•
Probably no topic in the image analysis area has been studied as
exhaustively as edge detection.
• Edges are sudden, sustained changes in AVERAGE image intensity
that extend along a contour.
• Edge detection is used as a precursor to most practical image analysis
tasks. Many computer vision algorithms use detected edges as
fundamental processing primitives.
•
Some reasons for this:
- An enormous amount of information is contained in image edges. An image can be largely recognized from its edges alone.
- An edge map requires much less storage space than the image itself. It is a binary contour plot.
743
Edges?
744
Edges? … Shapes?
Neon Color Spreading Illusions
745
Overview of Edge Detection Methods
•
Methods for edge detection have been studied since the mid-1960's.
• It has easily been the most studied problem in the image analysis area.
• Hundreds of different approaches devised - based on just about any
math technique imaginable for deciding whether one group of
numbers is larger than another.
•
The problem became (at last) fairly well-understood in the 1980's.
•
We will study three fundamental approaches:
- Gradient edge detectors
- Laplacian edge detectors (multi-scale)
- Diffusion-based edge detectors
746
Edge detection before computers
747
GRADIENT EDGE DETECTORS
• Oldest (Roberts 1965) but still valuable class of edge detectors.
• For a continuous 2-D function f(x, y), the gradient is a two-element vector:
∇f(x, y) = [f_x(x, y), f_y(x, y)]^T
where
f_x(x, y) = ∂f(x, y)/∂x and f_y(x, y) = ∂f(x, y)/∂y
are the directional derivatives of f(x, y) in the x- and y-directions.
748
Gradient Measurements
• Fact: the direction of the fastest rate of change of f(x, y) at a point (x, y) is the gradient orientation
θ_∇f(x, y) = tan⁻¹[ f_y(x, y) / f_x(x, y) ]
• and that rate of change is the gradient magnitude
M_∇f(x, y) = √[ f_x²(x, y) + f_y²(x, y) ]
• The gradient is appealing for defining an edge
detector, since we expect an edge to locally exhibit
the greatest rate of change in image intensity.
749
Isotropic Gradient Magnitude
• Fact: Take directional derivatives in any two perpendicular directions (say x′ and y′) and M_∇f(x, y) remains unchanged.
• M_∇f(x, y) is rotationally symmetric or isotropic.
• Isotropy is desirable for edge detection – detect edges equally regardless of orientation.
750
Gradient Edge Detector Diagram
• Described by the following flow diagram:
[Flow diagram: image I → discrete differentiation → point operation (edge enhancement) → threshold (detection) → edge map E]
- digital differentiation: digitally approximate ∇I
- point operation: estimate M_∇I = |∇I|
- threshold: decide which large values of M_∇I are likely edge locations
751
Digital Differentiation
• Digital differentiation is differencing. For a 1-D function f(x) which has been sampled to produce f(i), either:
d f(x)/dx |_{x=i} ≈ f(i) − f(i−1)
or
d f(x)/dx |_{x=i} ≈ [f(i+1) − f(i−1)] / 2
• The advantage of the first is that it takes the current
value f(i) into the computation; the advantage of the
second is that it is centered.
752
2-D Differencing
• In two dimensions, the extensions are easy:
∂f(x, y)/∂x |_{(i,j)} ≈ f(i, j) − f(i−1, j)    or    ∂f(x, y)/∂x |_{(i,j)} ≈ [f(i+1, j) − f(i−1, j)] / 2
∂f(x, y)/∂y |_{(i,j)} ≈ f(i, j) − f(i, j−1)    or    ∂f(x, y)/∂y |_{(i,j)} ≈ [f(i, j+1) − f(i, j−1)] / 2
753
Example: 1-D Differencing
• The signal I(i) could be a scan line of an image. Let DI(i) = |I(i) − I(i−1)|.
[Plot: a step-like scan line I(i) and its difference DI(i)]
• A clear, easily thresholded peak!
754
Example: 1-D Differencing With Noise
• The signal J(i) = I(i) + N(i) may be a noisy image scan line. Let DJ(i) = |J(i) − J(i−1)|.
[Plot: the noisy scan line J(i) and its difference DJ(i), in which the edge peak is buried among noise-induced peaks]
• Noise is a huge problem for this type of edge detector.
Differentiation always emphasizes high frequencies (such
as noise)
755
Types of Gradient Edge Detectors
• Define convolution edge templates ∇_x and ∇_y which produce directional derivative estimates:
adjacent:   ∇_x = [1  −1],          ∇_y = [1  −1]^T
centered:   ∇_x = [1  0  −1] / 2,   ∇_y = [1  0  −1]^T / 2
Roberts':   ∇_x = [1  0; 0  −1],    ∇_y = [0  1; −1  0]
• The performance of these three is very similar.
756
Noise-Reducing Variations
• Designed to reduce noise effects by averaging along columns and rows:
Prewitt:  ∇_x = [1 0 −1; 1 0 −1; 1 0 −1] / 3,   ∇_y = [1 1 1; 0 0 0; −1 −1 −1] / 3
Sobel:    ∇_x = [1 0 −1; 2 0 −2; 1 0 −1] / 4,   ∇_y = [1 2 1; 0 0 0; −1 −2 −1] / 4
• These also perform similarly.
757
Gradient Magnitude
• The point operation combines the directional derivative estimates ∇_x and ∇_y into a single estimate of the gradient magnitude.
• The usual estimates:
(A) M(i, j) = √[ ∇_x²(i, j) + ∇_y²(i, j) ]
(B) M(i, j) = |∇_x(i, j)| + |∇_y(i, j)|
(C) M(i, j) = max{ |∇_x(i, j)|, |∇_y(i, j)| }
• The following always hold (Exercise): C ≤ A ≤ B
• (A) is the correct interpretation, but (B) and (C) are cheaper - no square or square root operations.
• (B) often overestimates the edge magnitude, while (C) often underestimates it.
758
The Texas Instruments
Gradient Magnitude
• Better than (B) or (C) is
(D) M(i, j) = max{ |∇_x(i, j)|, |∇_y(i, j)| } + (1/4)·min{ |∇_x(i, j)|, |∇_y(i, j)| }
• Still, the differences between (A)-(D) are slight.
759
Thresholding the Gradient
Magnitude
• Once the estimate M(i, j) is obtained, it is thresholded to find
plausible edge locations.
• This produces the binary edge map E:
E(i, j) = 1 if M(i, j) > τ ;  E(i, j) = 0 if M(i, j) ≤ τ
• Thus:
- a value '1' indicates the presence of an edge at (i, j)
- a value '0' indicates the absence of an edge at (i, j)
• The threshold τ controls the sharpness and magnitude of the edges that are detected.
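Putting the three stages together, a minimal gradient edge detector sketch using the Sobel templates, the magnitude estimate (A), and a global threshold τ:

```python
import numpy as np
from scipy.ndimage import convolve

def gradient_edge_map(I, tau):
    # Sobel differencing, gradient magnitude estimate (A), global threshold.
    I = I.astype(np.float64)
    sobel_x = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]]) / 4.0
    sobel_y = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]]) / 4.0
    gx = convolve(I, sobel_x)
    gy = convolve(I, sobel_y)
    M = np.sqrt(gx ** 2 + gy ** 2)          # estimate (A)
    return (M > tau).astype(np.uint8)       # binary edge map E
```

In practice τ is chosen interactively, as noted below; too low a value produces thick, noisy edges and too high a value produces broken contours.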
760
DEMO
Gradient Edge Detector Advantages
• Simple, computationally efficient
• Natural definition
• Work well on "clean" images
761
Gradient Edge Detector Disadvantages
• Extremely noise-sensitive
• Requires a threshold - difficult to select - usually requires
interactive selection for best results.
• Gradient magnitude estimate will often fall above
threshold over a few pixels distance from the true edge.
So, the detected edges are often a few pixels wide.
• This usually requires some kind of "edge thinning"
operation - usually heuristic.
• The edge contours are often broken - gaps appear. This
requires some kind of "edge linking" operation - usually
heuristic.
762
LAPLACIAN EDGE DETECTORS
• These edge detectors are based on second derivatives.
• For a continuous 2-D function f(x, y), the Laplacian is defined:
∇²f(x, y) = ∂²f(x, y)/∂x² + ∂²f(x, y)/∂y²
• It is a scalar, not a vector.
• Fact: If the directional derivatives are taken in any other two perpendicular directions (say x′, y′), the value of the Laplacian ∇²f(x, y) remains unchanged.
763
Laplacian Edge Detector Diagram
• The Laplacian edge detector is described by the
following flow diagram:
[Flow diagram: image I → discrete differentiation (edge enhancement) → zero-crossing detection (detection) → edge map E]
• Digital differentiation: Digitally approximate ∇²I
• Zero-crossing detection: Discover where the Laplacian crosses the zero level.
764
Reasoning Behind Laplacian
1-D Edge Profile:
Differentiated Once:
Differentiated Again:
• A zero crossing or ZC occurs near the center of the edge
where the slope of the slope changes sign.
765
Digital Twice-Difference
• For a 1-D function f(x) → f(i), we use:
d f(x)/dx |_{x=i} ≈ f(i) − f(i−1) = y(i)
and then
d²f(x)/dx² |_{x=i} ≈ y(i+1) − y(i) = f(i+1) − 2f(i) + f(i−1)
with convolution template:
[1  −2  1]
766
Digital Laplacian
• In two dimensions:
∇²I(x, y) ≈ [I(i+1, j) − 2I(i, j) + I(i−1, j)] + [I(i, j+1) − 2I(i, j) + I(i, j−1)]
with convolution template:
[0  1  0]
[1 −4  1]
[0  1  0]
767
Example: Twice-Differentiation
• Let I(i) be an image scan line, and ∇²I(i) = I(i+1) − 2I(i) + I(i−1).
[Plot: the scan line I(i) and its twice-difference ∇²I(i)]
• Clearly reveals a sharp edge location: a single large-slope ZC and several "smaller" zero-crossings.
768
Example: Twice-Differentiation in Noise
• Scan line J(i) = I(i) + N(i) of a noisy image:
[Plot: the noisy scan line J(i) and its twice-difference ∇²J(i)]
• Numerous spurious ZCs.
• Noise is an even bigger problem for this type of edge detector.
Differentiating twice creates highly amplified noise.
DEMO
769
Smoothed and Multi-Scale
Laplacian Edge Detectors
• The Laplacian is too noise-sensitive to be practical - there is always noise.
• But by modifying it in a simple way it can be made very powerful.
• The basic idea is encapsulated:
[Flow diagram: image I → low-pass filter G (smooths noise, determines edge scale) → Laplacian ∇² → detect zero crossings → edge map E]
• The difference: a linear blur (low-pass filter) is
applied prior to application of the Laplacian.
770
Low-Pass Pre-filter
• The main purpose of a low-pass pre-filter to the Laplacian
is to attenuate (smooth) high-frequency noise while
retaining the significant image structure.
• The secondary purpose of the smoothing filter is to
constrain the scale over which edges are detected.
• Note: a high-pass operation (such as Laplacian) followed
(or preceded) by a low-pass filter will yield a band-pass
filter, if their passbands overlap.
771
Gaussian Pre-filter
•
A lot of research has been done on how the filter G should be selected.
•
It has been found that the optimal smoothing filter in the following two
simultaneous senses:
(i) best edge location accuracy
(ii) maximum signal-to-noise ratio (SNR)
is a Gaussian filter (K is an irrelevant constant):
G(i, j) = K·exp[ −(i² + j²) / 2σ² ]
772
Laplacian-of-Gaussian Edge Detector
• Define the Laplacian-of-a-Gaussian or LoG on I:
J(i, j) = ∇²[G(i, j)*I(i, j)] = G(i, j)*∇²I(i, j) = ∇²G(i, j)*I(i, j)
• The above 3 forms are equivalent since linear operations (differentiation and convolution) commute.
• Best approach: pre-compute the LoG:
∇²G(i, j) ∝ [1 − (i² + j²)/σ²] · exp[ −(i² + j²) / 2σ² ]
and convolve image I with it. (Constant multiplier omitted since only ZCs are of interest.)
773
Polar Form of LoG
• The LoG is isotropic and can be written in polar
form:
∇²G(r) ∝ (1 − r²/σ²) · exp( −r²/2σ² )
[Figure: the 1-D profile ∇²G(r), rotated through 360° to form the 2-D LoG, and its DFT — the LoG in space and frequency]
774
ZC Detection
• The last stage of edge detection is zero-crossing detection.
• Let J = [J(i, j)] be the result of LoG filtering.
• A ZC is a crossing of the zero level: the algorithm must
search for pixel occurrences of the form:
• A ZC is a crossing of the zero level: the algorithm must search for adjacent pixel patterns of the form (+, −), (−, +), (+, 0, −), or (−, 0, +).
• By convention, one sign or the other is marked as the edge location unless higher (sub-pixel) precision is needed.
775
Scale of LoG
• The larger the value of σ used, the greater the degree of smoothing by the low-pass pre-filter G.
• If σ is large, then noise will be greatly smoothed - but so will less significant edges.
• Noise sensitivity increases as σ decreases - but the LoG edge detector then detects more detail.
DEMO
776
Digital Implementation of LoG
• Use the sampled LoG:
∇²G(i, j) ∝ [1 − (i² + j²)/σ²] · exp[ −(i² + j²) / 2σ² ]
•
There are specific rules of thumb that should be followed:
• Enough of ∇²G(i, j) must be sampled. The LoG will not work unless the template contains both main and minor lobes. In practice, the radius R of the LoG (in space) should satisfy
R ≥ 4σ (in pixels)
•
Once a LoG template is computed, its coefficients must be slightly adjusted to sum
to zero (why?)
•
This is done by subtracting the (average) total coefficient sum from each.
•
The LoG will not work well unless σ ≥ 1 (pixel)!
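A sketch following these rules of thumb — sample the LoG form given above out to radius R = 4σ, adjust the coefficients to sum to zero, convolve, and mark sign changes as ZCs (a crude ZC test; the convention of marking the left/upper pixel of each sign change is an illustrative choice):

```python
import numpy as np
from scipy.ndimage import convolve

def log_kernel(sigma):
    # Sampled LoG template with radius R = 4*sigma; subtract the mean so
    # the coefficients sum to zero.
    R = int(np.ceil(4 * sigma))
    i, j = np.mgrid[-R:R + 1, -R:R + 1]
    r2 = i ** 2 + j ** 2
    k = (1 - r2 / sigma ** 2) * np.exp(-r2 / (2 * sigma ** 2))
    return k - k.mean()

def log_zero_crossings(I, sigma=2.0):
    # Convolve with the LoG, then mark pixels where the sign changes
    # horizontally or vertically.
    J = convolve(I.astype(np.float64), log_kernel(sigma))
    zc = np.zeros_like(J, dtype=bool)
    zc[:, :-1] |= np.sign(J[:, :-1]) != np.sign(J[:, 1:])
    zc[:-1, :] |= np.sign(J[:-1, :]) != np.sign(J[1:, :])
    return zc.astype(np.uint8)
```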
DEMO
777
Thresholding the ZCs
• Thresholding is not usually necessary if a sufficiently large operator (σ) is used.
• However, sometimes it is desired to both detect detail and not detect noise.
• This can be accomplished effectively by thresholding.
• Let J(i, j) be the LoG-filtered image and E(i, j) be the edge map.
• Then find the gradient magnitude |∇J(i, j)| (Roberts' form will suffice since J is smooth).
• If
|∇J(i, j)| > t = threshold
and
E(i, j) = 1 (a ZC exists at (i, j))
then keep the ZC. Otherwise delete it.
DEMO
778
Contour Thresholding
•
Problem: Simple thresholding may create broken ZC contours.
•
Approach: Compare all ZC pixels on a ZC contour to a threshold t. If enough
are above threshold, accept the contour, else reject it.
•
Suppose (i1, j1), (i2, j2), (i3, j3) ,..., (iL, jL) comprise an 8-connected ZC contour.
• Compute |∇J(i_n, j_n)| for n = 1 ,..., L.
• Let Q = # points such that |∇J(i_n, j_n)| > t.
•
If
Q/L > PERCENT then accept the entire ZC contour
Q/L ≤ PERCENT then reject the entire ZC contour
•
Typically, PERCENT > 0.75
DEMO
779
Advantages of the LoG
• Usually doesn't require thresholding.
• Yields single-pixel-width edges always (no thinning!).
• Yields connected edges always (no edge-linking!).
• Can be shown to be optimal (under some criteria).
• Appears to be very similar to what goes on in biological
vision.
780
Disadvantages of the LoG
• More computation than gradient edge
detectors.
• The ZC continuity property often leads to ZCs that meander across the image.
• ZC contours tend to be over-smooth near
corners.
781
Some Highly
Relevant Neuroscience
782
The Retina Close Up
Light
Light
From Gray’s Anatomy
Key neurons: rods, cones, horizontal cells, bipolar cells, amacrine cells, ganglion cells
783
Retinal Neurons
[Figure: cross-section of the retina — light passes the ganglion cells, amacrine cells, bipolar cells, horizontal cells, and granules before reaching the rod and cone photoreceptors]
The retina is about 0.025 cm thick, with about 100,000,000 photoreceptors.
Cones are photoreceptors operative in well-lit conditions (photopic vision); they respond to colors.
Rods are photoreceptors operative in low-light conditions (scotopic vision); monochromatic.
Granules: connectivity to the next layer.
Horizontal cells: Interconnect and spatially sum either rod or cone outputs (not both).
Bipolar cells: Connect horizontal cell outputs to ganglion cells with positive ('ON') or negative ('OFF') polarity.
Amacrine cells: Believed to control light adaptivity of photoreceptors and adjacent cells.
Ganglion cells: Spatially integrate the responses of photoreceptors - essentially via other cells.
784
Ganglion Cells
• Spatial responses to visual stimuli known as receptive fields.
• Sums responses of photoreceptors. Each receives input from
about 100 photoreceptors (with great variability).
• The receptive field response is center-surround excitatory-inhibitory, either on-center or off-center.
• Actually a form of spatial digital filtering.
• Output to the visual cortex via the optic nerve, chiasm, and
lateral geniculate nucleus (LGN).
785
Center-Surround Response
• Also called lateral inhibition.
[Figure: on-center and off-center center-surround receptive fields, with excitatory (+) and inhibitory (−) regions]
786
Response to the Hermann Grid
Non-foveal: Little photoreceptor response; perception is dominated by (larger) ganglion field responses. At the "intersections," excitation and lateral inhibition cancel, creating a small response, hence they appear dark.
Fovea: Ganglion receptive fields are very small. Perception is dominated by photoreceptors, i.e., direct luminance is perceived.
[Figure: Hermann grid illusion - find the black dot]
787
Similar Illusion with Color
788
Response to Mach Band
Mach Bands
789
Difference-of-Gaussian
Approximation
• Nobel laureates R. Granit and H.K. Hartline measured ganglion receptive field responses in cats.
• Well-approximated by a difference-of-gaussians (DOG):
DOG(i, j) = [1/(2πσ²)]·exp[−(i² + j²)/2σ²] − [1/(2π(kσ)²)]·exp[−(i² + j²)/2(kσ)²]
790
Plot of 1-D DOG
Note excitatory and inhibitory lobes
791
Plots of 2-D Spatial DOG
Note excitatory-inhibitory regions
792
Computed Responses of DOG
Filters on Hermann Grid
DOG
793
DOG Ganglion Cells
There is a diverse distribution of DOG receptive field sizes
across the retina….
794
What are DOG “Filters” for?
• Two main theories:
– Decorrelation of the input signal
– Edge enhancement/detection
795
Decorrelation
The response of each DOG is weakly correlated (or uncorrelated) with the other DOG responses, and captures unique information.
This leads to efficient representations, similar to how image compression algorithms work. Look at these "low entropy" responses.
[Figure: an image, a small DOG filterbank, and some of the resulting DOG responses]
796
Compressibility
• Each DOG response is highly compressible
797
LoG Closely Approximates
Ganglion Receptive Fields!
DOG
LoG
Overlaid
• If k ~ 1.6, the DOG and LoG are almost indistinguishable
−∇²G_σ(i, j) ≈ [1/(2πσ²)]·exp[−(i² + j²)/2σ²] − [1/(2π(kσ)²)]·exp[−(i² + j²)/2(kσ)²]
798
Improving on the LoG /
DOG
799
Canny’s Edge Detector
• Attempts to improve upon the LoG.
• Consider an ideal step-edge image:
[Figure: an ideal step edge, showing the derivative parallel (∥) to the edge and the derivative perpendicular (⊥) to the edge]
• Since ∇² is isotropic, it is equivalent to taking:
- A twice-derivative perpendicular to the edge. This conveys the edge information.
- A twice-derivative parallel to the edge. This conveys no edge information!
• In fact, if there is noise in the image, the parallel twice-derivative will give only bad information!
800
Canny’s Algorithm
• Follows the following basic steps:
(1) Form the Gaussian-smoothed image:
K(i, j) = G(i, j)*I(i, j)
(2) Compute the gradient magnitude and orientation:
|∇K(i, j)| and ∠∇K(i, j)
using discrete (differencing) approximations.
801
Canny’s Algorithm
• (3) Let n be the unit vector in the direction ∇K(i, j). Compute the twice-derivative of K(i, j) in the direction n:
∂²K(i, j)/∂n² = [K_xx(i, j)K_x²(i, j) + 2K_xy(i, j)K_x(i, j)K_y(i, j) + K_yy(i, j)K_y²(i, j)] / |∇K(i, j)|²
(4) Find the ZCs of this quantity in the image.
(5) These are identically the zero-crossings of ∇K·∇(∇K·∇K).
• Disadvantage: nonlinear, so edge continuity not guaranteed.
However, contour thresholding can improve performance.
DEMO
802
Comments
• We will continue studying image analysis
… onward to Module 8.
803
Module 8
Image Analysis II
• Superpixels
• Hough Transform
• Finding Objects and Faces
QUICK INDEX
804
SUPERPIXELS
805
Simple Linear Iterative Clustering
(“SLIC”)
• A simple way of image segmentation.
• Not into objects, which is very hard, but instead into
“superpixels,” which are meaningful atomic regions.
• More than pixels, less than objects or whole regions.
• SLIC is particularly effective. Really a clustering method.
• Most effective when color is used.
806
SLIC Initialization Steps
807
SLIC Distances
808
SLIC Iteration
• Initialize: at each pixel (i, j) set l(i, j) = -1 and d(i, j) = infinity
• Iterate: For every cluster center Ck, do the following:
– For each pixel (i, j) associated with Ck,
– Compute the distance D between Ck and (i, j)
– If D < d(i, j), then set d(i, j) = D and set l(i, j) = k.
•
Compute a new set of cluster centers (e.g., center of mass of each cluster)
•
Form an error between the old cluster centers and the new ones. If below
threshold, STOP.
•
Or, can just iterate T times. Typically T = 10. In examples T = 10.
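A simplified grayscale sketch of the iteration above; the combined distance D = sqrt(d_intensity² + (d_spatial/S)²·m²) with compactness weight m is an assumption standing in for the color (CIELAB) distance used in full SLIC:

```python
import numpy as np

def slic_gray(I, k=100, m=10.0, T=10):
    # Simplified grayscale SLIC: grid-initialized centers, local assignment,
    # then center updates, repeated T times.
    I = I.astype(np.float64)
    H, W = I.shape
    S = int(np.sqrt(H * W / k))                     # grid interval between seeds

    ys = np.arange(S // 2, H, S)
    xs = np.arange(S // 2, W, S)
    centers = np.array([[y, x, I[y, x]] for y in ys for x in xs])  # (y, x, intensity)

    labels = -np.ones((H, W), dtype=int)
    for _ in range(T):
        dist = np.full((H, W), np.inf)
        for idx, (cy, cx, cI) in enumerate(centers):
            # Only consider pixels in a 2S x 2S window around the center.
            y0, y1 = int(max(cy - S, 0)), int(min(cy + S + 1, H))
            x0, x1 = int(max(cx - S, 0)), int(min(cx + S + 1, W))
            yy, xx = np.mgrid[y0:y1, x0:x1]
            d_c = I[y0:y1, x0:x1] - cI                        # intensity distance
            d_s = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)    # spatial distance
            D = np.sqrt(d_c ** 2 + (d_s / S) ** 2 * m ** 2)
            better = D < dist[y0:y1, x0:x1]
            dist[y0:y1, x0:x1][better] = D[better]
            labels[y0:y1, x0:x1][better] = idx
        # Recompute each cluster center as the mean of its assigned pixels.
        for idx in range(len(centers)):
            ys_k, xs_k = np.nonzero(labels == idx)
            if len(ys_k) > 0:
                centers[idx] = [ys_k.mean(), xs_k.mean(), I[ys_k, xs_k].mean()]
    return labels
```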
809
SLIC Examples
k = 64, 256, 1024
810
SLIC Examples
Video
811
SLIC Comments
•
Really a simple variation of k-means clustering in a local application.
•
A very popular method!
•
It is used to create intermediate features in a wide variety of computer vision
problem solutions.
•
As compared to the low level features we will shortly encounter (edges, SIFT
keypoints, local binary patterns, etc).
•
It is an ad hoc method (not based on a theory or perceptual principles), yet still
effective.
812
HOUGH TRANSFORM
813
Hough Transform:
Line and Curve Detection
• The Hough Transform is a simple,
generalizable tool for finding instances of
curves of a specific shape in a binary edge
map.
• What it does:
edge map
"Circle" Hough
transform result
814
Advantages of Hough Transform
• It is highly noise-insensitive - it can "pick" the
shapes from among many spurious edges. Any edge
detector gives rise to "pseudo-edges" in practice.
• It is able to reconstruct "partial" curves containing
gaps and breaks to the "ideal" form.
• It can be generalized to almost any desired shape.
815
Disadvantages of Hough Transform
• It is computation- and memory-intensive.
816
Basic Hough Transform
• Assume that it is desired to find the locations of curves that
- Can be expressed as functions of (i, j)
- Have a set (vector) of parameters
a = [a1 ,..., an]T
that specify the exact size, shape, and location of the
curves.
• Thus, curves of the form
f(i, j; a) = 0
817
Curves With Parameters
• 2-D lines have a slope-intercept form
f(i, j; a) = j - mi - b = 0
where a = (m, b) = (slope, j-intercept)
• 2-D circles have the form
f(i, j; a) = (i − i₀)² + (j − j₀)² − r² = 0
where a = (i0, j0, r) = (center coordinates, radius)
818
Basic Hough Transform
• Input to Hough Transform: An edge map (or
other representation of image contours).
• From this edge map, a record (accumulator) is
made of the possible instances of the shape in the
image, and their likelihoods, by counting the
number of edge points that fall on the shape
contour.
819
Hough Example
• Straight line / line segment detection
The "most likely" instance of
a straight line segment in
this region.
"Blow-up" of an
edge map
• The line indicated is the "most likely" because there are
seven pixels that contribute evidence for its existence.
• There are other "less likely" segments (containing 2 or 3
edge pixels).
820
Hough Accumulator
• The Hough accumulator A is a matrix that accumulates
evidence of instances of the curve of interest via counting.
• The Hough accumulator A is n-dimensional, where n is
the number of parameters in a:
a = [a1 ,..., an]T
• Each parameter ai, i = 1 ,..., n can take only a finite number
of values Ni (the representation is digital).
• Thus the accumulator A is an
N₁ × N₂ × ··· × N_{n−1} × N_n
matrix containing N₁·N₂·N₃ ··· N_{n−1}·N_n slots.
821
Size of Hough Accumulator
• The accumulator A becomes very large if:
- Many parameters are used
- Parameters are allowed to take many values
• The accumulator can be much larger than the image!
• It is practical to implement the accumulator A as a
single vector (concatenated rows): otherwise many
matrix entries may always be empty (if they're
"impossible"), thus taking valuable space.
822
823
Accumulator Design
• The design of the Hough accumulator A is
critical to keep its dimensions and size
manageable.
• Creating a manageable Hough Accumulator
is an art.
• The following are general steps to follow.
824
Accumulator Design
STEP ONE - Use appropriate curve equations. For
example, in line detection, the slope-intercept
version is poor (nearly vertical lines have large
slopes).
• A better line representation: polar form
f(i, j; a) = i·cos(θ) + j·sin(θ) − r = 0
where a = (r, θ) = (distance, angle)
[Figure: a line parameterized by its perpendicular distance r from the origin and its angle θ]
825
Accumulator Design
STEP TWO - Bound the parameter space. Do a little
math. Only allow parameters for curves that sufficiently
intersect the image.
• What is "allowable" depends on the application - perhaps
(for example) circles
- Must lie completely inside the image
- Must be of some min and max radius
Interesting lines solid
Irrelevant lines dashed
Interesting circles solid
Irrelevant circles dashed
826
Speaking of Circles
Which of these arcs is a piece of the largest circle?
827
Example of Accumulator Design
•
In an N x N image indexed 0 ≤ i, j ≤ N-1, detect circles of the form
f(i, j; a) = (i − i₀)² + (j − j₀)² − r² = 0
having min radii 3 (pixels) and max radii 10 (pixels), where the circles
are contained by the image.
•
This could be implemented, for example, in a bus’s coin counter.
•
Clearly:
3 ≤ r ≤ 10
and for each r :
r ≤ i0 ≤ (N-1) - r
r ≤ j0 ≤ (N-1) - r
or part of some of the circles will extend outside of the image.
•
Note: This rectangular accumulator array leads to unused array entries.
828
Example of Accumulator Design
• In an N x N image indexed 0 ≤ i, j ≤ N-1, detect lines of the form
f(i, j; a) = i·cos(θ) + j·sin(θ) − r = 0
that intersect the image.
• Thus each line must intersect the four sides of the image in two places.
• Either the i-intercept or the j-intercept must fall in the range 0 ,..., N-1:
i-intercept = r / cos(θ)
j-intercept = r / sin(θ)
• So we can bound as follows: either
0 ≤ r / cos(θ) ≤ N-1
or
0 ≤ r / sin(θ) ≤ N-1.
829
Accumulator Design
STEP THREE - Quantize the accumulator space. Curves are digital so the
accumulator is a finite array.
•
Decide how "finely-tuned" the detector is to be.
•
For circles, the choice may be easy - let circle centers (i0, j0) and circle radii be
integer values only, for example.
• For lines, the selection can be more complex - since digital lines aren't always "straight":*
[Figure: digital approximations of lines at 0°, 27°, 45°, 63°, 90°, 117°, 135°, 153°, etc.]
*Even drawing straight lines digitally requires an approximation algorithm such as the one by Bresenham or the one by Wu. Likewise, no drawn digital circle is really a circle, but rather an approximation. (These are links to Wikipedia.)
830
Example of Accumulator Design
• In an N x N image, detect circles of the form
f(i, j; a) = (i − i₀)² + (j − j₀)² − r² = 0
• Radius and center constraints:
3 ≤ r ≤ 10 and r ≤ i₀ ≤ (N−1) − r and r ≤ j₀ ≤ (N−1) − r
• Assume radii and circle centers are integers:
r ∈ {3, ..., 10}
and for each r the circles are completely contained within the image:
r ≤ i₀ ≤ (N−1) − r and r ≤ j₀ ≤ (N−1) − r
• Then the accumulator contains the following number of slots:
# circle types = Σ_{r=3}^{10} (N − 2r)² = 8N² − 208N + 1520 ≈ 8N²
•
This is about 8 times the image size!
•
Circle detection algorithms often assume just a few radii, e.g., in the
coin counter example, just a few coin sizes.
831
Accumulator Design
STEP FOUR - (Application)
(1) Initialize the accumulator A to 0.
(2) At each edge coordinate (i, j), [E(i, j) = 1] increment A:
- For all accumulator elements a such that (e small)
|f(i, j; a)| ≤ e
set
A(a) = A(a) + 1
until all edge points have been examined.
(3) Threshold the accumulator. Those parameters a such that
A(a) ≥ t
represent instances of the curve being detected.
•
Since local regions of the accumulator may fall above threshold,
detection of local maxima is usually done.
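A minimal circle-Hough sketch of Step Four for a few integer radii; instead of testing |f(i, j; a)| ≤ e over every accumulator cell, it casts the equivalent votes by sweeping candidate centers at distance r from each edge pixel:

```python
import numpy as np

def hough_circles(E, radii, t=20):
    # E is a binary edge map; A(r, i0, j0) accumulates evidence for circles.
    H, W = E.shape
    A = np.zeros((len(radii), H, W), dtype=np.int32)
    thetas = np.linspace(0, 2 * np.pi, 100, endpoint=False)
    edge_pts = np.argwhere(E > 0)
    for ri, r in enumerate(radii):
        for (i, j) in edge_pts:
            # Candidate centers lying at distance r from the edge pixel (i, j).
            i0 = np.round(i - r * np.cos(thetas)).astype(int)
            j0 = np.round(j - r * np.sin(thetas)).astype(int)
            ok = (i0 >= 0) & (i0 < H) & (j0 >= 0) & (j0 < W)
            np.add.at(A[ri], (i0[ok], j0[ok]), 1)
    # Threshold the accumulator: detected circles are (r, i0, j0) triples.
    detections = [(radii[ri], i0, j0) for ri, i0, j0 in np.argwhere(A >= t)]
    return A, detections
```

As noted above, a practical implementation would also keep only local maxima of A rather than every above-threshold cell.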
832
Comments on the Hough
Transform
• Substantial memory required even for simple
problems, such as line detection.
• Computationally intensive; all accumulator
points are examined (potentially incremented) as
each image pixel is examined.
• Refinements of the Hough Transform are able to
deal with these problems to some degree.
DEMO
833
Refining the Hough Transform
•
The best way to reduce both computation and memory requirement is to reduce the size
of the Hough array. Here is a (very) general modified approach that proceeds in stages:
(0) Coarsely quantize the accumulator. Define error threshold e0 and variable
thresholds ti , ei.
(1) Increment the Hough array wherever (where em > e0)
| f(i, j; a) | ≤ em
(2) Threshold the Hough array using a threshold tm.
(3) Redefine the Hough array by
- Only allowing values similar to those ≥ tm
- Re-quantize the Hough array more finely over these values
(4) Unless em < e0, set em+1 < em, tm+1 < tm and go to (1)
•
The exact approach taken in such a strategy will vary with the application and available
resources, but generally the Hough array can be made much smaller (hence faster)
834
“My students are my products!”
- Paul Hough
835
VISUAL SEARCH
836
Where’s The Panda?
Where’s the Panda?
837
Where He Got the Idea
Again … where’s the Panda?
838
Luggage Nightmare
839
Find The Faces
(there are 11 of them!)
840
Find The Faces and the Animals
(8 hidden faces and 8 hidden animals)
Currier and Ives advertisement from year 1872
841
find the face
842
Painted by Giuseppe
Arcimboldo in year 1590
find the face
843
find the …
844
Find the Snow Leopard
845
Maybe this Fellow is Easier
846
TEMPLATE MATCHING
847
TEMPLATE MATCHING
• Find instances of a sub-image, object,
written character, etc.
• Template matching uses special windows.
• A template is a sub-image:
Template image of ‘P’
848
Template
• Associate with a template T a window B_T:
T = {T(m, n); (m, n) ∈ B_T}.
• As before, the windowed set at (i, j) is:
B_T I(i, j) = {I(i+m, j+n); (m, n) ∈ B_T}
• Goal: Measure the goodness-of-match of template T with image patches B_T I(i, j) for all (i, j)
849
Template Matching
• Start with mis-match (MSE) between BTI(i, j) and T:
MSE{BT I(i, j), T} 

 
I(i+m, j+n)-T(m, n)
2
(m,n)  B T
   I(i+m, j+n) -2   I(i+m, j+n)T(m, n)+    T(m, n)
(m,n)  BT
 BT
(m,n)  B T
 (m,n)




2
local image energy
2
cross-correlation of I and T
template energy (constant)
= E BT T(i, j)  2CBT T(i, j), T  E T
• MSE is small when match CBT T(i, j), T is large.
850
Normalized Cross-Correlation
• Upper bound:
C_{B_T}[(i, j), T] = Σ_{(m,n)∈B_T} I(i+m, j+n)·T(m, n)
≤ √[ Σ_{(m,n)∈B_T} I²(i+m, j+n) · Σ_{(m,n)∈B_T} T²(m, n) ] = √[ E_{B_T I}(i, j) · E_T ]
with equality if and only if
I(i+m, j+n) = K·T(m, n) for all (m, n) ∈ B_T
From the Cauchy-Schwartz Inequality:
[ Σ_{(m,n)} A(m, n)B(m, n) ]² ≤ [ Σ_{(m,n)} A²(m, n) ][ Σ_{(m,n)} B²(m, n) ]
851
Normalized Cross-Correlation
• Let ĈBT T (i, j), T =
CBT T (i, j), T
E BT T (i, j)  E T
ˆ
hence 0  C
B T T (i, j), T  1 for every (i, j).
• Normalized cross-correlation image:
J = CORR[I, T, BT]
J(i, j) = ĈBT T(i, j), T for 0  i  N-1, 0  j  M-1
852
Normalized Cross-Correlation
[Figure: an image, a template patch, and the resulting normalized cross-correlation map, which peaks at the value 1 where the patch matches]
853
Thresholding
• Find the largest value (best match):
J = CORR[I, T, BT]
• Alternately, threshold: K(i, j) = 1 if J(i, j) ≥ t
• If t = 1, only perfect matches will be found.
• Usually t is close to but less than one.
• Finding t usually a trial-and-error exercise.
DEMO
854
Limitations of Template Matching
• Template matching is quite sensitive to object
variations.
• Noise, rotation, scaling, stretching, occlusion … and
many other changes confound the matching.
855
SIFT
(Scale Invariant Feature Transform)
856
A SIFT Primer
• SIFT (Scale Invariant Feature Transform) is a
powerful and popular method of visual search /
object matching.*
• It can find objects that have been rotated and scaled.
• It has limited ability to match objects that have been deformed.
• We will outline the steps involved and the features
that are used in SIFT.
*The brainchild of David Lowe, U. British Columbia
857
Basic Idea
Pre-computed
New unknown
object.
Compute SIFT
features (keypoints).
Identified object
(if any)
Feature
matching
Database of known
object images.
Set of SIFT features
(keypoints) for each
object.
The number of “keypoint” SIFT features may be
quite large.* They are related to edges.
*Perhaps 2000 for a 512x512 image.
858
SIFT Features
• Keypoint candidates are the maxima of DoG-filtered responses. Given a gaussian at scale σ:
G_σ(i, j) = [1/(2πσ²)] · exp[ −(i² + j²) / 2σ² ]
• The DoG response to image I is
D_σ(i, j) = [G_kσ(i, j) − G_σ(i, j)] * I(i, j)
where k = 2^(1/s) (often s = 2).
Note: G_kσ(i, j) − G_σ(i, j) ≈ (k−1)σ²∇²G_σ(i, j)
859
DoGs Computed Over Scales
[Figure: the image I is filtered by Gaussians G_σ, G_kσ, ..., G_{k^P σ}; adjacent Gaussian-filtered images are subtracted to form the DoG-filtered versions of I]
860
Extrema Detection
A DoG-filtered sample is an extremum if it is the largest or smallest of its 26 surrounding pixels in scale-space (its 8 spatial neighbors at the same scale plus 9 neighbors at each adjacent scale).
[Figure: close-up of DoG responses at adjacent scales]
All extrema are found over all DoG scales.
861
Extrema Processing
• Each “extremum” keypoint is then carefully examined to see whether:
- it is actually located near/on an edge, and
- if so, whether the edge has strong enough contrast (based on a local interpolation); otherwise it is rejected.
• Remaining extrema are assigned an orientation that is
the gradient orientation of the DoG response.
• These are the final keypoints.
862
SIFT Features at Keypoints
• All keypoints for an image in the database are stored and associated with a location, scale, and orientation.
[Figure: depiction of histograms of gradients from local patches around each keypoint; the length of each arrow is the (distance-weighted) sum of gradient magnitudes near that direction]
• At the location and scale of each keypoint, the gradient magnitudes and
orientations (relative to the keypoint orientation) are found at all points in a
patch around the keypoint.
• These are formed into descriptors that vectorize the histograms of the gradient
magnitudes/orientations from sub-patches of the larger patch.
• Lastly, these vectors are made unit-length to reduce the effects of illumination.
863
Matching
• When a new image is to be matched, the SIFT descriptors are
computed from it.
• The database consists of a set of images, possibly large, each with
associated SIFT descriptors.
• Matching the image descriptors to the database requires a
process of search which can be done in many possible ways. Lowe
uses nearest-neighbor search (Euclidean distance).
• Lowe also uses a Hough Transform – like method to cluster and
count keypoint descriptors. This matching process also uses
keypoint location, orientation, and scale relative to those found in
the database.
• For details, see Lowe’s paper: link.
864
Lowe’s Examples
[Figure: an image, the objects being searched for, and the recognition results — outer boundaries show matches; keypoints are indicated by the centers of squares (sized by scale)]
865
Example
[Figure: an object in the database (keypoints marked with +'s) found in a new image]
The object found has undergone rotation, shift, illumination, and scale change. It has also undergone an affine change in perspective.
866
Video Examples
[Videos: a magazine being moved through space; the view from a moving vehicle]
What is shown: SIFT feature tracking on two video scenes, and a modified “Affine SIFT.”
867
Comments on SIFT
• SIFT utilizes both fundamental and ad hoc ideas. Yet it works very well within its domain (finding rigid objects).
• It has strong invariance to:
• Rotations
• Translations
• Scale changes
• Illumination changes
• It has reasonable invariance to:
• Occlusions
• Affine transformations
• It has weak invariance to:
• Objects that deform or change (like faces)
• You can get a SIFT demo program at David Lowe’s website: http://www.cs.ubc.ca/~lowe/keypoints/
868
Speaking of Orientation….
869
Bouba and Kiki
Class Poll
870
General-Purpose Feature Extractors
• An extraordinary number of feature types have been
proposed for image analysis. Often used to find
“keypoints,” like SIFT.
• Used in a great variety of visual tasks: image/object
detection, recognition, matching, tracking, classification
and many more.
• Generally they are configured to extract local properties
such as edges, corners, texture, change, interestingness…
• Usually the features are illumination-invariant
(difference-based).
871
General-Purpose Feature Extractors
• Most are rather general-purpose, and are often a bit ad hoc with some underlying sensibility. SIFT falls in this category.
• Following are some other features, explained in their simplest forms. All can and have been adapted to supply scale-invariance, rotation-invariance, and other ways of generalizing them.
• This is the classical image analysis paradigm:
– Find highly descriptive features in the image
– Invariant to some of scale, orientation, illumination, illumination gradients, affine transformations (object changing pose in 3D space), etc.
– Feed the features to a classifier or regressor to train it to conduct a task using these features
• Application: extract the same features on images, using the trained regressor/classifier to conduct the same task
• Still dominant in practice, but deep learning is changing this.
872
Harris Features
873
Harris Features
• Let
I_x(i, j) = I(i, j) * ∂_x(i, j)
I_y(i, j) = I(i, j) * ∂_y(i, j)
where ∂_x and ∂_y are directional difference operators, like Sobel.
• Let G(i, j) be a Gaussian. Define the smoothed derivatives:
A(i, j) = [I_x(i, j)]² * G(i, j)
B(i, j) = [I_y(i, j)]² * G(i, j)
C(i, j) = [I_x(i, j) I_y(i, j)] * G(i, j)
and form the matrix
M = [ A  C ]
    [ C  B ]
• Harris operator (a small code sketch follows this slide):
det(M) − k·[Trace(M)]² = (AB − C²) − k·(A + B)²
874
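As an illustration only (not part of the original notes), here is a minimal sketch of the Harris response using SciPy. The Sobel choice for the difference operators, the Gaussian width, and k = 0.05 are assumptions; keypoints would then be taken at local maxima of the response.

import numpy as np
from scipy import ndimage

def harris_response(I, k=0.05, sigma=1.5):
    # Directional differences (here Sobel), then Gaussian-smoothed products
    Ix = ndimage.sobel(I.astype(float), axis=1, mode='reflect')
    Iy = ndimage.sobel(I.astype(float), axis=0, mode='reflect')
    A = ndimage.gaussian_filter(Ix * Ix, sigma)
    B = ndimage.gaussian_filter(Iy * Iy, sigma)
    C = ndimage.gaussian_filter(Ix * Iy, sigma)
    # det(M) - k*Trace(M)^2 = (AB - C^2) - k*(A + B)^2
    return (A * B - C * C) - k * (A + B) ** 2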
Harris Features
Local maxima result
875
SURF Features
(Speeded-Up Robust Features)
876
SURF Features
• Hessian matrix
H = [ I_xx(i, j)  I_xy(i, j) ]
    [ I_xy(i, j)  I_yy(i, j) ]
where
I_ab(i, j) = I(i, j) * G_ab(i, j),   G_ab(i, j) = ∂²G/∂a∂b (i, j)
• Then the SURF feature is the determinant (the authors use a “box”
approximation to the Gaussian for speed; a sketch follows this slide):
det(H) = I_xx(i, j) I_yy(i, j) − [I_xy(i, j)]²
• Many variations (‘BRIEF,’ ‘BRISK,’ ‘FREAK’, etc.)
877
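For illustration, here is a small sketch that computes the Hessian determinant with exact Gaussian second derivatives rather than the box filters used by SURF; the scale sigma = 2.0 is an assumption.

import numpy as np
from scipy import ndimage

def hessian_determinant(I, sigma=2.0):
    # Gaussian second-derivative responses (SURF replaces these with box filters for speed)
    Ixx = ndimage.gaussian_filter(I.astype(float), sigma, order=(0, 2))   # d^2/dx^2 (columns)
    Iyy = ndimage.gaussian_filter(I.astype(float), sigma, order=(2, 0))   # d^2/dy^2 (rows)
    Ixy = ndimage.gaussian_filter(I.astype(float), sigma, order=(1, 1))
    return Ixx * Iyy - Ixy ** 2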
SURF Features
Matching result
878
LBP
(Local Binary Patterns)
879
LBP Features
• An efficient approach. Break an image into 16x16 blocks.
• For every pixel in the block, within its 3x3 neighborhood:
– Going clockwise from the top left, compare the center pixel with its 8 neighbors. If the center is
greater, code ‘0’; if not greater, code ‘1.’
– This creates an 8-bit code at every pixel.
– The transform of the block into the code image is called a Census Transform.
– It can be modified into a 9-bit code by comparing each neighborhood pixel with the
neighborhood mean. This is then called the Modified Census Transform (MCT).
– Going the further step of forming the 256-bin histogram of 8-bit codes for each 16x16 block
gives the LBP descriptor (see the sketch after this slide).
880
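A minimal sketch of the block-wise LBP descriptor described above; the clockwise-from-top-left bit ordering is an assumption, and border pixels of the block are skipped for simplicity.

import numpy as np

def lbp_histogram(block):
    # block: a 16x16 (or larger) grayscale block; returns the 256-bin histogram of 8-bit codes
    h, w = block.shape
    center = block[1:h-1, 1:w-1]
    # neighbor offsets, clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for bit, (di, dj) in enumerate(offsets):
        neighbor = block[1 + di:h - 1 + di, 1 + dj:w - 1 + dj]
        codes |= (neighbor >= center).astype(np.uint8) << bit   # '1' if the center is not greater
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist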
LBP Features
(Figure: the original image and its decimal LBP code image.)
881
Case Study: Face Detection
882
Detect these Faces? Or these Faces?
883
Face Detection
884
Why is this Photograph Important?
Taken by Robert Cornelius in 1839.
It is the first “Selfie”!
885
Did You Detect the Face?
886
Face Detection Using the Viola-Jones Method
• A very fast and accurate way to detect faces.
• Now a classical approach used in many digital cameras,
smart-phones, and so on.
• The “squares around the faces” that casual users see in
their viewfinders are likely some version of the V-J model.
• Basic Idea: Use simple, super-cheap features iteratively
evaluated on small image windows using a “boosted”
classifier.
887
Integral Image
• Given image I(i, j), define the integral image
II(i, j) = Σ_{m=0}^{i} Σ_{n=0}^{j} I(m, n)
• It is just the sum of all pixels within the rectangle with
upper left corner (0, 0) and lower right corner (i, j).
• Exercise: Show the simple one-pass recursion:
S(i, j) = S(i, j-1) + I(i, j)
II(i, j) = II(i-1, j) + S(i, j)
with S(i, -1) = 0 and II(-1, j) = 0.
888
Rectangular Sum Property
• Given a pre-computed integral image II(i, j) of image I(i, j).
• The pixel sum of I(i, j) over any rectangle D can then be
found by 4 table look-ups:
(Diagram: the rectangle with lower right corner (i, j) is partitioned into regions A, B, C, and D,
with D the target rectangle at the lower right.)
• SUM(D) = SUM(ABCD) – SUM(AB) – SUM(AC) + SUM(A)
(A small code sketch follows this slide.)
889
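A minimal sketch of the integral image recursion and the four-look-up rectangle sum, written with NumPy cumulative sums for brevity rather than the explicit one-pass recursion.

import numpy as np

def integral_image(I):
    # II(i, j) = sum of I over the rectangle with corners (0, 0) and (i, j)
    return np.cumsum(np.cumsum(I.astype(np.int64), axis=0), axis=1)

def rect_sum(II, top, left, bottom, right):
    # Sum of I over rows top..bottom and columns left..right using 4 table look-ups
    total = II[bottom, right]
    if top > 0:
        total -= II[top - 1, right]
    if left > 0:
        total -= II[bottom, left - 1]
    if top > 0 and left > 0:
        total += II[top - 1, left - 1]
    return total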
Viola-Jones Features
• The original Viola-Jones concept computes four kinds of
spatial differences of adjacent rectangle sums as features:
(Diagram: two-, three-, and four-rectangle features built from adjacent + and − regions;
each feature value is the difference of the rectangle sums.)
• All sizes (scales), aspect ratios, and positions of each of these
features are computed in 24x24 sub-windows on an image.
• This gives about 180,000 features (highly over-complete: only 24² = 576
basis functions are needed). A lot of training, but the end algorithm is
very fast!
890
Basic Idea: Boosting
• A “cascade” of T very simple, weak, binary classifiers:
(Diagram: a cascade of stages 1, 2, 3, …, T. All possible sub-windows enter stage 1;
at each stage, sub-windows labeled True pass to the next stage and those labeled False
are rejected; survivors of all T stages are combined.)
• First stage considers all possible sub-windows. Many are
eliminated at this stage
• Each stage considers all remaining sub-windows from most
recent stage, rejecting most (but fewer with later stages).
• Many fewer sub-windows considered later, but with greater
computation
• A sub-window must survive all stages.
891
Weak Classifier
• For each of the many features j at each stage t, train a simple
single-feature classifier ht,j, t = 1, …., T.
• Each weak classifier h_{t,j} gives a binary decision by finding
the optimal threshold θ_{t,j} so that the minimum number of 24x24
training images are misclassified (face / no face).
• Each classifier has the simple form:
h_{t,j}(I_i) = 1 if f_j(I_i) > θ_{t,j}, and 0 otherwise,
where f_j is the computed rectangle-based feature indexed by j.
892
Error Function
• Goal: Minimize the error function
ε_{t,j} = Σ_i w_{t,i} |h_{t,j}(I_i) − F_i|
over all N (24x24) training images I_i (many of these), where
F_i = 1 if I_i contains a face, and 0 otherwise.
• Weights w_{t,i} sum to 1 at each stage t, but change/adapt over
time/stage. This is called adaptive boosting (AdaBoost).
• At each stage t, choose the one classifier among {h_{t,j}} having
minimum ε_{t,j} = ε_t. Call this h_t; it is the only one used!
893
Weighting Functions
• Weights (penalties) are varied over t, otherwise nothing changes
from stage to stage. In the Viola-Jones model, a typical adaptation is
used. Assume N = m + n training images, where m = # negatives and n = # positives.
• First stage:
w_{1,i} = 1/(2m) if F_i = 0 (no face);  1/(2n) if F_i = 1 (face)
• Normalize at each stage:
w_{t,i} ← w_{t,i} / Σ_{j=1}^{N} w_{t,j}
• Update equation:
w_{t+1,i} = w_{t,i} [ε_t / (1 − ε_t)]^{1 − e_i}
where e_i = 0 if I_i is classified correctly, and 1 otherwise.
• Ad hoc, but the reasoning is sound.
894
Final “Strong” Classifier
• Use at most T features, one from each stage. The final strong classifier is
h(I_i) = 1 if Σ_{t=1}^{T} α_t h_t(I_i) > (1/2) Σ_{t=1}^{T} α_t, and 0 otherwise,
where
α_t = log[(1 − ε_t) / ε_t]
(A minimal sketch of this AdaBoost training loop follows this slide.)
• Many realizations and variations exist. V-J first trained their
model on about 5000 hand-labeled 24x24 images with faces
(positives) and about 9500 non-face images (negatives).
• At each stage a new set of 24x24 non-face sub-windows was
selected automatically from the roughly 350,000,000 sub-windows contained in
the approximately 9,500 non-face images.
895
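A minimal, illustrative sketch of the AdaBoost loop with single-feature stumps, following the weight initialization, normalization, update, and strong-classifier rule on the preceding slides. It assumes the rectangle features are precomputed into a matrix F, uses a slow exhaustive threshold search for clarity, does not handle the degenerate case ε_t = 0, and omits the cascade structure entirely.

import numpy as np

def adaboost_train(F, y, T):
    # F: (N, J) matrix of precomputed rectangle features, y: labels (1 = face, 0 = non-face)
    N, J = F.shape
    m, n = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * n))     # first-stage weights
    strong = []                                            # (feature index j, theta, alpha)
    for t in range(T):
        w = w / w.sum()                                    # normalize at each stage
        best = None
        for j in range(J):                                 # pick the best single-feature stump
            for theta in np.unique(F[:, j]):
                h = (F[:, j] > theta).astype(int)
                err = np.sum(w * np.abs(h - y))
                if best is None or err < best[0]:
                    best = (err, j, theta, h)
        err, j, theta, h = best
        e = np.abs(h - y)                                  # 0 if classified correctly, 1 otherwise
        w = w * (err / (1.0 - err)) ** (1 - e)             # shrink weights of correct samples
        strong.append((j, theta, np.log((1.0 - err) / err)))
    return strong

def adaboost_predict(strong, f):
    # Strong classifier: weighted vote of the chosen stumps vs. half the total weight
    total = sum(alpha for _, _, alpha in strong)
    votes = sum(alpha * int(f[j] > theta) for j, theta, alpha in strong)
    return int(votes > 0.5 * total)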
Examples of V-J Labeled Faces
896
Viola-Jones Face Detector
• The exemplar V-J face detector use(d) 38 stages (T=38)
• When demonstrated on an old 700 MHz Pentium on images of
size 384x288, each image was classified in about 0.067 seconds.
• About half of video frame rate (15 frames/second).
• Easy to do in real-time today. Look at your smartphone / DSLR!
• The Viola-Jones Face Detector was the first really large-scale
success of machine learning in the image analysis field.
897
Examples
• First two features selected
by their cascade:
From their paper
DEMO
Someone else’s version
898
Flashed Face Illusion
Nobody understands this remarkable
and disturbing illusion …
899
Comments
• The V-J face detector was perhaps the first
really large-scale “success” of machine
learning in the image analysis field.
• We will continue studying image analysis
… onward to Module 9.
900
Module 9
Image Analysis III
• Models of Visual Cortex
• Oriented Pattern Analysis
• Iris Recognition
• Range Finding by Stereo Imaging
• Deep Stereopsis
QUICK INDEX
901
Let’s revisit the visual brain…
902
Primary Visual Cortex
• Also called Area V1 or striate cortex. Much goes on
here.
• Visual signals from the ganglion cells of the eye are
further decomposed into orientation- and scale-tuned
spatial and temporal channels.
• These are passed on to other areas which do motion
analysis, stereopsis, and object recognition.
903
Primary visual cortex (from below).
From D. Hubel, Eye and Brain.
904
Types of Cortical Neurons
• In the early 1960s, Nobel laureates
D. Hubel and T. Wiesel performed
visual experiments on cats.
• Inserted electrodes into visual cortex, presented patterns to
animals’ eyes, measured neuronal responses.
• Found two general types of neurons termed simple cells and
complex cells.
• Defined by their spatial responses to images.
905
Simple and Complex Cells
• Simple cells are well-modeled as linear.
We can model their outputs by linear
convolution with the image signal.
• Complex cells receive signals from the
simple cells. Their responses are nonlinear.
906
Stages on the visual
pathway from retina
to cortical cells.
907
Simple Cell Responses
• Simple cells respond to (at least) three signal
aspects:
• Spatial pattern orientation, frequency and
location
• Spatial binocular disparity
• Spatio-temporal pattern motion
908
Simple Cell Spatial Response
• Simple cell responses
have excitatory and
inhibitory regions –
multiple lobes.
• Appear to match spatial
edges (shown) and
bars.
Red: excitatory
Blue: Inhibitory
909
Simple Cell Distribution
• The simple cells’ spatial responses correspond to a wide
range of orientations, lobe separations, and sizes.
“Bar-sensitive” simple cells
“Edge-sensitive” simple cells
• Seemingly “spatially tuned” to detect edges and bars, thus
representing images that way. They are well modeled as ...
910
2-D Gabor Function Model
• Gabor functions are Gaussian functions that modulate
(multiply) sinusoids. λ controls elongation.
g_c(x, y) = [1/(2πλσ²)] exp{−[(x/λ)² + y²]/(2σ²)} cos[2π(u0 x/N + v0 y/M)]
g_s(x, y) = [1/(2πλσ²)] exp{−[(x/λ)² + y²]/(2σ²)} sin[2π(u0 x/N + v0 y/M)]
• “Cosine” and “sine” Gabors are in 90-deg quadrature and
model the bar and edge simple cells, respectively.
911
Phase Quadrature
“Edge-sensitive”
“Bar-sensitive”
• The simple cells appear to occur in phase-quadrature pairs – 90 deg out of phase
912
Quadrature Gabor Pair
“Bar-sensitive” cosine Gabor
“Edge-sensitive” sine Gabor
• The simple cells appear with widely variable orientations,
elongations, and sizes.
913
Phasor Form of Gabor
• Easy to manipulate:
g(x, y) = g_c(x, y) + √−1 g_s(x, y)
        = [1/(2πλσ²)] exp{−[(x/λ)² + y²]/(2σ²)} exp[2π√−1 (u0 x/N + v0 y/M)]
• Natural representation of simple cell pairs.
• Fourier transform is simply a shifted Gaussian:
G(u, v) = exp{−2π²σ²[λ²(u − u0)² + (v − v0)²]}
914
Digital Form
g(i, j) = g_c(i, j) + √−1 g_s(i, j)
        = [1/(2πλσ²)] exp{−[(i/λ)² + j²]/(2σ²)} exp[2π√−1 (u0 i/N + v0 j/M)]
(A small sketch for sampling this kernel follows this slide.)
• Similar considerations as designing LoG
915
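For illustration, a minimal sketch that samples the phasor-form digital Gabor above on an N x M grid; centering the grid at its midpoint is an assumption.

import numpy as np

def gabor_kernel(N, M, u0, v0, sigma, lam):
    # Phasor-form Gabor g(i, j) sampled on an N x M grid centered at the midpoint
    i = np.arange(N) - N // 2
    j = np.arange(M) - M // 2
    I, J = np.meshgrid(i, j, indexing='ij')
    envelope = np.exp(-((I / lam) ** 2 + J ** 2) / (2 * sigma ** 2)) / (2 * np.pi * lam * sigma ** 2)
    carrier = np.exp(2j * np.pi * (u0 * I / N + v0 * J / M))
    return envelope * carrier   # real part: cosine Gabor, imaginary part: sine Gabor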
Minimum Uncertainty
• Amongst all complex functions and in any dimension, Gabor
functions uniquely minimize the uncertainty principle:
[∫ x² |f(x)|² dx / ∫ |f(x)|² dx] · [∫ u² |F(u)|² du / ∫ |F(u)|² du] ≥ 1/(4π)²
• Similar for y, v.
• They have minimal simultaneous space-frequency duration.
• Given a bandwidth, can perform the most localized spatial
analysis.
916
Fourier Transform Magnitudes of Gabors
Sine and cosine Gabors have the same FT magnitudes. The phasor form occupies only a half plane.
917
Gabor Function History
• Gabor functions were first studied in the context of
information theory by Nobel laureate D. Gabor.
• In 1980, S. Marcelja noted that the simple cell receptive fields
measured along 1-D are well-modeled by 1-D Gabor functions, which
have optimal space-frequency localization.
• In 1981, D.A. Pollen and S.F. Ronner showed that the simple cells
often appear in phase quadrature pairs, which are natural for Gabor
functions.
• In 1985, J. Daugman observed that receptive field profiles fit 2-D
Gabor functions and have optimal 2-D space-frequency localization.
918
Gabor Filterbanks
• Exact arrangement of simple cells not known, but all orientations
are possible (horizontal/vertical more common).
• The cortex is layered. Simple cells that are in different layers
but similar positions come from the same eye and are
responsive to similar orientations. Called ocular dominance
columns.
• Bandwidths in the range 0.5 octave – 2.5 octave are common.
• In engineering, constant-octave filterbanks covering the plane
are common.
919
Gabor Filterbanks
• Constant-octave Gabor
filter bank (cosine form)
• Nine orientations, four
filters/orientation
• Constant-octave Gabor
filter bank (phasor form)
• Eight orientations, five
filters/orientation
920
The Image Modulation Model
(where we model an image as a sine wave!)
I(x, y) = A cos[φ(x, y)]
921
FM Approximation
• We’ll do this in 1-D for simplicity. A 1-D continuous Gabor:
g(x) = g_c(x) + √−1 g_s(x) = [1/(√(2π) σ)] exp[−x²/(2σ²)] exp(2π√−1 u0 x)
G(u) = exp[−2π²σ²(u − u0)²]
• 1-D FM image: I(x) = A cos[φ(x)]. Assuming the instantaneous
frequency φ′(x) changes slowly, then
g(x) * I(x) ≈ (A/2) G(φ′(x)) exp[√−1 φ(x)]
• Contains the responses of both the cos Gabor and sin Gabor filters.
922
2D FM Approximation
• 2-D continuous Gabor: g(x, y) = g_c(x, y) + √−1 g_s(x, y), with Fourier transform G(u, v).
• 2-D FM image: I(x, y) = A cos[φ(x, y)]. Assuming the
instantaneous frequencies φ_x(x, y), φ_y(x, y) change slowly, then
g(x, y) * I(x, y) ≈ (A/2) G(φ_x(x, y), φ_y(x, y)) exp[√−1 φ(x, y)]
• Contains the responses of both the cos Gabor and sin Gabor filters.
923
FM Demodulation
• 1-D Gabor-filtered FM image demodulation:
|g(x) * I(x)| ≈ (A/2) G(φ′(x))
• 2-D Gabor-filtered FM image demodulation:
|g(x, y) * I(x, y)| ≈ (A/2) G(φ_x(x, y), φ_y(x, y))
924
Energy Model for Quadrature
Simple Cell Demodulation
(Block diagram: the cosine Gabor and sine Gabor outputs are each squared and then summed (Σ),
yielding the demodulated local energy.)
925
Complex Cells
• Less well understood; their responses are nonlinear.
• Viewed as accepting simple cell responses as
inputs, and processing them in ways that are not
yet understood.
• Simplest task might be the demodulation process on
the previous slide.
926
While we do not know
how the vision system
processes the spatial
simple cell (Gabor-like)
responses, we can
certainly use the model
creatively for image
processing.
927
Using V1 Cortical Cell Models
• Massive decomposition of space-time visual data
provides optimally localized atoms of pattern, pattern
change, and pattern disparity information.
• This information is passed en masse to other brain
centers – to accomplish other processing.
928
Spatial Orientation Estimation
• Now assume an image has a strong oriented
component modeled as an FM function
I(x) = A cos[φ(x)],  x = (x, y)
• Like sine wave gratings discussed earlier,
but local frequencies and orientations
change.
• Goal: Estimate ∇φ(x) = (φ_x(x), φ_y(x))
929
Image with strong
FM component(s)
930
Frequency Modulation
• The instantaneous frequencies define the direction and rate of
propagation of the pattern:
|∇φ(x)| = √[φ_x²(x) + φ_y²(x)]
∠∇φ(x) = tan⁻¹[φ_y(x) / φ_x(x)]
931
Demodulation
• Simple approach (a code sketch follows this slide):
(1) Pass the image through a bank of complex Gabor filters
t_i(x) = I(x)*g_i(x); i = 1, …, K
(2) At each x find the largest magnitude response (demodulate)
i# = arg max_i |t_i(x)|,   t#(x) = t_{i#}(x)
(3) Then
∇φ(x) ≈ Re[ ∇t#(x) / (√−1 t#(x)) ]
932
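A minimal sketch of this demodulation, assuming a list of complex Gabor kernels is already available (e.g., built with the gabor_kernel sketch earlier); it uses a numerical gradient for ∇t# and a small epsilon to avoid division by zero, and ignores the reliability of low-magnitude responses.

import numpy as np
from scipy.signal import fftconvolve

def demodulate(I, gabors):
    # gabors: list of complex (phasor-form) Gabor kernels g_i
    responses = np.stack([fftconvolve(I, g, mode='same') for g in gabors])
    idx = np.argmax(np.abs(responses), axis=0)                    # i# at each pixel
    t = np.take_along_axis(responses, idx[None, ...], axis=0)[0]  # t#(x)
    ty, tx = np.gradient(t)                                       # numerical gradient of t#
    eps = 1e-12                                                   # avoid division by zero
    phi_x = np.real(tx / (1j * t + eps))                          # horizontal instantaneous frequency
    phi_y = np.real(ty / (1j * t + eps))                          # vertical instantaneous frequency
    return phi_x, phi_y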
Example
Image
Orientation and frequency
needle map
Reconstructed FM
component
933
Example
Image
Orientation and frequency
needle map
Reconstructed FM
component
934
Segmentation by Clustering
Multi-partite
image
Reconstructed FM
component
Orientation and
frequency needle map
Segmentation by
k-means clustering
935
Comments
• Preceding algorithm utilizes the linear filter
responses which closely model V1 simple cell
responses.
• The algorithms also use nonlinear processing of many simple cell
responses, which could be similar to V1 complex cell processing.
• The image modulation model can be made much more general:
I(x, y) = Σ_{i=1}^{K} A_i cos[φ_i(x, y)]
936
K=43 Components
937
K=43 Components
938
K=43 Components
939
K=43 Components
940
K=6 (only!) Components
941
The 6 Components
(Figure: the six FM components, summed together.)
942
943
Iris Recognition
944
Flow of Visual Data
(Diagram: visual data flows from the LGN to Area V1; the dorsal stream continues to Area V5/MT,
while the ventral stream carries object recognition and long-term memory information.)
945
Ventral Stream
• The ventral stream (also called the “what pathway”)
begins with Area V1, and lands in the inferior temporal
gyrus.
• Much shared processing occurs between V1, V2 (on the
way to ITG), and the ITG.
• Areas devoted to object recognition, memory, and closer
to consciousness.
946
Inferior Temporal Gyrus
V1
Also called Inferior Temporal (IT) Cortex
947
Inferior Temporal Gyrus
(Fusiform Gyrus)
• Has available entire retinotopic map from Area V1.
• Considerable back-and-forth feedback with V1.
• Devoted to visual memory and visual object recognition.
• Not much known about “how” it’s done.
• Computational algorithms for object recognition using
V1 primitive models have proved effective.
948
Biometrics: Recognition of Iris
949
Daugman’s Iris Recognition System
• One of the earliest approaches to recognition using
V1 primitives was J. Daugman’s iris recognition
system.
• Uses spatial V1 (Gabor) model responses to
perform an important biometric application:
recognition of the iris of the eye.
• This far-seeing invention is still the industry standard
of performance.
950
Iris Recognition
(1) Detect and extract iris and “unroll it”
951
Iris Recognition
(2) Filter each row with 1-D cosine (top) and
sine Gabor filters, and binarize result.
952
Iris Recognition
(3) Compare binary rows with the binary reference of the iris
stored in the database using the Hamming distance
H(A, B) = cardinality(A XOR B) / cardinality(A)
against a threshold t (a small code sketch follows this slide).
• If A = B, then H(A, B) = 0.
• Random chance: H(A, B) = 0.5.
• For t = 0.3, the typical false positive rate is < 10⁻⁷.
953
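A minimal sketch of the Hamming-distance comparison for two binary iris codes; it omits the rotation (shift) search and masking of occluded bits used in practical systems.

import numpy as np

def iris_hamming(A, B, t=0.3):
    # A, B: binary iris codes (equal-length arrays of 0s and 1s)
    H = np.count_nonzero(A != B) / A.size
    return H, H < t     # Hamming distance and the accept/reject decision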
Seeing in 3-D
• Three-dimensional vision involves many modes of perception:
– Relative size
– Stereo vision
– Motion parallax
– Occlusion
– Foreshortening, convergence of lines, etc.
• A famous illusion on this theme is the Ames Room … the
eye-brain can be fooled by 3D cues!
• We will study a main mode: stereo vision
954
RANGE FINDING BY STEREO
IMAGING
• Next we will study a significant computer vision
application: stereo vision.
• The goal is to compute a 3-D map using multiple
cameras.
• We will introduce the stereo camera geometry and the
concept of triangulation.
• At the core of stereo vision is a very hard problem that is
called The Correspondence Problem.
• Of course, humans see in stereo quite well …. (click
here)
955
Stereo Camera Geometry and
Triangulation
• Begin with the geometry for perspective projection:
(Diagram: perspective projection with the lens center at the origin, the image plane one
focal length f away, and image axes (x, y) parallel to the world axes (X, Y).)
• (X, Y, Z) are points in 3-D space
• The origin (X, Y, Z) = (0, 0, 0) is the lens center
• (x, y) denote 2-D image points
• The x - y plane is parallel to the X - Y plane
• The optical axis passes through both origins
• The image plane is one focal length f from the lens
956
Relationship Between 3-D and 2-D
Coordinates
• We found that a point (X, Y, Z) in 3-D space projects to a point
(x, y) = (f/Z)(X, Y)
in the 2-D image, where f = focal length and f/Z is the magnification factor.
• We will now modify the camera geometry. First, we'll shift the camera to
the left and right along the X-axis (in real space).
• We will then consider the case where we have two cameras, equidistant
from the origin (X, Y, Z) = (0, 0, 0).
• By relating the projections (images) from the two cameras, we will find
that it is possible to deduce 3-D information about the objects in the scene.
957
Shifting the Camera to the Left
• Suppose that we shift the camera (lens center) along the X-axis
by an amount -D:
(Diagram: the shifted camera with lens center at (X, Y, Z) = (-D, 0, 0), image plane at
focal length f, and image coordinates (x', y').)
• Note: the coordinates of the image are now denoted (x', y').
• Now a point (X, Y, Z) in 3-D space projects to a point
(x', y') = (f/Z)(X+D, Y)
in the left-shifted image.
958
Shifting the Camera to the Right
• Suppose we shift the camera (lens center) along the X-axis by amount +D:
(Diagram: the shifted camera with lens center at (X, Y, Z) = (+D, 0, 0), image plane at
focal length f, and image coordinates (x'', y'').)
• The coordinates of the image are now denoted (x'', y'').
• Now a point (X, Y, Z) in 3-D space projects to a point
(x'', y'') = (f/Z)(X-D, Y)
in the right-shifted image.
• Note that the optical axis is still parallel to the Z-axis.
959
Binocular Camera Geometry
• Now suppose that we place two cameras (with the same focal lengths f),
at a distance 2D apart (the baseline distance), with parallel optical axes:
(Diagram: left camera with lens center at (X, Y, Z) = (-D, 0, 0) and image plane (x', y');
right camera with lens center at (X, Y, Z) = (+D, 0, 0) and image plane (x'', y'');
both image planes at focal length f.)
• This is a parallel or nonconvergent binocular camera geometry.
• Suppose we are able to identify the images of an object feature (a point)
whose coordinates are (X0, Y0, Z0) in 3-D space.
• We could do this, e.g., interactively, by pointing a mouse at the image of
the object feature in both of the images and determining the coordinates.
960
Triangulation
• The projections of the point (X0, Y0, Z0) in the two camera images are
(x0', y0') = (f/Z0)(X0+D, Y0)   and   (x0'', y0'') = (f/Z0)(X0-D, Y0)
• The horizontal and vertical disparities between images of (X0, Y0, Z0) are:
Δx0 = x0' - x0'' = (f/Z0) [(X0+D) - (X0-D)] = 2Df/Z0
Δy0 = y0' - y0'' = (f/Z0) (Y0 - Y0) = 0
• In a non-convergent system, the vertical disparity is always zero.
• The horizontal disparity is extremely useful. We can solve for Z0 in terms
of it (a small code sketch follows this slide):
Z0 = 2Df/Δx0
961
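A minimal sketch of the triangulation step for the nonconvergent geometry above, assuming the matched horizontal image coordinates, the focal length f, and the half-baseline D are known and expressed in consistent units.

def depth_from_disparity(x_left, x_right, f, D):
    # Nonconvergent stereo triangulation: Z0 = 2*D*f / (x0' - x0'')
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("expected a positive horizontal disparity")
    return 2.0 * D * f / disparity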
Ramifications
• The triangulation equation
Z0 = 2Df/Δx0
implies that we can compute the distance, from the baseline (hence to
anywhere), to any point in the scene that is visible to both cameras ....
• ... provided we can find its horizontal image coordinates x0' and x0''.
• This follows since the camera focal length f is known, the baseline
separation 2D is known, and the (X, Y, Z) coordinate system is known (it's
defined by the camera placement).
• This approach is used in aerial stereo-photogrammetry using two cameras
a known distance apart (e.g., on the wingtips).
• Historically, a human operator would painstakingly identify matching
points between the images, measure their coordinates, and compute depths
via triangulation.
• There is an old device called a stereoplotter that was used for this.
962
Simple Cell Disparity Responses
• Back to visual cortex….
• A small percentage of simple cells appear to be sensitive
to disparities between the signals coming from the two
retinae.
• These may accept input from simple cell pairs tuned to
corresponding locations on the two retinae.
• Likely this is used in 3D depth perception.
963
Stereopsis is Computational
(Figures: a random-dot stereogram and a random-line stereogram.)
• Cross your eyes …. or “relax” them.
• As proved by B. Julesz’ random dot and line stereograms.
• A similar visual aid … click here.
964
An Autostereogram
Remember those “Magic Eye” books?
965
The Depth Map
966
In Motion!
967
Stereo (Binocular)
Geometry
• Vergence angle depends on
object distance.
• Shown are angular and
positional disparities.
• The simple cells appear to
respond to the positional
disparities.
• But the vergence signals are also
sent to the depth processing areas
of the brain. They are part of the
oculomotor control system.
968
Disparity-Sensitive Simple Cells
Spatial-only
simple cells
Disparity-sensitive
simple cells
969
Phase-Based Stereo
• Recall the 2-D Gabor filter model:
g(x) = K exp{−[(x/λ)² + y²]/(2σ²)} exp[2π√−1 (u0 x + v0 y)]
• Find the filterbank responses
t_i(x) = I(x)*g_i(x); i = 1, …, K
• Largest magnitude response:
i* = arg max_i |t_i(x)|,   t*(x) = t_{i*}(x)
970
Left and Right Responses
• Find the largest response for both the left and right camera images:
t*_L(x_L) ≈ A G(∇φ_L(x_L)) exp[√−1 φ_L(x_L)]
t*_R(x_R) ≈ A G(∇φ_R(x_R)) exp[√−1 φ_R(x_R)]
• Demodulate to obtain φ_L(x_L) and φ_R(x_R):
φ_L(x_L) = cos⁻¹( Re[ t*_L(x_L) / |t*_L(x_L)| ] )
φ_R(x_R) = cos⁻¹( Re[ t*_R(x_R) / |t*_R(x_R)| ] )
• Stereo assumption:
φ_L(x_L) − φ_R[x_L − Δx(x_L)] = 0
971
Left and Right Responses
• Here is the (truncated) 2-D Taylor’s formula:
f(x0+Δx, y0+Δy) ≈ f(x0, y0) + f_x(x0, y0)Δx + f_y(x0, y0)Δy
or
f(x0 + Δx) ≈ f(x0) + Δxᵀ ∇f(x0)
• Taylor approximation [Note: Δx(x_L) = (Δx(x_L), 0)]:
φ_L(x_L) ≈ φ_R(x_L) + Δx(x_L)ᵀ ∇φ_R(x_L)
• Algorithm (a small sketch follows this slide):
Δx(x_L) ≈ [φ_L(x_L) − φ_R(x_L)] / [∂φ_R(x_L)/∂x]
972
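A minimal sketch of the first-order phase-difference update above, assuming the dominant complex Gabor responses of the left and right images are already available; real implementations must additionally handle phase wrapping, unreliable low-magnitude responses, and iteration of the estimate.

import numpy as np

def phase_disparity(tL, tR):
    # tL, tR: complex dominant Gabor responses of the left and right images (same shape)
    phiL = np.angle(tL)
    phiR = np.angle(tR)
    dphiR_dx = np.gradient(np.unwrap(phiR, axis=1), axis=1)   # horizontal phase derivative
    eps = 1e-8                                                # avoid division by zero
    return (phiL - phiR) / (dphiR_dx + eps)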
973
Phase-Based Stereo Example
“Turf”
Brighter = nearer
974
Phase-Based Stereo Example
“Tree trunk”
Brighter = nearer
975
Phase-Based Stereo Example
“baseball”
Brighter = nearer
976
Phase-Based Stereo Example
“Doorway”
Brighter = nearer
977
Phase-Based Stereo Example
“Stone”
Brighter = nearer
978
Phase-Based Stereo Example
“Pentagon”
Brighter = nearer
979
Case Study: Deep Stereopsis
980
Deep Stereopsis
• Uses the idea of a “siamese” network: dual, parallel
networks with shared weights (that are thus trained
together and equal). General model, from paper here:*
*Zbontar, et al., “Stereo matching by training a convolutional neural network to compare
image patches,” Journal of Machine Learning Research, 2016.
981
Deep Stereopsis
• Applied to patches by maximizing a similarity score
between potential patch matches.
• Trained on “Kitti,” a large dataset with ground-truth depths
measured by a terrestrial lidar range scanner.
• Also on the small Middlebury Stereo dataset.
• Basic aspects:
– Cross-entropy loss on binary decision match / no match
– Matching cost is combined over “cross-based” neighboring
windows to handle depth discontinuities
– A disparity smoothness criterion modifies the loss
– Subpixel enhancement and post-filtering with median and
bilateral filters
*Zbontar, et al., “Stereo matching by training a convolutional neural network to compare
image patches,” Journal of Machine Learning Research, 2016.
982
Cross-Based Cost
• Define a support region of four “arms” of each pixel p:
left, right, top, down.
• Example: The left arm of p consists of those pixels p_L satisfying
|I(p) − I(p_L)| < D1   and   |p − p_L| < D2
983
Cross-Based Cost
• Consider a pixel p in the left image, and a pixel p - d in the
right image, shifted by disparity d.
• Let UR and UL indicate support regions in the left and right
images: UL(p) is the support region at p in the left image and
UR(p – d) is the support region at p – d in the right image.
• The “combined” left-right support region is
Ud(p) = {q: q  UL(p) , q – d  UR(p – d)}.
• When training, run the Siamese networks once, and the FC
layers d times, where d = maximum allowed disparity.
984
Matching Cost
• The matching cost is (iteratively) averaged over the
combined region (a simplified sketch follows this slide):
C^0(p, d) = CNN(p, d)
C^i(p, d) = [1/|U_d(p)|] Σ_{q ∈ U_d(p)} C^(i-1)(q, d)
where, for patches P_L(p), P_R(p − d):
CNN(p, d) = cross-entropy[P_L(p), P_R(p − d)]
• Perform four iterations to obtain C^4(p, d).
985
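A simplified, illustrative sketch of the iterative averaging step only; it assumes the initial costs and a support() function returning U_d(p) are supplied by the caller, and uses plain Python loops rather than the vectorized implementation a real system would need.

import numpy as np

def aggregate_cost(C0, support, iterations=4):
    # C0: (H, W, D) array of initial costs C^0(p, d); support((i, j), d) returns the list
    # of pixels q in the combined region U_d(p) (assumed supplied by the caller).
    C = C0.astype(float).copy()
    H, W, D = C.shape
    for _ in range(iterations):
        C_next = np.empty_like(C)
        for i in range(H):
            for j in range(W):
                for d in range(D):
                    region = support((i, j), d)
                    C_next[i, j, d] = np.mean([C[q[0], q[1], d] for q in region])
        C = C_next
    return C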
Disparity Smoothness
• A very old concept in computational stereo.
• Additional (smoothness) cost on the disparity function D(p):
E(D) = Σ_p { C^4(p, D(p)) + Σ_{q ∈ N_p} λ1·1[|D(p) − D(q)| = 1] + Σ_{q ∈ N_p} λ2·1[|D(p) − D(q)| > 1] }
where
1[·] = binary set indicator function
N_p = neighborhood of p
λ1, λ2 are weights
986
Final Depth Prediction
• Optimize:
D(p) = arg min_d C(p, d)
• To correct remaining errors and improve precision the authors
– Interpolate to correct mismatch errors
– Compute sub-pixel depth estimates by interpolation
– Fix final errors by median and bilateral filtering
• Feel free to read about these ad hoc refinements in the paper.
987
Examples on Kitti
• For a while this was the top model on Kitti (naturally, refinements
and new models have surpassed it).
• But it remains a relatively simple and fundamental high-performing example of deep learning depth from stereo.
988
Examples on Kitti
989
Deep Learning In Context
• Deep learning models sometimes approach human performance
on a wide variety of visual tasks.
• Yet they can fall apart entirely.
• A very reasonable outlook on their limitations is given here.
A top DNN learner was >99.6% confident of these classifications.
990
Comments
• That’s all, folks! Thanks for coming!
991