Robert Laganière uOttawa 2015
Vision:
◦ perception from visual information, including objects, colors, distances, structure, and motion.
◦ The vision process is also influenced by memory.
Perception:
◦ the faculty of capturing the environment using senses and mind
What is the main organ of vision?
◦ The brain
◦ The eyes simply project light on the retina
Light waves from an object enter the eye through the pupil, the circular aperture in the iris.
The light waves are converged first by the cornea, and then further by the crystalline lens, to a nodal point located immediately behind the back surface of the lens.
◦ At that point, the image becomes inverted
The light progresses through the gelatinous vitreous humor and, ideally, back to a clear focus on the retina
In the retina, light impulses are changed into electrical signals and then sent along the optic nerve and back to the occipital (posterior) lobe of the brain
◦ which interprets these electrical signals as visual images from http://www.tedmontgomery.com/the_eye/
The compound (lenticular) eye
◦ Insects
The projective (camera) eye
Animals have two eyes
◦ for depth perception
◦ for panoramic view
From http://ebiomedia.com/gall/eyes/octopus-insect.html
from http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/V/Vision.html
130 million photosensitive cells
1 million nerve cells exiting each eye
One blind spot
When the ciliary muscle is relaxed, its diameter increases and the lens is flattened
Distance from lens to fovea: 17mm
Distance between eyes: 50 to 70mm
Depth perception: from 25cm to 450m
Involuntary eye movements: 10 to 15 seconds of arc every 0.3 sec
Visible portion of the electromagnetic spectrum:
400 to 700 nanometers
Rod cells
Cone cells
◦ Visual angle subtended by fovea:
1-2 degree for rod-free area;
after 5 degrees, there is a drop of 50% in visual resolution
120 million rod cells
◦ mainly on the periphery
Angle for maximal rod density:
◦ 15 to 20 degrees
Wavelength of maximum rod sensitivity:
◦ 510 nm (green)
More sensitive to light (vision in the dark)
Many rods are connected to one nerve fiber
Responsible for object detection
6 million cone cells, smaller than the rods and 500 times less sensitive
3 types of cones (color perception):
◦ red absorbing cones;
those that absorb best at the relatively long wavelengths peaking at 565 nm
◦ green absorbing cones:
with a peak absorption at 535 nm
◦ blue absorbing cones:
with a peak absorption at 440 nm
Responsible for object identification
Not many animals have cone (color) vision (bees and birds do)
Camera obscura
1816 Nicéphore Niépce’s heliography combines the camera obscura and a photo-sensitive coating on a copper plate
1837 Louis Daguerre's Daguerreotype uses silver salts to create images on silver-plated copper, developed using mercury vapor
1851 Frederick Scott Archer invented the negative/positive process allowing unlimited number of reproductions
1882 Etienne-Jules Marey’s chronophotographic gun
1894 Thomas Edison’s
Kinetoscope
Thin lens equation: 1/f = 1/u + 1/v
Magnification: m = image_size / object_size = v/u = f/(u − f)
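As a small numeric illustration of these relations (all values below are made up), a Python sketch:

```python
# Illustrative numbers only: a 50 mm lens imaging an object 2 m away.
def image_distance(f_mm, u_mm):
    # thin lens equation: 1/f = 1/u + 1/v  =>  v = 1 / (1/f - 1/u)
    return 1.0 / (1.0 / f_mm - 1.0 / u_mm)

f, u = 50.0, 2000.0
v = image_distance(f, u)   # ~51.3 mm behind the lens
m = v / u                  # magnification ~0.026, equal to f / (u - f)
print(v, m)
```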
A normal lens is a lens that generates an image that presents a ‘natural perspective’.
◦ Its focal length is roughly equivalent to the diagonal of the image size.
◦ This roughly approximates the perceived field of view of the human eye. In the 135 film format, the image size is 24x36mm, the normal lens is therefore 50mm.
The field of view (or angle of view) is the amount of a given scene shown on an image.
◦ The focal length and the film (sensor) size define the field of view.
◦ Typical sensor sizes for cameras are:
1/4 in. (2.4mm x 3.2mm), 1/3 in. (3.6mm x 4.8mm), 1/2 in. (4.8mm x 6.4mm),
2/3 in. (6.6mm x 8.8mm), 1 in. (9.6mm x 12.8mm)
The aperture defines the size of the opening in the lens,
◦ it can be adjusted to control the amount of light reaching the digital sensor.
◦ The iris diaphragm, behind the lens, controls the lens opening.
◦ It is measured in f-stops: f-number = focal_length / aperture_diameter
◦ The aperture also controls the depth of field which is the distance in front and behind the subject that appears to be in focus.
The smaller the aperture, the larger the depth of field.
The shutter speed defines the duration of the exposure to light.
Shutter speed and aperture regulate the degree of exposure to light.
Radiance: the total amount of energy that flows from a light source (W)
Luminance: the amount of energy an observer receives from a source (lm)
Brightness: subjective descriptor of light perception
Reflectance: the fraction of radiant energy that is reflected from a surface
◦ 0.01 for black velvet
◦ 0.8 for flat white wall
◦ 0.93 for snow
To convert luminance into an electric signal
1942 - Close-circuit television
(CCTV)
◦ By the Germans to monitor the launch of V2 rockets
1951 - The Video Recorder
◦ invented by Charles Ginsburg at
Ampex corporation
An image is a 2D signal and an image sequence is a 3D signal.
◦ When serializing an image (for transmission or recording), it is read line by line.
◦ For a constant reading speed, one can:
Increase the number of lines (image resolution), which reduces the frame rate (and causes temporal aliasing).
Increase the frame rate, which reduces the number of lines per frame (and causes vertical aliasing).
• The tradeoff solution is to read the image in two passes,
• even lines first
• then odd lines after
• This is interlaced scanning.
from: http://zone.ni.com/devzone/conceptd.nsf/webmain/BA741C90A118EA778625685E00805643?opendocument&node=201713_us
The difficulty with interlaced scanning is that the image must be written (on a display device) the same way it has been read (by a camera):
◦ a synchronization signal must be added.
◦ Composite signal = luminance + synch.
Horizontal synchronization
(18% of the total signal, 2.3µs each)
Vertical synchronization
(8% of the total signal, 27.1µs each) from: http://zone.ni.com/devzone/devzoneweb.nsf/Opendoc?openagent&BB087524D4052C9E8625685E0080301B
NTSC (1941):
◦ Total number of lines: 525
◦ Number of active lines: 483
◦ Aspect ratio: 4:3
◦ Line frequency: 15.75KHz
◦ Field frequency: 59.94Hz
◦ Bandwidth (monochrome): 4.2MHz
CCIR601 (1986):
◦ Number of pixels per line: 720
◦ 525 lines
◦ YUV 4:2:2
◦ Transmission rate (color): 216Mb/s
HDTV (1996):
◦ Aspect ratio: 16:9
◦ Number of pixels per line: 1920
◦ Number of lines: 1080
◦ 1.4 Mpixels
CCD (charged-coupled device)
◦ An array of discrete imaging elements
(photon detector)
◦ Each photosite is composed of a photodiode and an adjacent charge transfer region arranged in columns
◦ The photodiode accumulates an electric charge proportional to illumination (number of received photons) × exposure time (and also to temperature…).
2 photons produce ~1 electron.
◦ Linear response to light intensity
◦ Uses an electronic shutter.
If the light intensity is too low for a given exposure, it is possible to adjust the gain
(when generating the output signal)
◦ Complex synchronization
◦ 2009 Nobel prize in Physics winners
Willard Boyle and George Smith
◦ To be discontinued in 2017… ?
from: https://www.microscopyu.com/articles/digitalimaging/ccdintro.html
CMOS (complementary metal oxide semiconductor)
◦ Each photosite contains a photodiode, a resistor, an amplifier that changes electric charges to voltage and a select transistor
◦ Overlaying the entire pixel array is a grid of metal interconnects, which applies timing and readout signals
◦ Each pixel can be read individually
◦ CMOS sensors are produced using the same manufacturing process as microprocessors
Low-cost
◦ Faster image data transfer rate
◦ Consume little power
20-50mW vs 2-5W for CCDs
◦ Lower sensitivity to light
◦ Noisier
The lens model
Only one ray of light per scene point
Image plane is moved in front of the focal point: hi = f·ho/do
Pinhole projection:
x = f·X/Z, y = f·Y/Z
In pixel coordinates:
x = fx·X/Z + u0, y = fy·Y/Z + v0
These are the intrinsic parameters of the camera
We need many known 3D points and their image
◦ on the same image:

      [x]   [fx  0  u0] [1 0 0 0] [X]
    s [y] = [ 0 fy  v0] [0 1 0 0] [Y]
      [1]   [ 0  0   1] [0 0 1 0] [Z]
                                  [1]
◦ No: multiple images of a 2D target are used instead:

      [x]   [fx  0  u0] [r1 r2 r3 t1] [X]
    s [y] = [ 0 fy  v0] [r4 r5 r6 t2] [Y]
      [1]   [ 0  0   1] [r7 r8 r9 t3] [Z]
                                      [1]
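A possible sketch of this procedure with OpenCV's Python API, assuming several views of a 9x6 inner-corner chessboard target stored under a hypothetical file pattern:

```python
import glob
import cv2
import numpy as np

# Assumed calibration views; any set of chessboard images would do.
images = [cv2.imread(f, cv2.IMREAD_GRAYSCALE) for f in glob.glob("calib_*.png")]

pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)  # target on the Z=0 plane

obj_points, img_points = [], []
for gray in images:
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Returns the intrinsic matrix [[fx 0 u0],[0 fy v0],[0 0 1]], the distortion
# coefficients, and one rotation/translation (extrinsic) pair per view.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```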
Lens distortion adds more parameters
◦ The farther away pixels are from the center of the camera, the more distorted they are
What is a color? It is a spectral power distribution (inside the visible spectrum) of the light reflected or transmitted by an object.
◦ Many different spectral power distributions may form the same color.
◦ A pure color is a color composed of only one wavelength (the colors of the rainbow); also called monochromatic color.
◦ A color has a given hue, a given saturation and a given brightness.
Metamer: either of two colors of different spectral composition that appear identical to the eye of a single observer under some lighting conditions.
For a human observer, it is possible to find a metamer for any color by variation of only three primaries
3 techniques
◦ By subtraction (painting) [RBY]
◦ By addition (photography) [RGB]
◦ Hybrid (color printing): printing inks cannot mix; only one color of ink can be allowed to be on a particular point of the picture. [CMYK]
CIE (Commission Internationale de l'Eclairage) is the primary organization that defines color metric standards.
One possible choice for the (monochromatic) primaries (CIE RGB):
◦ Red (700nm)
◦ Green (546.1nm)
◦ Blue (435.8 nm)
[Figure: additive color wheel with primaries Red, Green, Blue and secondaries Yellow, Magenta, Cyan]
Gamut: the entire range of colors that a system can reproduce.
Chromaticity coordinates: ratio of each tristimulus value to their sum
from: http://en.wikipedia.org/wiki/CIE_1931_color_space
But X,Y, Z are not visible colors!
[R]   [ 3.240479  -1.537150  -0.498535] [X]
[G] = [-0.969256   1.875992   0.041556] [Y]
[B]   [ 0.055648  -0.204043   1.057311] [Z]

[X]   [0.412453  0.357580  0.180423] [R]
[Y] = [0.212671  0.715160  0.072169] [G]
[Z]   [0.019334  0.119193  0.950227] [B]
For digital color television
RGB are converted to YCrCb, mainly for more efficient coding
◦ The Y signal corresponds to the B&W television signal:
Y= 0.299R+0.587G+0.114B
◦ U (Cb) and V (Cr) subtract the luminance value from B and R respectively (can be negative)
Cr= 0.5R-0.4187G-0.0813B (red to yellow)
Cb= -0.1687R-0.3313G+0.5B (blue to yellow)
8-bit representation:
Y8= 219Y+16
Cr= 112(R-Y)/0.701 + 128
Cb= 112(B-Y)/0.886 + 128
Saturation is a measure of how vivid the color is
(purity, colorfulness).
Hue defines the color type (expressed as an angle)
Brightness is a visual sensation of the color intensity
Lightness is the perceptual response to luminance: (max(R,G,B)+min(R,G,B))/2
Intensity: (R+G+B)/3
Value: max(R,G,B)
[Figure: luminance scale vs. brightness scale]
The perceptual difference between two colors is not proportional to their distance in the x-y color space
The color space can be made perceptually uniform through the following transformations
with three sensors
with one sensor and Bayer pattern
◦ Each color component can be linearly interpolated from its two (or four) nearest neighbors.
from: http://www.siliconimaging.com/
A histogram is a simple table giving the number of pixels that have a given value in an image
◦ or sometimes a set of images.
The histogram of an 8-bit gray-level image has 256 entries (bins)
[Figure: bimodal histogram with a background mode and a foreground mode]
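A minimal Python sketch of histogram computation (the file name is hypothetical):

```python
import cv2
import numpy as np

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image

# 256-bin histogram of an 8-bit gray-level image, two equivalent ways.
hist_cv = cv2.calcHist([img], [0], None, [256], [0, 256]).ravel()
hist_np = np.bincount(img.ravel(), minlength=256)
```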
A look-up table is a simple one-to-one (or many-to-one) function that defines how pixel values are transformed into new values.
◦ For an 8-bit image: a 1D array of 256 entries.
◦ Entry i gives the new intensity value for gray level i
By stretching the histogram
◦ Look-up table: 255.0*(i − imin)/(imax − imin)
To obtain a flat histogram
◦ Based on the normalized cumulative histogram p
◦ Look-up table: 255*p[i]
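Both look-up tables can be built directly from these formulas; a NumPy/OpenCV sketch, assuming `img` is an 8-bit gray-level image (file name hypothetical):

```python
import cv2
import numpy as np

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

# Stretching LUT: maps [imin, imax] onto [0, 255].
i_min, i_max = int(img.min()), int(img.max())
stretch_lut = np.clip(255.0 * (np.arange(256) - i_min) / (i_max - i_min),
                      0, 255).astype(np.uint8)

# Equalization LUT: 255 * normalized cumulative histogram.
hist = np.bincount(img.ravel(), minlength=256)
p = hist.cumsum() / hist.sum()
equalize_lut = (255 * p).astype(np.uint8)

stretched = cv2.LUT(img, stretch_lut)
equalized = cv2.LUT(img, equalize_lut)   # comparable to cv2.equalizeHist(img)
```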
To detect specific image content (see the sketch below):
1. Build the histogram of an image subwindow
2. Normalize the histogram
   The histogram is now viewed as a probability function that represents the image content
3. Use the extracted histogram as a look-up table in another image
4. Apply a threshold in order to extract possible content locations
Possible improvement: use the ratio of the object's histogram over the image's histogram (saturated at 1.0)
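A possible back-projection sketch of these four steps; the variable names `roi` and `target`, the use of the hue channel, and the threshold value are all assumptions:

```python
import cv2

roi = cv2.imread("object_patch.png")      # assumed sub-window containing the object
target = cv2.imread("search_image.png")   # assumed image to search

roi_hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
target_hsv = cv2.cvtColor(target, cv2.COLOR_BGR2HSV)

roi_hist = cv2.calcHist([roi_hsv], [0], None, [180], [0, 180])   # step 1: hue histogram
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)       # step 2

back_proj = cv2.calcBackProject([target_hsv], [0], roi_hist, [0, 180], 1)   # step 3
_, detected = cv2.threshold(back_proj, 50, 255, cv2.THRESH_BINARY)          # step 4
```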
The appropriate color space must be selected, e.g. Hue
◦ Hue for unsaturated pixels is not reliable
Tracking objects using histograms to represent their appearance
Mean-shift is an iterative algorithm that finds the region of highest density
◦ A window is moved in the direction of the center of mass
1. Start from an initial position
2. Extract the histogram representation
3. In the next frame, back-project this histogram
4. Iteratively move the window in the direction of maximal probability (see the sketch below)
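A rough sketch of this tracking loop with OpenCV's `cv2.meanShift`; the video file name, initial window and termination criteria are placeholders:

```python
import cv2

cap = cv2.VideoCapture("video.avi")        # assumed input sequence
ok, frame = cap.read()
x, y, w, h = 300, 200, 80, 80              # step 1: assumed initial window

hsv = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])   # step 2: hue histogram
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)      # step 3
    _, (x, y, w, h) = cv2.meanShift(back_proj, (x, y, w, h), criteria)  # step 4
```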
Image similarity can sometimes be measured by comparing histograms
Most histogram comparison measures are based on bin-by-bin comparisons
◦ Histogram intersection
◦ Chi-square test
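Both measures can be written directly on normalized histograms; a small NumPy sketch (OpenCV's `cv2.compareHist` offers equivalent built-in measures):

```python
import numpy as np

def intersection(h1, h2):
    # sum of bin-by-bin minima; 1.0 for identical normalized histograms
    return np.minimum(h1, h2).sum()

def chi_square(h1, h2, eps=1e-10):
    # 0.0 for identical histograms, grows with dissimilarity
    return (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()
```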
Efficient way of summing pixels in image regions
Obtained by replacing each pixel by the sum of all the pixels located inside the upper-left quadrant delimited by this pixel
The integral image can be computed by scanning the image once
◦ the integral value of a current pixel is given by the integral value of the pixel above plus the value of the cumulative sum of the current line
Any summation over a rectangular region can be obtained through four pixel accesses
◦ A-C-B+D
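A minimal NumPy sketch of the integral image and of a rectangular sum obtained with four accesses:

```python
import numpy as np

def integral_image(img):
    # cumulative sum along rows then columns
    # (same content as cv2.integral, without the extra zero row/column)
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    # sum over the inclusive rectangle [top..bottom] x [left..right]
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```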
Sometimes a fixed threshold does not give good results
A simple solution consists in comparing a pixel with the mean value of the pixels in a given neighborhood
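One way to do this with OpenCV's adaptive threshold; the 11x11 neighborhood and the offset constant are arbitrary choices:

```python
import cv2

img = cv2.imread("text.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image

# Each pixel is compared with the mean of its 11x11 neighborhood, minus 5.
binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 11, 5)
```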
Theory developed in the 1960s for the analysis of discrete signals
It defines a series of operators which transform an image
◦ by probing it with a predefined shape element
Most often applied on binary images (foreground/background)
The structuring element is the fundamental instrument in morphology
◦ It can be of any shape
It defines a configuration of pixels and an origin (anchor point)
◦ The origin of the structuring element is aligned with a given pixel
◦ its intersection with the image defines a set of pixels
◦ on which a particular morphological operation is applied
from: http://www.mif.vu.lt/atpazinimas/dip/FIP/fip-Morpholo.html
In morphology, the convention is to have foreground objects represented by high (white) pixel values and background by low (black) pixel values
Erosion replaces the current pixel with the minimum pixel value found in the defined pixel set
Dilation replaces the current pixel with the maximum pixel value found in the defined pixel set
The complement of an image is obtained by exchanging the foreground with the background
◦ For a gray-level image, it’s the negative of the image, i.e. 255-I
The erosion of an image is equivalent to the complement of the dilation of the complement image.
The dilation of an image is equivalent to the complement of the erosion of the complement image.
Closing is defined as the erosion of the dilation of an image
Opening is defined as the dilation of the erosion of an image
The closing operator fills the holes in an image
The opening operator removes small blobs in an image
Both the opening and the closing operators are idempotent
◦ When they are re-applied to an image, the same result is obtained
The Beucher gradient is defined as the difference between the dilated image and the eroded image
The white top-hat operator is defined as the difference between the image and its opening
The black top-hat operator is defined as the difference between the closing of an image and the image itself
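All of the operators above are available in OpenCV; a sketch on an assumed binary image (white foreground) with a 5x5 square structuring element:

```python
import cv2

binary = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)    # assumed binary image
se = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))   # structuring element

eroded   = cv2.erode(binary, se)                              # min over the pixel set
dilated  = cv2.dilate(binary, se)                             # max over the pixel set
opened   = cv2.morphologyEx(binary, cv2.MORPH_OPEN, se)       # dilation of the erosion
closed   = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, se)      # erosion of the dilation
gradient = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT, se)   # Beucher gradient
white_th = cv2.morphologyEx(binary, cv2.MORPH_TOPHAT, se)     # image - opening
black_th = cv2.morphologyEx(binary, cv2.MORPH_BLACKHAT, se)   # closing - image
```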
Seeing an image as a topological map
◦ Where dark regions are the valleys and brighter regions correspond to hills
A watershed segmentation is obtained by gradually flooding the image starting at level 0 and moving up
◦ As the level of "water" progressively increases (to levels 1, 2, 3, and so on), catchment basins are formed
◦ the water of two different basins will eventually merge
◦ When this happens, a watershed is created in order to keep the two basins separated from: http://fiji.sc/Classic_Watershed
from: http://masters.donntu.org/2010/fknt/tsibulka/library/article2_or.htm
To overcome the over-segmentation problem, the flooding process starts from a predefined set of marked pixels.
The basins created from these markers are labeled in accordance with the values assigned to the initial marks.
When two basins having the same label merge, no watersheds are created
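A minimal marker-based watershed sketch with OpenCV; the seed positions and labels are purely illustrative:

```python
import cv2
import numpy as np

img = cv2.imread("coins.png")                 # assumed 8-bit BGR image
markers = np.zeros(img.shape[:2], np.int32)   # 0 = unknown region to be flooded
markers[10, 10] = 1                           # assumed background seed
markers[120, 200] = 2                         # assumed object seed

cv2.watershed(img, markers)   # in place: basins take their marker label,
                              # watershed lines are set to -1
segmentation = (markers == 2).astype(np.uint8) * 255
```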
MSERs are also created by flooding the image level by level
But we are interested in the basins that remain relatively stable for a period of time during the immersion process.
These regions correspond to some distinctive parts of the scene objects pictured in the image
We are interested in MSERs having a certain area
◦ Minimum and maximum areas are set
The MSERs form a hierarchy
MSERs can be obtained by flooding the image from dark to bright or from bright to dark
From: http://www.icg.tu-graz.ac.at/Members/donoser/accv2007_donoser/ACCV_2007
From: http://opencv-code.com/tutorials/count-and-segment-overlapping-objects-with-watershed-and-distance-transform/
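A possible MSER sketch with OpenCV's Python API; the input image and the area bounds are assumptions, and the setter names follow the OpenCV 3/4 bindings:

```python
import cv2

gray = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image

mser = cv2.MSER_create()
mser.setMinArea(100)     # reject regions smaller than 100 pixels (assumed bound)
mser.setMaxArea(5000)    # reject regions larger than 5000 pixels (assumed bound)

regions, bboxes = mser.detectRegions(gray)   # one pixel list per stable region
```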
Looking at the gray-level variations in an image
◦ Some images contain large areas of almost constant intensity (for example, a blue sky)
◦ In other images, the gray-level intensities vary rapidly over the image
(for example, a busy scene crowded with many small objects).
The frequency of those variations in an image constitutes a way of characterizing an image.
◦ This point of view is referred to as the frequency domain
◦ while characterizing an image by observing its gray-level distribution is referred to as the spatial domain
The frequency domain analysis decomposes an image into its frequency content from the lowest to the highest frequencies
◦ Areas where the image intensities vary slowly contain only low frequencies
◦ high frequencies are generated by rapid changes in intensities.
◦ Several well-known transformations exist, such as the Fourier transform or the Cosine transform
A filter is an operation that amplifies certain bands of frequencies of an image while blocking (or reducing) other frequency bands.
A low-pass filter eliminates most of the high-frequency components of an image; a high-pass filter eliminates the low-frequency components.
A box filter replaces each pixel by the average value of the pixels around it
◦ the rapid intensity variations will be smoothed out and replaced by a more gradual transition
◦ It is a low-pass filter
The different weights of a filter can be represented using a matrix
◦ It shows the multiplying factors associated with each pixel position in the considered neighborhood.
◦ The central element of the matrix corresponds to the pixel on which the filter is currently applied.
◦ Such a matrix is called a kernel or a mask.
An average filter has the following mask:
1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
Applying a linear filter corresponds to moving a kernel over each pixel of an image
◦ multiplying each corresponding pixel by its associated weight.
from: https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
Mathematically, this operation is called a convolution
Iout(x, y) = Σi Σj Iin(x − i, y − j) · K(i, j)
Convolving an MxM image with an NxN kernel involves MxMxNxN multiply-accumulate operations (MACs)
◦ A bit less if you do not process the borders
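A direct (unoptimized) implementation of this convolution, showing where the MxMxNxN multiply-accumulates come from; borders are simply skipped:

```python
import numpy as np

def convolve2d(img, kernel):
    # naive convolution: Iout(y, x) = sum_j sum_i Iin(y - j, x - i) * K(j, i)
    n = kernel.shape[0] // 2
    out = np.zeros_like(img, dtype=np.float64)
    for y in range(n, img.shape[0] - n):
        for x in range(n, img.shape[1] - n):
            acc = 0.0
            for j in range(-n, n + 1):
                for i in range(-n, n + 1):
                    acc += img[y - j, x - i] * kernel[j + n, i + n]
            out[y, x] = acc
    return out

box3 = np.full((3, 3), 1.0 / 9.0)   # the 3x3 average (box) filter shown above
```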
In a Gaussian filter, the weight associated with a pixel decreases with its distance from the central pixel
G(x, y) = 1/(2πσ²) · e^(−(x² + y²)/(2σ²))
A 5x5 approximation (divide the sum by 273):
1  4  7  4  1
4 16 26 16  4
7 26 41 26  7
4 16 26 16  4
1  4  7  4  1
◦ The σ (sigma) value controls the width of the resulting Gaussian function.
The greater this value is, the flatter the function will be
e.g. σ = 0.5: [0.0 0.0 0.00026 0.10645 0.78657 0.10645 0.00026 0.0 0.0]
e.g. σ = 1.5: [0.00761 0.03608 0.10959 0.21345 0.26666 0.21345 0.10959 0.03608 0.00761]
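These coefficients can be reproduced by sampling and normalizing the Gaussian function; a NumPy sketch:

```python
import numpy as np

def gaussian_kernel_1d(sigma, size=9):
    # sample the Gaussian on integer positions centered at 0, then normalize to sum 1
    x = np.arange(size) - size // 2
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

print(np.round(gaussian_kernel_1d(1.5), 5))   # close to the sigma = 1.5 row above
# cv2.getGaussianKernel(9, 1.5) returns the same coefficients as a column vector,
# and cv2.GaussianBlur(img, (9, 9), 1.5) applies the separable 2D filter.
```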
A median filter replaces the current pixel by the median value of its neighbors
◦ the median of a set is the value at the middle position when the set is sorted
The median filter is not a linear filter
◦ it cannot be represented by a kernel matrix.
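In OpenCV this is a one-liner, here over a 5x5 neighborhood (size chosen arbitrarily):

```python
import cv2

img = cv2.imread("noisy.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image
denoised = cv2.medianBlur(img, 5)                     # 5x5 median filter
```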
Let f(x,y) be a continuous 2D function; the gradient of f is given by:
∇f(x, y) = (∂f/∂x, ∂f/∂y)
By definition, the directional derivative is maximal in the gradient direction
If f(x,y) represents a distribution of intensity values inside an image
◦ then there will be high gradient values when there is a rapid change in brightness
◦ There is an edge in an image at (x,y) if ‖∇f(x, y)‖ > Th
‖∇f(x, y)‖ = √((∂f/∂x)² + (∂f/∂y)²)   (norm of the gradient)
∠∇f(x, y) = tan⁻¹((∂f/∂y) / (∂f/∂x))   (orientation of the gradient)
The Sobel operator is a classic edge detection linear filter that is based on two simple 3x3 kernels
◦ Sobel is a high-pass filter
-1 -2 -1
 0  0  0
 1  2  1   (vertical gradient)

-1  0  1
-2  0  2
-1  0  1   (horizontal gradient)

‖∇I(x, y)‖ = √(Ix² + Iy²) ≈ |Ix| + |Iy| ≈ max(|Ix|, |Iy|)
Applying the following mask corresponds to smoothing the image and then differentiating it with respect to x
◦ This way, gradient masks of different sizes can be obtained
◦ They extract edges at different ‘scales’
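A Sobel gradient sketch with OpenCV, combining the two kernels into a gradient norm and orientation; the threshold Th is an arbitrary value:

```python
import cv2
import numpy as np

gray = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image

gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # derivative along x
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # derivative along y

magnitude = np.sqrt(gx**2 + gy**2)   # or |gx| + |gy|, or max(|gx|, |gy|)
orientation = np.arctan2(gy, gx)     # gradient orientation
edges = magnitude > 100              # assumed threshold Th
```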
The Laplacian of a function is
∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²
◦ The Laplacian is a second-order derivative
It is a high-pass filter
◦ It can be approximated by
0  1  0         1  1  1
1 -4  1   or    1 -8  1
0  1  0         1  1  1
The edges of an image are located at the zero-crossings of the Laplacian function
◦ No threshold
◦ Sub-pixel localization
◦ Very sensitive to noise
To reduce the sensitivity to noise
◦ low-pass the image beforehand using a Gaussian filter
It is also possible to compute the Laplacian using larger kernels
LoG(x, y) = −1/(πσ⁴) · [1 − (x² + y²)/(2σ²)] · e^(−(x² + y²)/(2σ²))
(example kernel computed with σ = 1.4)
DoG provides an approximation for LoG. From: P.J. Burt, E.H. Adelson, “The Laplacian Pyramid as a Compact Image Code”, IEEE Trans. on Communications, vol. COM-31, no. 4, April 1983.
CNNs are classifiers based on multiple layers of image convolutions
◦ The values of the kernel parameters are learned during a training phase
◦ Each layer is composed of a filter bank, a non-linearity and a feature pooling/sub-sampling
◦ The output of each layer is a set of feature maps
[Figure: each layer = multi-kernel convolution followed by pooling (max, average, …)]
CNNs are just scaled-up perceptrons!
◦ The perceptron was a simple linear classifier (a neuron) designed in the 50s
Weights are learned from the samples
◦ Multi-layer perceptrons were introduced
1 or 2 hidden layers
◦ Deep networks can have 5-10 layers
The first layers involve multiple convolutions with different kernels
The last few layers are (often fully-connected) perceptrons
[Figure: a feature vector fed into a perceptron with weights w1, w2, w3, w4]
This is a machine learning task
Positive samples and negative samples are used to train a classifier
◦ The machine must learn how to distinguish the objects of interest from the rest
◦ It must learn a function that separates the data into two groups:
negatives and positives
This is an offline process; once the function is learned, the classifier is ready to be used
◦ The classifier can be trained to tell if an image shows a certain class of object
◦ Or it can be trained as an object detector
[Figure: a trained classifier answering the question “Person?” with Yes/No]
The objective of object detection is to detect, in an image,
◦ specific objects (e.g. pedestrians)
◦ or class of objects (e.g. vehicles)
An object detector is a method that can tell if an object is present in an image sub-window
◦ For complete detection, we test all possible windows in all frames of a sequence
Apply the filter bank
◦ Each input image of size MxM is convolved with K kernels, each of size NxN
KxMxMxNxN MAC operations
Applying the non-linearity
◦ usually done through look-up tables
Performing pooling
◦ Pooling aggregates the values of a VxV region by applying an average or a max operation
◦ The image is subsampled by applying the pooling every P pixels
◦ (MxM)/(PxP) pooling operations over sets of size VxV
Each fully connected layer of a perceptron involves Li x Lo MAC operations, where Li and Lo are the numbers of neurons in the input and output layers
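These counts are easy to script; a small sketch that reproduces, for example, the layer-2 figure quoted on the AlexNet slides below (all sizes are illustrative):

```python
def conv_layer_macs(M, N, C, K):
    # K kernels of depth C and size NxN, applied at every position of an MxM output map
    return K * C * N * N * M * M

def pooling_ops(M, P):
    # one pooling operation every P pixels over an MxM map
    return (M * M) // (P * P)

def fc_layer_macs(L_in, L_out):
    # fully connected layer: one MAC per connection
    return L_in * L_out

print(conv_layer_macs(27, 5, 48, 256))   # 223,948,800 (AlexNet layer 2)
```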
AlexNet is a reference network because it won the 2012 ImageNet competition, making about 40% fewer errors than the next best competitor
◦ It is composed of 5 convolutional layers and uses
3D kernels
◦ The input is a color RGB image
◦ Computation is divided over 2 GPU architectures
◦ Learning uses artificial data augmentation and connection drop-out to avoid over-fitting
The first layer applies 96 kernels of size
3x11x11
◦ 34,848 parameters
◦ Each kernel is applied with a stride of 4 pixels
◦ 1,098,075 MACs
The second layer applies 256 kernels of size 48x5x5
◦ After applying a 2x2 max pooling
◦ 307,200 parameters
◦ 256x(48x5x5)x(27x27)=223,948,800 MACs
The third layer applies 384 kernels of size
128x3x3
◦ After applying a 2x2 max pooling
◦ 442,368 parameters
◦ 384x(128x3x3)x(13x13)=74,760,192 MACs
The fourth layer applies 384 kernels of size 192x3x3
◦ Without pooling
◦ 663,552 parameters
◦ 384x(192x3x3)x(13x13)=112,140,288 MACs
The fifth layer applies 256 kernels of size
192x3x3
◦ Without pooling
◦ 442,368 parameters
◦ 256x(192x3x3)x(13x13)=74,760,192 MACs
The output of the fifth layer (after a 2x2 max pooling) is connected to a fully connected 3-layer perceptron
◦ 1st layer: (2x6x6x128)x4096 = 37,748,736 connections
◦ 2nd layer: 4096x4096 = 16,777,216 connections
◦ 3rd layer: 4096x1000 = 4,096,000 connections
Histograms of gradients can be used to create an image representation that is illumination invariant
This is the Histogram of Oriented Gradients (HOG) descriptor
◦ Many classifiers use the HOG representation as input
The image is subdivided into cells
◦ e.g. 8x8 cells from: http://vbie.eic.nctu.edu.tw/technical.php?index=57
◦ Gradient orientation is computed for each pixel of a cell
Orientations are discretized into a small number of values
e.g. 9 values over 180º (unsigned gradients)
For each cell, a histogram is created by having each pixel vote for its orientation bin
◦ weighted by the gradient magnitude
A block is created by grouping cells
◦ e.g. 1 block = 2x2 cells
◦ The histograms of a block are concatenated to form a vector
that is then normalized (to sum to one)
The different blocks can overlap
(i.e. share cells) from: http://www.cs.cornell.edu/courses/cs4670/2013fa/lectures/lectures.html
from: HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba, ICCV 2013.
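A sketch of a HOG descriptor with these parameters (8x8 cells, 2x2-cell blocks, 9 bins) using OpenCV; the 64x128 window size and the input image are assumptions:

```python
import cv2

gray = cv2.imread("pedestrian.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image

# winSize, blockSize, blockStride, cellSize, nbins
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

window = cv2.resize(gray, (64, 128))   # one detection window
descriptor = hog.compute(window)       # concatenated, normalized block histograms
```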
The Canny edge detector uses the result of a regular edge detector, to which two different thresholds are applied
The Canny algorithm combines these two edge maps in order to produce an "optimal" map of contours.
It keeps only the edge points of the low-threshold edge map for which a continuous path of edges exists, linking that edge point to an edge belonging to the high-threshold edge map.
◦ all edge points of the high-threshold map are kept
◦ all isolated chains of edge points in the low-threshold map are removed.
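With OpenCV, the low and high thresholds are passed directly; the values here are arbitrary:

```python
import cv2

gray = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image
contours = cv2.Canny(gray, 50, 150)   # low threshold 50, high threshold 150
```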
Equation of a line
ρ = x cos θ + y sin θ
A line is then a point in the ( ϱ,ϴ ) space
Create an accumulator A( ϱ,ϴ )
◦ Each dimension is quantized in a number of bins
For each point (x,y)
◦ For each ϴ
Compute ϱ
Increment A( ϱ,ϴ )
Lines are found at local maxima of A
Here are the lines that received at least
80 votes or 50 votes
First, instead of systematically scanning the image row-by-row, points are chosen in random order
Whenever an entry of the accumulator reaches the specified minimum value, the image is scanned along the corresponding line and all points passing through it are removed
◦ even if they have not voted yet
◦ This scanning also determines the length of the segments that will be accepted:
minimum length for a segment
maximum pixel gap that is permitted to form a continuous segment.
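A sketch of both Hough variants with OpenCV; all thresholds and lengths below are illustrative:

```python
import cv2
import numpy as np

gray = cv2.imread("building.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image
edges = cv2.Canny(gray, 50, 150)

# Standard Hough: accumulator A(rho, theta), keep lines with at least 80 votes.
lines = cv2.HoughLines(edges, 1, np.pi / 180, 80)

# Probabilistic Hough: also returns segment end points, with a minimum
# segment length and a maximum allowed gap.
segments = cv2.HoughLinesP(edges, 1, np.pi / 180, 80,
                           minLineLength=30, maxLineGap=5)
```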
Would require a 3D accumulator
◦ A(r,x,y)
Observing that the gradient points in the direction of the circle center
◦ Each point votes for possible centers
Using Rmin and Rmax
◦ Once centers are identified, each point votes for a radius
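OpenCV's gradient-based circle detection follows this idea; a sketch with assumed parameter values (Rmin = 10, Rmax = 80):

```python
import cv2

gray = cv2.imread("coins.png", cv2.IMREAD_GRAYSCALE)   # assumed 8-bit image

circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                           param1=150,   # Canny high threshold used internally
                           param2=40,    # accumulator (vote) threshold
                           minRadius=10, maxRadius=80)
```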
The bounding box of a component is defined as the upright rectangle of minimum size that completely contains the shape
The minimum enclosing circle is the minimum circle that contains the shape
The polygonal approximation of a component is created by specifying an accuracy parameter giving the maximal acceptable distance between a shape and its simplified polygon
The convex hull, or convex envelope, of a shape is the minimal convex polygon that encompasses the shape.
◦ It can be visualized as the shape that an elastic band would take if placed around the component
Polygonal approximation
Convex hull
Bounding box
Enclosing circle
Image moments
◦ Centroid: (M10/M00, M01/M00)
Central moments
◦ Translation invariant
Hu moments
◦ Rotation invariant
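All of these descriptors are available on a contour (connected-component outline) in OpenCV; a sketch assuming `contour` comes from `cv2.findContours` on a binary image:

```python
import cv2

binary = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)   # assumed binary image
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contour = contours[0]   # one connected component

x, y, w, h = cv2.boundingRect(contour)                  # bounding box
(cx, cy), radius = cv2.minEnclosingCircle(contour)      # minimum enclosing circle
polygon = cv2.approxPolyDP(contour, 3.0, closed=True)   # polygonal approximation
hull = cv2.convexHull(contour)                          # convex hull

m = cv2.moments(contour)
centroid = (m["m10"] / m["m00"], m["m01"] / m["m00"])   # (M10/M00, M01/M00)
hu = cv2.HuMoments(m)                                   # rotation-invariant Hu moments
```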