Computer Vision


Introduction to Computer Vision

Robert Laganière uOttawa 2015

A. The human visual system

Vision:

◦ perception from visual information, including objects, colors, distances, structure, and motion.

◦ The vision process is also influenced by memory.

Perception:

◦ the faculty of capturing the environment using senses and mind

What is the main organ of vision?

◦ The brain

◦ The eyes simply project light on the retina

The visual process

Light waves from an object enter the eye through the pupil, the circular aperture in the iris.

The light waves are converged first by the cornea, and then further by the crystalline lens, to a nodal point located immediately behind the back surface of the lens.

◦ At that point, the image becomes inverted

The light progresses through the gelatinous vitreous humor and, ideally, back to a clear focus on the retina

In the retina, light impulses are changed into electrical signals and then sent along the optic nerve and back to the occipital (posterior) lobe of the brain

◦ which interprets these electrical signals as visual images

from: http://www.tedmontgomery.com/the_eye/

Types of eyes

The compound (lenticular) eye

◦ Insects

The projective (camera) eye

Animals have two eyes

◦ for depth perception

◦ for panoramic view

From http://ebiomedia.com/gall/eyes/octopus-insect.html

from http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/V/Vision.html

The human eye

130 million photosensitive cells

1 million nerve cells exiting each eye

One blind spot

When the ciliary muscle is relaxed, its diameter increases and the lens is flattened

Distance from lens to fovea: 17mm

Distance between eyes: 50 to 70mm

Depth perception: from 25cm to 450m

Involuntary eye movements: 10 to 15 seconds of arc every 0.3 sec

Visible portion of the electromagnetic spectrum:

400 to 700 nanometers

Photosensitive cells

Rod cells

Cone cells

◦ Visual angle subtended by fovea:

 1-2 degree for rod-free area;

 after 5 degrees, there is a drop of 50% in visual resolution

Rod cells

120 million rod cells

◦ mainly on the periphery

Angle for maximal rod density:

◦ 15 to 20 degrees

Wavelength of maximum rod sensitivity:

◦ 510 nm (green)

More sensitive to light (vision in the dark)

Many rods are connected to one nerve fiber

Responsible for object detection

Cone cells

6 million cone cells, smaller than the rods and about 500 times less sensitive

3 types of cones (color perception):

◦ red absorbing cones;

 those that absorb best at the relatively long wavelengths peaking at 565 nm

◦ green absorbing cones:

 with a peak absorption at 535 nm

◦ blue absorbing cones:

 with a peak absorption at 440 nm

Responsible for object identification

Relatively few animals have cone-based color vision (bees and birds do)

B. Image formation

Camera obscura

Photography

1816 Nicéphore Niépce’s heliography combines the camera obscura and a photo-sensitive coating on a copper plate

1837 Louis Daguerre's Daguerreotype uses silver salts to create images on silver-plated copper, developed using mercury vapor

1851 Frederick Scott Archer invented the negative/positive process allowing unlimited number of reproductions

Chronophotography

1882 Etienne-Jules Marey’s chronophotographic gun

1894 Thomas Edison's Kinetoscope

The lens equation

The thin-lens equation:

1/f = 1/u + 1/v

Magnification:

m = image_size / object_size = v / u = f / (u - f)
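As an illustration (not part of the original slides), a minimal numeric sketch in Python that solves the thin-lens relations above for an assumed focal length and object distance:

```python
# Thin-lens relations: 1/f = 1/u + 1/v and m = v/u = f/(u - f)
# Illustrative values only: a 50 mm lens focused on an object 2 m away.
f = 0.050   # focal length (m), assumed
u = 2.0     # object distance (m), assumed

v = 1.0 / (1.0 / f - 1.0 / u)    # image distance behind the lens
m = v / u                        # magnification (image_size / object_size)

print(f"image distance v = {v*1000:.2f} mm, magnification m = {m:.4f}")
print(f"check: f/(u - f) = {f / (u - f):.4f}")
```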

A few definitions

A normal lens is a lens that generates an image that presents a 'natural perspective'.

◦ Its focal length is roughly equal to the diagonal of the image format.

◦ This roughly approximates the perceived field of view of the human eye. In the 135 film format, the image size is 24x36mm (diagonal ≈ 43mm), so the normal lens is about 50mm.

The field of view (or angle of view) is the amount of a given scene captured in an image.

◦ The focal length and the sensor (film) size define the field of view (a small numeric sketch follows the list of sensor sizes below).

◦ Typical sensor sizes for cameras are:

 1/4 in. (2.4mm x 3.2mm), 1/3 in. (3.6mm x 4.8mm), 1/2 in. (4.8mm x 6.4mm), 2/3 in. (6.6mm x 8.8mm), 1 in. (9.6mm x 12.8mm)
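A small sketch (an addition to the slides) computing the angle of view from an assumed focal length and sensor dimensions:

```python
import math

# Angle of view: fov = 2 * atan(sensor_dimension / (2 * focal_length))
# Assumed values: 50 mm 'normal' lens on a 24x36 mm (135 format) frame.
f = 50.0   # focal length in mm (assumed)
for name, size in [("horizontal", 36.0), ("vertical", 24.0),
                   ("diagonal", math.hypot(36.0, 24.0))]:
    fov = 2.0 * math.degrees(math.atan(size / (2.0 * f)))
    print(f"{name} field of view: {fov:.1f} degrees")
```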

The aperture defines the size of the opening in the lens;

◦ it can be adjusted to control the amount of light reaching the digital sensor.

◦ The iris diaphragm, behind the lens, controls the lens opening.

◦ It is measured in f-stops: f-number = focal_length / aperture_diameter

◦ The aperture also controls the depth of field which is the distance in front and behind the subject that appears to be in focus.

 The smaller the aperture, the larger the depth of field.

The shutter speed defines the exposure to light.

Shutter speed and aperture regulate the degree of exposure to light.

Image capture

Radiance: the total amount of energy that flows from a light source (W)

Luminance: the amount of energy an observer receives from a source (lm)

Brightness: subjective descriptor of light perception

Reflectance: the fraction of radiant energy that is reflected from a surface

◦ 0.01 for black velvet

◦ 0.8 for flat white wall

◦ 0.93 for snow

Television

To convert luminance into an electric signal

1942 - Closed-circuit television

(CCTV)

◦ By the Germans to monitor the launch of V2 rockets

1951 - The Video Recorder

◦ invented by Charles Ginsburg at

Ampex corporation

The NTSC standard for broadcasting

An image is a 2D signal and an image sequence is a 3D signal.

◦ When serializing an image (for transmission or recording), it is read line by line.

◦ For a constant reading speed, one can:

 Increase the number of lines (image resolution), which reduces the frame rate (and causes temporal aliasing).

 Increase the frame rate, which reduces the number of lines per frame (and causes vertical aliasing).

◦ The tradeoff solution is to read the image in two passes:

 even lines first,

 then odd lines.

◦ This is interlaced scanning.

from: http://zone.ni.com/devzone/conceptd.nsf/webmain/

BA741C90A118EA778625685E00805643?opendocument&node=201713_us

The NTSC standard for broadcasting

The difficulty with interlaced scanning is that the image must be written (on a display device) the same way it has been read (by a camera):

◦ a synchronization signal must be added.

◦ Composite signal = luminance + synch.

 Horizontal synchronization

 (18% of the total signal, 2.3µs each)

 Vertical synchronization

 (8% of the total signal, 27.1µs each)

from: http://zone.ni.com/devzone/devzoneweb.nsf/Opendoc?openagent&BB087524D4052C9E8625685E0080301B

Video standard

NTSC (1941):

◦ Total number of lines: 525

◦ Number of active lines: 483

◦ Aspect ratio: 4:3

◦ Line frequency: 15.75KHz

◦ Field frequency: 59.94Hz

◦ Bandwidth (monochrome): 4.2MHz

CCIR601 (1986):

◦ Number of pixels per line: 720

◦ 525 lines

◦ YUV 4:2:2

◦ Transmission rate (color): 216Mb/s

HDTV (1996):

◦ Aspect ratio: 16:9

◦ Number of pixels per line: 1920

◦ Number of lines: 1080

◦ ~2.1 Mpixels

Digital sensors

CCD (charge-coupled device)

◦ An array of discrete imaging elements

(photon detector)

◦ Each photosite is composed of a photodiode and an adjacent charge transfer region arranged in columns

◦ The photodiode accumulates an electric charge proportional to illumination (number of received photons) and exposure time (and also to temperature…).

 2 photons produce ~1 electron.

◦ Linear response to light intensity

◦ Uses an electronic shutter.

 If the light intensity is too low for a given exposure, it is possible to adjust the gain

(when generating the output signal)

◦ Complex

 Synchronization

◦ 2009 Nobel prize in Physics winners

Willard Boyle and George Smith

◦ To be discontinued in 2017… ?

from: https://www.microscopyu.com/articles/digitalimaging/ccdintro.html

Digital sensors

CMOS (complementary metal oxide semiconductor)

◦ Each photosite contains a photodiode, a resistor, an amplifier that changes electric charges to voltage and a select transistor

◦ Overlaying the entire pixel array is a grid of metal interconnects, which applies timing and readout signals

◦ Each pixel can be read individually

◦ CMOS sensors are produced using the same manufacturing process as microprocessors

 Low-cost

◦ Faster image data transfer rate

◦ Consume little power

 20-50mW vs 2-5W for CCDs

◦ Lower sensitivity to light

◦ Noisier

C. Digital images

The lens model

The pin-hole camera model

Only one ray of light per scene point

Image plane is moved in front of the focal point: hi = (f / do) · ho

From 3D points to image pixels

x = f · X / Z
y = f · Y / Z

In pixel coordinates:

x = fx · X / Z + u0
y = fy · Y / Z + v0

These are the intrinsic parameters of the camera
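A minimal sketch (not from the slides) of projecting a 3D point to pixel coordinates with these intrinsic parameters; the numeric values of fx, fy, u0, v0 are illustrative assumptions:

```python
import numpy as np

# Assumed intrinsic parameters (in pixels)
fx, fy = 800.0, 800.0          # focal length expressed in pixel units
u0, v0 = 320.0, 240.0          # principal point (image centre)

K = np.array([[fx, 0.0, u0],
              [0.0, fy, v0],
              [0.0, 0.0, 1.0]])

P = np.array([0.2, -0.1, 2.0])        # a 3D point (X, Y, Z) in camera coordinates
p = K @ P                             # homogeneous image point, scaled by Z
x, y = p[0] / p[2], p[1] / p[2]       # x = fx*X/Z + u0, y = fy*Y/Z + v0
print(f"pixel coordinates: ({x:.1f}, {y:.1f})")
```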

Calibrating a camera

We need many known 3D points and their image projections

◦ from a single image:

s [x y 1]^T = [ fx   0  u0  0 ] [X Y Z 1]^T
              [  0  fy  v0  0 ]
              [  0   0   1  0 ]

◦ or, from multiple images of a 2D target:

s [x y 1]^T = [ fx   0  u0 ] [ r1 r2 r3 t1 ] [X Y Z 1]^T
              [  0  fy  v0 ] [ r4 r5 r6 t2 ]
              [  0   0   1 ] [ r7 r8 r9 t3 ]

Calibrating the lens

Lens distortion adds more parameters

◦ The farther away pixels are from the center of the camera, the more distorted they are
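In practice, calibration from multiple views of a planar target is what OpenCV's calibration routine implements; a hedged Python/OpenCV sketch (not part of the slides), assuming a chessboard target and placeholder file names:

```python
import cv2
import numpy as np

# Assumed 9x6 inner-corner chessboard; the 3D target points lie in the Z=0 plane.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in ["view1.jpg", "view2.jpg", "view3.jpg"]:   # hypothetical file names
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimates fx, fy, u0, v0 plus distortion coefficients and one (R, t) per view.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("intrinsic matrix:\n", K)
```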

3. Color representation

What is a color? It is a spectral power distribution (inside the visible spectrum) of the light reflected or transmitted by an object.

◦ Many different spectral power distributions may form the same color.

◦ A pure color is a color composed of only one wavelength (the colors of the rainbow); also called monochromatic color.

◦ A color has a given hue, a given saturation and a given brightness.

Metamer: either of two colors of different spectral composition that appear identical to the eye of a single observer under some lighting conditions.

Color primaries

For a human observer, it is possible to find a metamer for any color by varying only three primaries

3 techniques

◦ By subtraction (painting) [RBY]

◦ By addition (photography) [RGB]

◦ Hybrid (color printing): printing inks cannot mix; only one color of ink can be allowed to be on a particular point of the picture. [CMYK]

Standard primaries

CIE (Commission Internationale de l'Éclairage) is the primary organization that defines color metric standards.

One possible choice for the (monochromatic) primaries (CIE RGB):

◦ Red (700 nm)

◦ Green (546.1 nm)

◦ Blue (435.8 nm)

Gamut: the entire range of colors that a system can reproduce.

Chromaticity coordinates: ratio of each tristimulus value to their sum

CIE 1931 color space

from: http://en.wikipedia.org/wiki/CIE_1931_color_space

CIE XYZ primaries

But X,Y, Z are not visible colors!

CCIR Rec709 primaries

[ R ]   [  3.240479  -1.537150  -0.498535 ]   [ X ]
[ G ] = [ -0.969256   1.875992   0.041556 ] * [ Y ]
[ B ]   [  0.055648  -0.204043   1.057311 ]   [ Z ]

[ X ]   [ 0.412453   0.357580   0.180423 ]   [ R ]
[ Y ] = [ 0.212671   0.715160   0.072169 ] * [ G ]
[ Z ]   [ 0.019334   0.119193   0.950227 ]   [ B ]

YUV or YCrCb (CCIR 601)

For digital color television

RGB is converted to YCrCb, mainly for more efficient coding

◦ The Y signal corresponds to the B&W television signal:

 Y = 0.299R + 0.587G + 0.114B

◦ U (Cb) and V (Cr) are obtained by subtracting the luminance from the B and R values (they can be negative)

 Cr= 0.5R-0.4187G-0.0813B (red to yellow)

 Cb= -0.1687R-0.3313G+0.5B (blue to yellow)

8-bit representation:

 Y8= 219Y+16

 Cr= 112(R-Y)/0.701 + 128

 Cb= 112(B-Y)/0.886 + 128
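A minimal sketch (an addition to the slides) applying the formulas above to a single color; it assumes R, G, B are normalized to [0, 1], which is how the 8-bit offsets above are usually stated:

```python
# Conversion following the formulas above; R, G, B assumed normalized to [0, 1].
def rgb_to_ycrcb(R, G, B):
    Y = 0.299 * R + 0.587 * G + 0.114 * B
    Y8 = 219.0 * Y + 16.0                    # 8-bit luma
    Cr = 112.0 * (R - Y) / 0.701 + 128.0     # 8-bit red-difference chroma
    Cb = 112.0 * (B - Y) / 0.886 + 128.0     # 8-bit blue-difference chroma
    return Y8, Cr, Cb

print(rgb_to_ycrcb(1.0, 0.0, 0.0))    # pure red -> (~81.5, 240, ~90)
print(rgb_to_ycrcb(0.5, 0.5, 0.5))    # mid gray -> (~125.5, 128, 128)
```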

Hue Saturation Brightness

Saturation is a measure of how vivid the color is

(purity, colorfulness).

Hue defines the color type (expressed as an angle)

Brightness is a visual sensation of the color intensity

Lightness is the perceptual response to luminance

 (max(R,G,B)+min(R,G,B))/2

Intensity

 (R+G+B)/3

Value

 max(R,G,B)

Luminance scale

Brightness scale

HSV examples

Perceptually uniform color space

The perceptual difference between two colors is not proportional to their distance in the x-y color space

The color space can be made perceptually uniform through non-linear transformations (e.g. the CIE Lab or Luv spaces)

Color cameras

 with three sensors

 with one sensor and Bayer pattern

◦ Each color component can be linearly interpolated from its two (or four) nearest neighbors.

from: http://www.siliconimaging.com/

4. Histograms

A histogram is a simple table giving the number of pixels that have a given value in an image

◦ or sometimes in a set of images.

The histogram of an 8-bit gray-level image has 256 entries (bins)
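For instance (an added sketch, file name is a placeholder), such a histogram can be computed with NumPy or OpenCV:

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# 256-bin histogram over the full 8-bit range, two equivalent ways.
hist_np, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
hist_cv = cv2.calcHist([img], [0], None, [256], [0, 256])

print(hist_np.sum(), img.size)   # every pixel falls into exactly one bin
```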

Thresholding an image

◦ pixels are separated into ~background and ~foreground classes according to a threshold value

Look-up table

A look-up table is a simple one-to-one

(or many-to-one) function that defines how pixel values are transformed into new values.

◦ For an 8-bit image: a 1D array of 256 entries.

◦ Entry i gives the new intensity value assigned to gray level i

Contrast enhancement

By stretching the histogram

◦ Look-up table: 255.0 * (i - i_min) / (i_max - i_min)
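A small sketch (added here) building this stretching look-up table and applying it with OpenCV; the image name is a placeholder:

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

imin, imax = int(img.min()), int(img.max())
i = np.arange(256, dtype=np.float32)

# Look-up table: 255.0 * (i - imin) / (imax - imin), clipped to [0, 255]
lut = np.clip(255.0 * (i - imin) / max(imax - imin, 1), 0, 255).astype(np.uint8)

stretched = cv2.LUT(img, lut)   # entry lut[i] replaces every pixel of value i
```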

Histogram equalization

To obtain a flat histogram

◦ Based on the normalized cumulative histogram p

◦ Look-up table: 255*p[i]
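A minimal sketch (an addition) of equalization from the normalized cumulative histogram; OpenCV's cv2.equalizeHist does the same thing for 8-bit images:

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
p = hist.cumsum() / img.size            # normalized cumulative histogram
lut = (255.0 * p).astype(np.uint8)      # look-up table: 255 * p[i]

equalized = cv2.LUT(img, lut)
# equivalently: equalized = cv2.equalizeHist(img)
```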

Histogram back-projection

To detect specific image content

1. Build the histogram of an image subwindow

2. Normalize the histogram

 The histogram is now viewed as a probability function that represents the image content

Histogram back-projection

3. Use the extracted histogram as a look-up table in another image

4. Apply a threshold in order to extract possible content locations

 Possible improvement: use the ratio of the object's histogram over the image's histogram (saturated at 1.0)
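A hedged Python/OpenCV sketch of these steps, using the hue channel as suggested on the next slide; file names and window coordinates are placeholders:

```python
import cv2
import numpy as np

img = cv2.imread("scene.jpg")                      # placeholder file name
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# 1-2. Build and normalize the hue histogram of a sub-window containing the object.
x, y, w, h = 100, 100, 40, 40                      # placeholder sub-window
roi = hsv[y:y + h, x:x + w]
hist = cv2.calcHist([roi], [0], None, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

# 3. Back-project the histogram onto another image (here the same one).
backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)

# 4. Threshold to keep only the likely object locations.
_, mask = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY)
```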

Color back-projection

The appropriate color space must be selected, e.g. Hue

◦ Hue for unsaturated pixels is not reliable

The mean-shift algorithm

Tracking objects using histograms to represent their appearance

Mean-shift is an iterative algorithm that finds the region of highest density

◦ A window is moved in the direction of the center of mass

The mean-shift algorithm

1. Start from an initial position

2. Extract the histogram representation

The mean-shift algorithm

3. In the next frame, back-project this histogram

4. Iteratively move around in the direction of maximal probability
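A hedged sketch of this histogram-based tracking with OpenCV's meanShift; the video name and initial window are assumptions:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("video.avi")                  # placeholder file name
ok, frame = cap.read()

track_window = (200, 150, 60, 80)                    # assumed initial (x, y, w, h)
x, y, w, h = track_window
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv[y:y + h, x:x + w]], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 iterations or when the window moves by less than 1 pixel.
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    ret, track_window = cv2.meanShift(backproj, track_window, criteria)
```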

Histogram comparison

Image similarity can sometimes be measured by comparing histograms

Most histogram comparison measures are based on bin-by-bin comparisons

◦ Histogram intersection

◦ Chi-square test

Images ordered by histogram similarity

Integral images

Efficient way of summing pixels in image regions

Obtained by replacing each pixel by the value of the sum of all the pixels located inside the upper-left quadrant delimited by this pixel

The integral image can be computed by scanning the image once

◦ the integral value of a current pixel is given by the integral value of the pixel above plus the value of the cumulative sum of the current line

Any summation over a rectangular region can be obtained through four pixel accesses

◦ A-C-B+D
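A minimal NumPy sketch (an addition) of building an integral image and summing an arbitrary rectangle with four accesses:

```python
import numpy as np

img = (np.random.rand(480, 640) * 255).astype(np.uint8)   # stand-in image

# Integral image: each entry holds the sum of all pixels above and to the left.
# A leading row/column of zeros makes the four-access formula uniform at borders.
ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] using four integral-image accesses."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

assert rect_sum(ii, 10, 20, 60, 80) == int(img[10:60, 20:80].sum())
```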

Adaptive thresholding

Sometimes a fixed threshold does not give good results

A simple solution consists in comparing a pixel with the mean value of the pixels in a given neighborhood
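This local-mean comparison is what OpenCV's adaptive thresholding does; a short hedged sketch (block size and offset are assumptions):

```python
import cv2

img = cv2.imread("text.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# Each pixel is compared with the mean of its 11x11 neighbourhood minus a small
# offset C = 5; pixels above that local threshold become white.
binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 11, 5)
```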

5. Mathematical morphology

Theory developed in the 1960s for the analysis of discrete signals

It defines a series of operators which transform an image

◦ by probing it with a predefined shape element

Most often applied to binary images (foreground / background)

The structuring element

The structuring element is the fundamental instrument of morphology

◦ It can be of any shape

It defines a configuration of pixels and an origin (anchor point)

◦ The origin of the structuring element is aligned with a given pixel

◦ its intersection with the image defines a set of pixels

◦ on which a particular morphological operation is applied

A binary image

from: http://www.mif.vu.lt/atpazinimas/dip/FIP/fip-Morpholo.html

In morphology, the convention is to have foreground objects represented by high (white) pixel values and background by low (black) pixel values

Erosion

Erosion replaces the current pixel with the minimum pixel value found in the defined pixel set

Erosion example (3x3 SE)

Erosion example (7x7 SE)

Dilation

Dilation replaces the current pixel with the maximum pixel value found in the defined pixel set

Dilation example
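A hedged OpenCV sketch of erosion and dilation with a square structuring element (file name and size are assumptions):

```python
import cv2

binary = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# 3x3 square structuring element; the anchor defaults to its centre.
se = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))

eroded = cv2.erode(binary, se)     # minimum over the structuring element
dilated = cv2.dilate(binary, se)   # maximum over the structuring element
```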

Complementary operators

The complement of an image is obtained by exchanging the foreground with the background

◦ For a gray-level image, it’s the negative of the image, i.e. 255-I

The erosion of an image is equivalent to the complement of the dilation of the complement image.

The dilation of an image is equivalent to the complement of the erosion of the complement image.

Closing

Closing is defined as the erosion of the dilation of an image

Closing example

Opening

Opening is defined as the dilation of the erosion of an image

Opening example

Idempotency

The closing operator fills the holes in an image

The opening operator removes small blobs in an image

Both the opening and the closing operators are idempotent

◦ When they are re-applied to an image, the same result is obtained

Beucher gradient

The Beucher gradient is defined as the difference between the dilated image and the eroded image

Top-hat transform

The white top-hat operator is defined as the difference between the image and its opening

The black top-hat operator is defined as the difference between the closing of an image and the image itself
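These compound operators (opening, closing, morphological gradient, top-hats) map onto a single OpenCV call; a hedged sketch with an assumed structuring-element size:

```python
import cv2

img = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)      # placeholder file name
se = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))

opened   = cv2.morphologyEx(img, cv2.MORPH_OPEN, se)      # dilation of the erosion
closed   = cv2.morphologyEx(img, cv2.MORPH_CLOSE, se)     # erosion of the dilation
gradient = cv2.morphologyEx(img, cv2.MORPH_GRADIENT, se)  # dilated - eroded (Beucher)
tophat   = cv2.morphologyEx(img, cv2.MORPH_TOPHAT, se)    # image - opening (white top-hat)
blackhat = cv2.morphologyEx(img, cv2.MORPH_BLACKHAT, se)  # closing - image (black top-hat)
```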

The watershed algorithm

Seeing an image as a topological map

◦ Where dark regions are the valleys and brighter regions correspond to hills

A watershed segmentation is obtained by gradually flooding the image starting at level 0 and moving up

◦ As the level of "water" progressively increases (to levels 1, 2, 3, and so on), catchment basins are formed

◦ the water of two different basins will eventually merge

◦ When this happens, a watershed is created in order to keep the two basins separated

from: http://fiji.sc/Classic_Watershed

Watershed over-segmentation

from: http://masters.donntu.org/2010/fknt/tsibulka/library/article2_or.htm

Modified watershed

To overcome the over-segmentation problem, the flooding process starts from a predefined set of marked pixels.

The basins created from these markers are labeled in accordance with the values assigned to the initial marks.

When two basins having the same label merge, no watersheds are created
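A hedged sketch of marker-controlled watershed with OpenCV; the marker placement below is deliberately crude and the file name is a placeholder (cv2.watershed expects a 3-channel image and an int32 marker image):

```python
import cv2
import numpy as np

img = cv2.imread("cells.jpg")                              # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Very crude markers: confident foreground labeled 2, confident background 1,
# uncertain pixels left at 0 to be flooded.
_, fg = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
_, bg = cv2.threshold(gray, 50, 255, cv2.THRESH_BINARY_INV)
markers = np.zeros(gray.shape, dtype=np.int32)
markers[fg > 0] = 2
markers[bg > 0] = 1

cv2.watershed(img, markers)    # markers is modified in place; -1 marks watershed lines
segmentation = (markers == 2).astype(np.uint8) * 255
```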

MSER: Maximally Stable Extremal

Regions

The MSERs are also created by flooding the image level by level

But we are interested in the basins that remain relatively stable for a period of time during the immersion process.

These regions correspond to some distinctive parts of the scene objects pictured in the image

MSER

We are interested in MSERs having a certain area

◦ Minimum and maximum areas are set

The MSERs form a hierarchy

MSER example

MSERs can be obtained by flooding the image from dark to bright or from bright to dark

From: http://www.icg.tu-graz.ac.at/Members/donoser/accv2007_donoser/ACCV_2007

Watershed example

From: http://opencv-code.com/tutorials/count-and-segment-overlapping-objects-with-watershed-and-distance-transform/

6. Filtering

Looking at the gray-level variations in an image

◦ Some images contain large areas of almost constant intensity (for example, a blue sky)

◦ In other images, the gray-level intensities vary rapidly over the image

(for example, a busy scene crowded with many small objects).

The frequency of those variations in an image constitutes a way of characterizing an image.

◦ This point of view is referred to as the frequency domain

◦ while characterizing an image by observing its gray-level distribution is referred to as the spatial domain

The frequency domain analysis decomposes an image into its frequency content from the lowest to the highest frequencies

◦ Areas where the image intensities vary slowly contain only low frequencies

◦ high frequencies are generated by rapid changes in intensities.

◦ Several well-known transformations exist, such as the Fourier transform or the Cosine transform

Filters

 A filter is an operation that amplifies certain bands of frequencies of an image while blocking (or reducing) other image frequency bands.

A low-pass filter is a filter which eliminates most of the high-frequency components of an image; a high-pass filter eliminates the low-frequency components.

Box filter

A box filter replaces each pixel by the average value of the pixels around it

◦ the rapid intensity variations will be smoothed out and replaced by a more gradual transition

◦ It is a low-pass filter

Kernel

The different weights of a filter can be represented using a matrix

◦ It shows the multiplying factors associated with each pixel position in the considered neighborhood.

◦ The central element of the matrix corresponds to the pixel on which the filter is currently applied.

◦ Such a matrix is called a kernel or a mask.

An average filter has the following mask:

1/9 1/9 1/9

1/9 1/9 1/9

1/9 1/9 1/9

Convolution

Applying a linear filter corresponds to moving a kernel over each pixel of an image

◦ and multiplying each corresponding pixel by its associated weight.

 from: https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html

Mathematically, this operation is called a convolution:

I_out(x, y) = Σ_i Σ_j I_in(x - i, y - j) · K(i, j)

Convolving an MxM image with an NxN kernel involves MxMxNxN multiply-accumulate operations (MACs)

◦ A bit fewer if you do not process the borders
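A short sketch (added) applying the 3x3 averaging kernel above with OpenCV's general filtering routine; note that filter2D computes a correlation, which is identical to a convolution for symmetric kernels:

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

kernel = np.ones((3, 3), np.float32) / 9.0             # the 1/9 box kernel above
smoothed = cv2.filter2D(img, -1, kernel)                # -1: keep the input depth
# equivalently: smoothed = cv2.blur(img, (3, 3))
```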

Gaussian filter

In a Gaussian filter, the weight assigned to a pixel decreases with its distance from the central pixel, following the Gaussian function:

G(x, y) = (1 / (2πσ²)) · e^(-(x² + y²) / (2σ²))

◦ The σ (sigma) value controls the width of the resulting Gaussian function.

 The greater this value is, the flatter the function will be.

An example 5x5 kernel (to be divided by 273):

1  4  7  4  1
4 16 26 16  4
7 26 41 26  7
4 16 26 16  4
1  4  7  4  1

Gaussian filter

 e.g. σ = 0.5:
[0.0 0.0 0.00026 0.10645 0.78657 0.10645 0.00026 0.0 0.0]

 e.g. σ = 1.5:
[0.00761 0.036075 0.10959 0.21345 0.26666 0.21345 0.10959 0.03608 0.00761]
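Kernels like these can be generated and applied directly; a hedged OpenCV sketch (the image name and kernel size are assumptions):

```python
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# 1D Gaussian weights comparable to the values above (9 taps, sigma = 1.5).
g = cv2.getGaussianKernel(9, 1.5)
print(g.ravel())

# Separable 2D Gaussian smoothing with the same sigma.
blurred = cv2.GaussianBlur(img, (9, 9), 1.5)
```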

Gaussian filtered image

Median filter

A median filter replaces the current pixel by the median value of its neighbors

◦ the median of a set is the value at the middle position when the set is sorted

The median filter is not a linear filter

◦ it cannot be represented by a kernel matrix.

Median filtered image

Image gradient

Let f(x,y) be a continuous 2D function; the gradient of f is given by:

∇f(x, y) = ( ∂f/∂x , ∂f/∂y )

By definition, the directional derivative is maximal in the gradient direction.

If f(x,y) represents a distribution of intensity values inside an image,

◦ then there will be high gradient values where there is a rapid change in brightness

◦ There is an edge in the image at (x,y) if |∇f(x, y)| > Th

Norm of the gradient:

|∇f(x, y)| = sqrt( (∂f/∂x)² + (∂f/∂y)² )

Orientation of the gradient:

∠∇f(x, y) = tan⁻¹( (∂f/∂y) / (∂f/∂x) )

Sobel operator

The Sobel operator is a classic edge detection linear filter that is based on two simple 3x3 kernels

◦ Sobel is a high-pass filter

vertical gradient:

-1 -2 -1
 0  0  0
 1  2  1

horizontal gradient:

-1  0  1
-2  0  2
-1  0  1

Sobel image

Sobel image and edges

|∇I(x, y)| = sqrt( (∂I/∂x)² + (∂I/∂y)² ) ≈ |∂I/∂x| + |∂I/∂y| ≈ max( |∂I/∂x| , |∂I/∂y| )
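A hedged sketch computing the Sobel gradient magnitude and orientation with OpenCV (threshold and file name are assumptions):

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# Horizontal and vertical derivatives with the 3x3 Sobel kernels.
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)

magnitude = np.sqrt(gx**2 + gy**2)       # norm of the gradient
orientation = np.arctan2(gy, gx)         # orientation of the gradient (radians)

edges = (magnitude > 100).astype(np.uint8) * 255   # threshold Th chosen arbitrarily
```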

Image derivatives using Gaussian kernels

Applying the following mask corresponds to smoothing the image and then differentiating it with respect to x

◦ In this way, gradient masks of different sizes can be obtained

◦ They extract edges at different 'scales'

Laplacian

The Laplacian of a function is:

∇²f(x, y) = ∂²f/∂x² + ∂²f/∂y²

◦ The Laplacian is a second-order derivative

 It is a high-pass filter

◦ It can be approximated by

0  1  0        1  1  1
1 -4  1   or   1 -8  1
0  1  0        1  1  1

Edge detection and Laplacian

The edges of an image are located at the zero-crossings of the Laplacian function

◦ No threshold

◦ Sub-pixel localization

◦ Very sensitive to noise

Laplacian edge detection

To reduce the sensitivity to noise

◦ low-pass the image before

 using a gaussian filter
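A hedged OpenCV sketch of this smoothing-then-Laplacian approach (kernel size and sigma are assumptions):

```python
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# Low-pass first to reduce noise sensitivity, then take the second derivative.
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)
lap = cv2.Laplacian(blurred, cv2.CV_32F)   # default aperture: the 3x3 [0 1 0; 1 -4 1; 0 1 0] kernel

# Edges lie at the zero-crossings of 'lap' (sign changes between neighbouring pixels).
```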

Laplacian of Gaussian (LoG)

It is also possible to compute the Laplacian using larger kernels:

LoG(x, y) = -(1 / (πσ⁴)) · [ 1 - (x² + y²) / (2σ²) ] · e^(-(x² + y²) / (2σ²))

(example: σ = 1.4)

Difference of Gaussian (DoG)

DoG provides an approximation of the LoG

from: P.J. Burt and E.H. Adelson, The Laplacian Pyramid as a Compact Image Code, IEEE Transactions on Communications, vol. COM-31, no. 4, April 1983.

Zero-crossings of DoG

Convolutional Neural Networks

CNNs are classifiers based on multiple layers of image convolutions

◦ The values of the kernel parameters are learned during a training phase

◦ Each layer is composed of a filter bank, a non-linearity and a feature pooling/sub-sampling step

◦ The output of each layer is a set of feature maps

◦ A typical layer therefore chains a multi-kernel convolution with a pooling operation (max, average, …); a toy sketch follows
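A toy NumPy sketch (an addition, not the course's code) of one such layer: a bank of convolutions, a ReLU non-linearity and 2x2 max pooling; kernel values here are random stand-ins for learned parameters:

```python
import numpy as np

def conv_layer(image, kernels):
    """One CNN layer: multi-kernel (cross-)correlation + ReLU + 2x2 max pooling."""
    H, W = image.shape
    K, N, _ = kernels.shape                    # K kernels of size NxN
    out = np.zeros((K, H - N + 1, W - N + 1), dtype=np.float32)
    for k in range(K):                         # filter bank ('valid' convolution)
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[k, y, x] = np.sum(image[y:y + N, x:x + N] * kernels[k])
    out = np.maximum(out, 0.0)                 # non-linearity (ReLU)
    h, w = out.shape[1] // 2 * 2, out.shape[2] // 2 * 2
    pooled = out[:, :h, :w].reshape(K, h // 2, 2, w // 2, 2).max(axis=(2, 4))
    return pooled                              # K sub-sampled feature maps

feature_maps = conv_layer(np.random.rand(28, 28).astype(np.float32),
                          np.random.rand(8, 3, 3).astype(np.float32))
print(feature_maps.shape)                      # (8, 13, 13)
```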

Convolutional Neural Networks

CNNs are just scaled-up perceptrons!

◦ The perceptron was a simple linear classifier (a neuron) designed in the 1950s

 Weights w1, w2, w3, w4, … applied to a feature vector are learned from the samples

◦ Multi-layer perceptrons were introduced later

 1 or 2 hidden layers

◦ Deep networks can have 5-10 layers

 The first layers involve multiple convolutions with different kernels

 The last few layers are (often fully-connected) perceptrons

Training a CNN

This is a machine learning task

Positive samples and negative samples are used to train a classifier

◦ The machine must learn how to distinguish the objects of interest from the rest

◦ It must learn a function that separates the data into two groups

 negatives and positives

This is an offline process, once the function is learned, the classifier is ready to be used

◦ The classifier can be trained to tell if an image shows a certain class of object

◦ Or it can be trained as an object detector

Object detector

(illustration: a window classifier answering "Person?" with yes/no)

The objective of object detection is to detect, in an image,

◦ specific objects (e.g. pedestrians)

◦ or class of objects (e.g. vehicles)

An object detector is a method that can tell if an object is present in an image sub-window

◦ For complete detection, we test all possible windows in all frames of a sequence

Deep Learning builds a hierarchical representation

Complexity of a CNN classifier

Apply the filter bank

◦ Each input image of size MxM is convolved with K kernels, each of size NxN

 KxMxMxNxN MAC operations

Applying the non-linearity

◦ usually done through look-up tables

Performing pooling

◦ Pooling aggregates the values of a VxV region by applying an average or a max operation

◦ The image is subsampled by applying the pooling every P pixels

◦ (MxM)/(PxP) pooling operations over sets of size VxV

Each fully connected layer of a perceptron involves L_i x L_o MAC operations, where L_i and L_o are the numbers of neurons in the input and output layers

Example: AlexNet

AlexNet is a reference network because it won the 2012 ImageNet competition, making about 40% fewer errors than the next best competitor

◦ It is composed of 5 convolutional layers and uses 3D kernels

◦ The input is a color RGB image

◦ Computation is divided over 2 GPU architectures

◦ Learning uses artificial data augmentation and connection drop-out to avoid over-fitting

AlexNet in details

The first layer applies 96 kernels of size 3x11x11

◦ 34,848 parameters

◦ Each kernel is applied with a stride of 4 pixels

◦ 1,098,075 MACs

AlexNet in details

The second layer applies 256 kernels of size 48x5x5

◦ After applying a 2x2 max pooling

◦ 307,200 parameters

◦ 256x(48x5x5)x(27x27)=223,948,800 MACs

AlexNet in details

The third layer applies 384 kernels of size 128x3x3

◦ After applying a 2x2 max pooling

◦ 442,368 parameters

◦ 384x(128x3x3)x(13x13)=74,760,192 MACs

AlexNet in details

The fourth layer applies 384 kernels of size 192x3x3

◦ Without pooling

◦ 663,552 parameters

◦ 384x(192x3x3)x(13x13)=112,140,288 MACs

AlexNet in details

The fifth layer applies 256 kernels of size 192x3x3

◦ Without pooling

◦ 442,368 parameters

◦ 256x(192x3x3)x(13x13)=74,760,192 MACs

AlexNet in details

The output of the fifth layer (after a 2x2 max pooling) is connected to a fully connected 3-layer perceptron

◦ 1st layer:  (2x6x6x128) x 4096 = 37,748,736 connections

◦ 2nd layer:  4096 x 4096 = 16,777,216 connections

◦ 3rd layer:  4096 x 1000 = 4,096,000 connections

Histograms and Gradients

Histograms of gradients can be used to create an image representation that is illumination invariant

This is the Histogram of Oriented Gradients (HOG) descriptor

◦ Many classifiers use the HOG representation as input

Histogram of Oriented Gradients

The image is subdivided into cells

◦ e.g. 8x8 cells

from: http://vbie.eic.nctu.edu.tw/technical.php?index=57

◦ Gradient orientation is computed for each pixel of a cell

 Orientations are discretized into a small number of values

 e.g. 9 values over 180º (unsigned gradients)

Histogram of Oriented Gradients

For each cell, a histogram is created by having each pixel vote for an orientation bin

◦ weighted by the gradient magnitude

A block is created by grouping cells

◦ e.g. 1 block = 2x2 cells

◦ The histograms of a block are concatenated to form a vector

 that is then normalized (to sum to one)

 The different blocks can overlap (i.e. share cells)

from: http://www.cs.cornell.edu/courses/cs4670/2013fa/lectures/lectures.html
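A hedged sketch using OpenCV's HOG descriptor with the parameters mentioned above (8x8-pixel cells, 2x2-cell blocks, 9 orientation bins); the 64x128 window size is an assumption borrowed from the standard pedestrian detector:

```python
import cv2

img = cv2.imread("pedestrian.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name
img = cv2.resize(img, (64, 128))                            # assumed detection window

# Arguments: winSize, blockSize, blockStride, cellSize, nbins
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)
descriptor = hog.compute(img)
print(descriptor.size)   # 7 x 15 block positions x (2x2 cells x 9 bins) = 3780 values
```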

HOG visualization

from: HOGgles: Visualizing Object Detection Features, C. Vondrick, A. Khosla, T. Malisiewicz, A. Torralba, ICCV 2013.

HOG hallucination

7. Detecting lines and contours

Canny edge detector

The Canny edge detector uses the result of a regular edge detector to which two different thresholds are applied

Canny hysteresis thresholding

The Canny algorithm combines these two edge maps in order to produce an "optimal" map of contours.

It keeps only the edge points of the low-threshold edge map for which a continuous path of edges exists, linking that edge point to an edge belonging to the high-threshold edge map.

◦ all edge points of the high-threshold map are kept

◦ all isolated chains of edge points in the low-threshold map are removed.
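A one-call sketch with OpenCV (the two hysteresis thresholds are assumptions):

```python
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file name

# Low and high hysteresis thresholds; weak edges connected to a strong edge are kept.
edges = cv2.Canny(img, 50, 150)
```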

Line detection and the Hough transform

Equation of a line

ρ = x cos θ + y sin θ

A line is then a point in the (ρ, θ) space

Hough line detection algorithm

Create an accumulator A(ρ, θ)

◦ Each dimension is quantized into a number of bins

For each point (x, y):

◦ For each θ:

 Compute ρ

 Increment A(ρ, θ)

Lines are found at local maxima of A
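A compact sketch of this voting scheme in NumPy, followed by the equivalent OpenCV call (the 80-vote threshold is an assumption):

```python
import cv2
import numpy as np

edges = cv2.Canny(cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE), 50, 150)

# Accumulator A(rho, theta): 1-pixel rho bins, 1-degree theta bins.
h, w = edges.shape
diag = int(np.ceil(np.hypot(h, w)))
thetas = np.deg2rad(np.arange(180))
A = np.zeros((2 * diag, len(thetas)), dtype=np.int32)

ys, xs = np.nonzero(edges)
for x, y in zip(xs, ys):                       # each edge point votes once per theta
    rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int) + diag
    A[rhos, np.arange(len(thetas))] += 1
# lines correspond to local maxima of A

# The OpenCV equivalent:
lines = cv2.HoughLines(edges, 1, np.pi / 180, 80)
```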

Hough line detection

Here are the lines that received at least 80 votes or 50 votes

Probabilistic Hough transform

First, instead of systematically scanning the image row-by-row, points are chosen in random order

Whenever an entry of the accumulator reaches the specified minimum value, the image is scanned along the corresponding line and all points passing through it are removed

◦ even if they have not voted yet

◦ This scanning also determines the length of the segments that will be accepted:

 minimum length for a segment

 maximum pixel gap that is permitted to form a continuous segment.

Hough transform for circles

Would require a 3D accumulator

◦ A(r,x,y)

Observing that the gradient points in the direction of the circle center

◦ Each point votes for possible centers

 Using Rmin and Rmax

◦ Once the centers are identified, each point votes for a radius

Hough transform for circles

Shape descriptors

The bounding box of a component is defined as the upright rectangle of minimum size that completely contains the shape

The minimum enclosing circle is the minimum circle that contains the shape

The polygonal approximation of a component is created by specifying an accuracy parameter giving the maximal acceptable distance between a shape and its simplified polygon

The convex hull, or convex envelope, of a shape is the minimal convex polygon that encompasses the shape.

◦ It can be visualized as the shape that an elastic band would take if placed around the component
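These descriptors map directly onto OpenCV calls applied to a contour; a hedged sketch (file name and accuracy parameter are assumptions, OpenCV 4 return convention):

```python
import cv2

binary = cv2.imread("blob.png", cv2.IMREAD_GRAYSCALE)        # placeholder file name
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
c = max(contours, key=cv2.contourArea)                        # largest component

x, y, w, h = cv2.boundingRect(c)                              # upright bounding box
(cx, cy), radius = cv2.minEnclosingCircle(c)                  # minimum enclosing circle
poly = cv2.approxPolyDP(c, 3.0, True)                         # polygonal approximation (3.0 = accuracy)
hull = cv2.convexHull(c)                                      # convex hull
```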

Shape descriptors

(examples: polygonal approximation, convex hull, bounding box, enclosing circle)

Shape moments

Image moments

◦ Centroid: (M10/M00, M01/M00)

Central moments

◦ Translation invariant

Hu moments

◦ Rotation invariant
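A short sketch with OpenCV's moment functions (the file name is a placeholder; the seven rotation-invariant values are the Hu moments):

```python
import cv2

binary = cv2.imread("blob.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name

m = cv2.moments(binary, True)                            # treat the image as binary
cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]        # centroid (M10/M00, M01/M00)

hu = cv2.HuMoments(m)                                     # 7 rotation-invariant moments
```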
