Al Bovik's Lectures on Digital Image Processing
Professor Alan C. Bovik, Director
Laboratory for Image and Video Engineering
©Alan C. Bovik 2019

QUICK INDEX
• Module 1 – Course Introduction, Imaging Geometry, Perception, Pixels, Perceptrons
• Module 2 – Basics: Binary and Grayscale Image Processing, Multilayer Perceptrons
• Module 3 – Fourier Transform, Image Frequencies, Sampling, RBFs and SVMs
• Module 4 – Linear Filtering, Denoising, Restoration, Wavelets, ConvNets/CNNs
• Module 5 – Image Denoising, Deep Learning, Transfer Learning, ResNet, Autoencoders
• Module 6 – Image Compression, JPEG, Deep Compression
• Module 7 – Image Analysis I: Image Quality, Edge/Shape Detection
• Module 8 – Image Analysis II: Superpixels, Search, SIFT, Face Detection
• Module 9 – Image Analysis III: Cortical Models, Pattern Analysis, Stereopsis, Deep Stereo
• Module 10 – Neural Networks for Image Processing

Some Notes
• These lecture notes are the basis for the course Digital Image Processing, which I have taught at The University of Texas at Austin since 1991.
• I modified them significantly in 2019 to capture the Deep Learning revolution, which has deeply affected image processing. This is still a process!
• They are not quite the same as the classroom notes, since they are missing the hundreds of live demos running digital image processing algorithms. I hope you find them useful anyway.
• They are also missing dozens of visual illusions, of which I am a collector, since I am uncertain of their copyright status.
• If you use the notes, cite them as: A.C. Bovik, Al Bovik's Lecture Notes on Digital Image Processing, The University of Texas at Austin, 2017.
• Enjoy! Nothing is as fun as Digital Image Processing! Well, except Digital Video.

Module 1: Introduction
Introduction • Imaging Geometry • Visual Perception • Image Representation • Perceptrons

Course Objectives
• Learn Digital Image & Video Processing: theory; algorithms and programming; applications and projects.
• To have fun doing it.

The Textbooks
• The Essential Guides to Image and Video Processing, Al Bovik, Academic Press, 2009.
• Many chapters match the class notes.
• Full of illustrations and application examples.
• Many advanced chapters for projects/research.

SIVA Demonstration Gallery
• This course is multimedia, with hundreds of slides and dozens of live image/video processing demos.
• The demos are from SIVA – The Signal, Image and Video Audiovisual Demonstration Gallery, a collection of didactic tools that facilitate a gentle introduction to signal and image processing.
• Visit them at: http://live.ece.utexas.edu/class/siva/default.htm

What is this Image?
View from the Window at Le Gras (camera obscura; bitumen of Judea on pewter; currently on display at the Harry Ransom Center, The University of Texas at Austin).
Joseph Nicéphore Niépce, Saint-Loup-de-Varennes (France), 1826.

Joseph Nicéphore Niépce (1765-1833)
• French inventor.
• Also invented the Pyréolophore, the first internal combustion engine!
• Go see the image at the Harry Ransom Center (first floor, just as you walk in, to the right!)

Louis-Jacques-Mandé Daguerre (1787-1851)
• French inventor of the daguerreotype, the first commercially successful photographic process. A daguerreotype is a direct positive on a silvered copper plate.
• Also an accomplished painter and developer of the diorama theatre!
Optical Imaging Geometry
• Assume reflection imaging with visible light.
• Let's quantify the geometric relationship between 3-D world coordinates and projected 2-D image coordinates.
[Figure: a point light source emits rays that reflect off an object, pass through a lens of focal length f, and form an image on a sensing plate (CCD array, emulsion, etc.).]

3D-to-2D Projection
• Image projection is a reduction of dimension (3D-to-2D): 3-D information is lost. Getting this information back is very hard.
• It has been a topic of many years of intensive research: "Computer Vision."
[Figure: the lens center, the "field-of-view" cone, and the 2-D image.]

"The image is not the object" - René Magritte (1898-1967)

Perspective Projection
• There is a geometric relationship between 3-D space coordinates and 2-D image coordinates under perspective projection.
• We will require some coordinate systems:

Projective Coordinate Systems
Real-world coordinates:
• (X, Y, Z) denote points in 3-D space.
• The origin (X, Y, Z) = (0, 0, 0) is the lens center.
Image coordinates:
• (x, y) denote points in the 2-D image.
• The x-y plane is chosen parallel to the X-Y plane.
• The optical axis passes through both origins.

Pinhole Projection Geometry
• The lens is modeled as a pinhole through which all light rays hitting the image plane pass.
• The image plane is one focal length f from the lens. This is where the camera is in focus.
• The image is recorded at the image plane, using a photographic emulsion, CCD sensor, etc.
• The pinhole camera or camera obscura principle for recording or drawing is attributed to Leonardo da Vinci. [Photos: a 20-minute exposure with a modern camera obscura; a 17th-century camera obscura in use.]

Pinhole Projection Geometry
[Figure: idealized "pinhole" camera model, with focal length f, lens center at (X, Y, Z) = (0, 0, 0), and image plane.]
• Problem: in this model (and in reality), the image is reversed and upside down. It is convenient to change the model to correct this.

Upright Projection Geometry
[Figure: upright projection model (not to scale), with the image plane placed at distance f in front of the lens center (X, Y, Z) = (0, 0, 0).]
• Let us make our model more mathematical...
[Figure: all of the relevant coordinate axes and labels: the 3-D point (X, Y, Z) = (A, B, C) at depth C, and its projection (x, y) = (a, b) on the image plane at focal length f.]
• An equivalent simplified diagram shows only the relevant data relating (X, Y, Z) = (A, B, C) to its projection (x, y) = (a, b).

Similar Triangles
• Triangles are similar if their corresponding angles are equal.

Similar Triangles Theorem
• Similar triangles have their side lengths in the same proportions:
D/d = E/e = F/f, hence D/E = d/e, D/F = d/f, E/F = e/f, etc.

Solving Perspective Projection
• Similar triangles solve the relationship between 3-D space and 2-D image coordinates.
• Redraw the geometry once more, this time making apparent two pairs of similar triangles.
• By the Similar Triangles Theorem, we conclude that
a/f = A/C and b/f = B/C
• OR: (a, b) = (f/C)·(A, B) = (fA/C, fB/C)

Perspective Projection Equation
• The relationship between a 3-D point (X, Y, Z) and its 2-D image (x, y):
(x, y) = (f/Z)·(X, Y)
where f = focal length.
• The ratio f/Z is the magnification factor, which varies with the range Z from the lens center to the object plane. (A small sketch follows.)
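To make the projection equation concrete, here is a minimal numpy sketch (my own illustration, not from the notes; the function name is hypothetical):

```python
import numpy as np

def project(points_3d, f):
    """Pinhole projection (x, y) = (f/Z) * (X, Y).

    points_3d: (n, 3) array of (X, Y, Z) in lens-centered world
    coordinates, with Z > 0 (in front of the lens); f: focal length.
    """
    pts = np.asarray(points_3d, dtype=float)
    X, Y, Z = pts[:, 0], pts[:, 1], pts[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

# Both points lie on one ray through the lens center, so they project
# to the same image point: the depth information is lost.
print(project([[1.0, 2.0, 10.0], [2.0, 4.0, 20.0]], f=0.05))
```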
Straight Lines Under Perspective Projection
• Why do straight lines (or line segments) in 3-D project to straight lines in 2-D images?
• This is not true of lenses (e.g., "fish-eye") that do not obey the pinhole approximation.
[Figure: a 3-D line projecting through the lens center at (0, 0, 0) onto a 2-D line in the image plane.]
• To show this to be true, one could write the equation for a line in 3-D, and then project it to the equation of a 2-D line...
• Easier way:
- All rays through the lens center that touch the 3-D line lie in a single plane (a point and a line define a plane).
- The intersection of this plane with the image plane gives the projection of the line.
- The intersection of two (nonparallel) planes is a line.
- So, the projection of a 3-D line is a 2-D line.
• In image analysis (later), this property makes finding straight lines much easier!
• This property of lenses also makes it easier for us to navigate.

Cameras are Now Computers
• Steve Sasson and the first digital camera. The history of the digital camera is fascinating.

A/D Conversion
• Sampling and quantization.
• Sampling is the process of creating a signal that is defined only at discrete points, from one that is continuously defined.
• Quantization is the process of converting each sample into a finite digital representation.

Sampling
• Example: an analog video raster converted from a continuous voltage waveform into a sequence of voltage samples.
[Figure: the continuous electrical signal from one scanline, and the sampled electrical signal from the same scanline.]

Sampled Image
• A sampled image is an array of numbers (row, column) representing image intensities.
[Figure: depiction of a 10 x 10 image array of rows and columns.]
• Each of these picture elements is called a pixel.

Sampled Image
• The image array is rectangular (N x M), often with dimensions N = 2^P and M = 2^Q (why?)
• Examples of square images:
- P = Q = 7: 128 x 128 (2^14 ≈ 16,000 pixels)
- P = Q = 8: 256 x 256 (2^16 ≈ 65,500 pixels)
- P = Q = 9: 512 x 512 (2^18 ≈ 262,000 pixels)
- P = Q = 10: 1024 x 1024 (2^20 ≈ 1,000,000 pixels)
- 1920 x 1080 (= 2,073,600 pixels)

Sampling Effects
• It is essential that the image be sampled sufficiently densely; else the image quality will be severely degraded.
• This can be expressed via the Sampling Theorem, but the visual effects are most important (DEMO).
• With sufficient samples, the image appears continuous.

Sampling in Art
• Seurat - La Grande Jatte. This pointillist work took two years to create.

Quantization
• Each gray level is quantized: assigned an integer indexed from 0 to K-1.
• Typically K = 2^B possible gray levels.
• Each pixel is represented by B bits, where usually 1 ≤ B ≤ 8.
[Figure: a pixel and its 8-bit representation.]

Quantization
• The pixel intensities or gray levels must be quantized sufficiently densely so that excessive information is not lost.
• This is hard to express mathematically, but again, quantization effects are visually obvious (DEMO).

Image as a Set of Bit Planes
[Figure: an image decomposed into Bit Plane 1, Bit Plane 2, ..., Bit Plane B.]

The Image/Video Data Explosion
• Total storage for one digital image with 2^P x 2^Q pixels of spatial resolution and B bits/pixel of gray-level resolution is B x 2^(P+Q) bits.
• Usually B = 8, and often P = Q = 10. A common image size is 1 megabyte.
• Ten years ago this was a lot. These days digital cameras produce much larger images.

The Image/Video Data Explosion
• Storing 1 second of a 512 x 512 8-bit gray-level movie (TV rate = 30 images/sec) requires about 8 Mbytes.
• A 2-hour color theatre-quality raw 4K digital video: (3 bytes/color pixel) x (4096 x 2160 pixels/frame) x (60 frames/sec) x (3600 sec/hour) x (2 hours) requires about 11.5 terabytes of storage. That's a lot, even today. (The arithmetic is checked in the sketch below.)
• Later, we will discuss ways to compress digital images and videos.
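As a quick check of the storage arithmetic above (plain Python; the variable names are mine):

```python
# 1 second of a 512 x 512, 8-bit (1 byte/pixel), 30 frames/sec movie:
gray_second = 512 * 512 * 1 * 30
print(gray_second / 1e6, "MB")                 # ~7.9 MB

# 2 hours of raw 4K color video at 60 frames/sec, 3 bytes/pixel:
k4_movie = 3 * 4096 * 2160 * 60 * 3600 * 2
print(k4_movie / 1e12, "TB")                   # ~11.5 TB
```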
A Bit About Visual Perception
• In most cases, the intended receiver of the result of image/video processing or communications algorithms is the human eye.
• A fair amount is known about the eye. It is definitely a digital computation device:
- the neurons (rods, cones) sample and quantize
- the retinal ganglion and cortical cells linearly filter

The Eye - Structure
[Figure: densities of rods, cones, and ganglion cells (cells per degree) versus eccentricity (degrees); peak cone density is about 178,000-238,000 cones/mm² at the fovea, which spans roughly 1.5 mm.]
• Notice that image sampling at the retina is highly nonuniform!

An example of "foveated" art: Madame Henriot, Pierre-Auguste Renoir.

Eye Movement
• The eyes move constantly, to place/keep the fovea on places of interest. There are five major types of eye movement:
- saccadic (attentional)
- pursuit (smooth tracking)
- vestibular (head-movement compensating)
- microsaccadic (tiny; image persistency)
- vergence (stereoscopic)
• To demonstrate microsaccades, first fixate the center of the white dot for 10 sec, then fixate the small black dot. Small displacements of the afterimage are then obvious: the slow drifting movements as well as the corrective microsaccades.

Saccades and Fixations
• Eyes tracked using the "scleral coil."
• Scan patterns range from highly contextual to less contextual.

Visual Attention
• Eye movements are largely about visual attention.
• Attention is where conscious thought is directed, usually (not always) towards the point of visual fixation.
• It is related to the task the individual is engaged in.
• It is not easy to focus attention while engaged in complex tasks... try this: Attention Video. What about this poor fellow... Door Video.

Visual Eyetracking
• Inexpensive eyetrackers are becoming fast and accurate. We have several.
• Soon they could be packaged with monitors, video communications devices, etc.
• Basic technology: IR radiation reflected from the highly reflective retina and cornea.

Visual Eyetrackers
• Typical resolution: <0.5° (visual angle) at 60 Hz.
• Ergonomically friendly. Requires a 30-sec calibration; some new systems forego this.
[Photos: a desktop model, a tracked eye, and a wearable model.]

Dual Purkinje Eyetracker
• Dual Purkinje: accurate to one minute of arc at 400 Hz, at higher cost and less convenience.
• Measures the positional difference between the 1st Purkinje reflection (front of the cornea) and the 4th Purkinje reflection (rear of the crystalline lens).
[Photos: the SRI Generation V Dual Purkinje Eyetracker; precision recorded eye movements.]

Visual Limits
• When designing image/video algorithms, it is good to know the limitations of the visual system:
- spatial and temporal bandwidths
- resolving power
- color perception
- visual illusions: the eye is easily fooled!

Contrast Sensitivity Function
• Back in the 1960s, Campbell and Robson¹ conducted psychophysical studies to determine the human visual frequency response.
• It is known as the contrast sensitivity function.
¹F.W. Campbell and J.G. Robson, "Application of Fourier analysis to the visibility of gratings," Journal of Physiology, 1968.
Michelson Contrast
• Given any small image patch, the Michelson contrast of that patch is
C = (Lmax - Lmin) / (Lmax + Lmin)
• Lmax and Lmin are the maximum and minimum luminances (brightnesses) over the patch.

Spatial Sine Wave Gratings
• Sine wave grating (0 < C < 1):
I(x, y) = C·sin(Ux + Vy) + 1, with contrast = C
• (U, V) = spatial frequency in the (x, y) directions
• Orientation = tan⁻¹(V/U)
• Radial (propagating) frequency = √(U² + V²)

Campbell & Robson Experiments
• Campbell & Robson showed human subjects sine wave gratings of different frequencies and contrasts and recorded their visibility.
• They argued that the human visual system does Fourier analysis on retinal images.

Viewing Angle
• Contrast sensitivity was (and is) recorded as a function of viewing angle.
[Figure: the geometry relating the image plane, lens center, retina, and foveal region.]

A Campbell-Robson Grating
• Contrast increases downward; frequency increases rightward. This helps visualize the loss of visibility as a function of frequency and contrast.
[Figure: the Campbell-Robson grating (contrast vs. spatial frequency) compared with the CSF, plotted as normalized sensitivity versus spatial frequency (cycles/degree) on log-log axes.]

Contrast Sensitivity Function
• The typical human contrast sensitivity function (CSF) is band-pass.
• Peak response is around 4 cycles per degree (cy/deg), dropping off on either side.

Why is the CSF Important for Image Processing?
• Firstly, because of how we display pixels on today's digital displays: individual pixels aren't distinguishable (at a distance).
• Secondly, it enables considerable image compression.

Image Sampling and the CSF
• Digital images are spatially sampled continuous light fields.
• How many samples are needed depends on the sampling theorem, on the CSF, and on viewing distance.
• With sufficient samples, and at an adequate distance, the image appears continuous.

What About Color?
• Color is an important aspect of images.
• A color image is a vector-valued signal. At each pixel, the image has three values: Red, Green, and Blue.
• It is usually expressed as three images - the Red, Green, and Blue images: the RGB representation.
• Although color is important, in this class we will usually just process the intensity image I = R + G + B.
• Many color algorithms process the R, G, B components separately, like gray-scale images, then add the results.

Color is Important!
• Any color may be represented as a mixture of Red (R), Green (G), and Blue (B). RGB codes color video as three separate signals: R, G, and B.
• This is the representation captured by most color optical sensors. [Painting: The Boating Party - Renoir.]
• Color is important, although perhaps not necessary for survival.

Color Sensing: Raw RGB
• The sensor itself is not specific to color.
• Instead, a color filter array (CFA) is superimposed over the sensor array.
• The most common is the Bayer array.
• It has twice as many Green-tuned filters as Red-tuned or Blue-tuned: green constitutes more of the real-world visible spectrum, and the eyes are more sensitive to green wavelengths.

Color Sensing: Bayer CFA
• Each sensor pixel is thus Red or Green or Blue, but there are no RGB pixels (yet).
• These are often distinguished as "Raw RGB" vs. RGB.

Demosaicking (Color Interpolation)
• What is desired is a Red and Green and Blue value at every pixel of an RGB image.
• This is generally done by interpolating the known R, G, and B values to fill the places where they are unknown.
• A plethora of methods exist, and no standard, but they are usually as simple as replicating or averaging the nearest relevant values (see the sketch below).
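Here is a minimal sketch, under my own simplifying assumptions, of the "average the nearest relevant values" idea for the green channel of a Bayer raw image (real camera pipelines use more careful, edge-aware interpolation):

```python
import numpy as np

def demosaic_green(raw, green_mask):
    """Fill missing green samples by averaging the 4-connected neighbors.

    raw: 2-D array of raw CFA samples; green_mask: True where the Bayer
    pattern has a green filter. In a Bayer array, the 4-connected
    neighbors of every non-green site are all green.
    """
    padded = np.pad(raw, 1, mode="edge")       # replicate at the borders
    neighbor_avg = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                    padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
    return np.where(green_mask, raw, neighbor_avg)
```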
Simple Demosaicking
• Simply average the neighboring known values to "fill in" the missing Green, Red, and Blue samples.
• Result: the RGB color planes of an RGB image.

RGB Color Space
• A very large number of colors can be represented.
[Figure: an RGB image and its R, G, B, and intensity components.]

Why RGB?
• Any color can be represented with RGB. There is the usual notion of "primaries" (like painting, but the "primaries" are different).
• However, one can create color spaces using other "wavelength primaries," or without using colors at all.

Tristimulus Theory
• The cones in the center of the retina (fovea) of the human eye can be classed according to their wavelength bandwidths: S cones, M cones, and L cones.
• These are roughly Red (L = long), Green (M = medium), and Blue (S = short) sensitive cone cells.

Tristimulus Theory
• Given the known cone sensitivities, it was long thought that the eye-brain system separately processed Red, Green, and Blue channels.
• However, RGB is not bandwidth efficient (each color uses the same bandwidth).
• The three RGB channels contain highly redundant information. The brain (and modern video processing systems) exploit these redundancies in many ways.
• One way is through "color opponency."

Photopic and Scotopic Vision
[Figure: photopic (normal daylight) and scotopic (normal nighttime) wavelength sensitivities.]
• However, the rods do not distinguish wavelengths into color channels.

Color Opponent Theory
• The visual system processes luminance (brightness) and color separately. A practical opponent color space model approximates this. One simple one is:
Y = luminance = aR + bG + cB (a + b + c = 1)
U = K1·[blue (B) - luminance (Y)] = K1(B - Y)
V = K2·[red (R) - luminance (Y)] = K2(R - Y)
• This exploits redundancies, since the difference signals have much smaller entropies (they cluster more tightly around the origin) and hence are more compressible.

Analog Color Video
• YUV is an older analog color space. It defines brightness in terms of wavelength sensitivity. Here is a common definition:
Luminance: Y = 0.299R + 0.587G + 0.114B
Chrominance: U = -0.147R - 0.289G + 0.436B = 0.492(B - Y)
V = 0.615R - 0.515G - 0.100B = 0.877(R - Y)
• Ideas:
- Structural info is largely carried by luminance.
- Cones have the highest sensitivity to G, then R, then B wavelengths - for evolutionary reasons.
- Redundancy between luminance and chrominance is exploited by differencing them.
- Notice that U = V = 0 when R = G = B.
[Figure: RGB / YUV examples: an RGB image and its Y, U, and V components.]

Digital Color Images
• YCrCb is the modern color space used for digital images and videos.
• It has a similar but simpler definition:
Y = 0.299R + 0.587G + 0.114B
Cr = R - Y
Cb = B - Y
• It is used in modern image and video codecs like JPEG and H.264.
• Often the terms "YUV" and "YCrCb" are used interchangeably.
• Why use YUV / YCrCb? Reduced bandwidth: chrominance info can be sent in a fraction of the bandwidth of luminance info.
• In addition to color information being "lower bandwidth," the chrominance components are entropy reduced. (A conversion sketch appears at the end of this color discussion.)

Color Constancy
• A property of visual perception that ensures that the perceived color of objects remains relatively constant under varying illumination conditions.
• For example, an apple looks red (or green) both in the white light of mid-day and in the redder light of evening.
• This helps with the recognition of objects.
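Returning to the color-space formulas above, here is a small sketch using the analog-YUV coefficients quoted in the notes (the function name is mine; production codecs apply further scaling and offsets):

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Luminance/chrominance conversion with the YUV weights above.
    rgb: float array of shape (..., 3)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return np.stack([y, u, v], axis=-1)

# Gray inputs (R = G = B) give U = V = 0, as the notes point out.
print(rgb_to_yuv(np.array([0.5, 0.5, 0.5])))
```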
Color Display
• Example: the XO-1, the "$100 Laptop" (One Laptop Per Child).
• Basic idea: each approximately square, rectangular, or ovoid pixel is composed of three neighboring R, G, and B "sub-pixels."

Visual Illusions
• Visual illusions are excellent probes into how we see.
• They reveal much about the eye and how vision adapts, and finding where vision "goes wrong" often confirms or explains models of vision.
• They are also great reminders that "what we see is not reality."
• We will be seeing visual illusions throughout the course.

Color Illusions
• Remember the Internet sensation? Is the middle dress gold and white, or is it blue and black?
• Here's the same dress under other lighting conditions.
• This happens because of "color constancy": the vision system tries to see a color the same way under different lighting.

Digital Image Representation
• Once an image is digitized, it is an array of voltages or magnetic potentials.
• Algorithms access a representation that is a matrix of numbers - usually integers, but possibly float or complex.

Image Notation
• Denote an image matrix
I = [I(i, j); 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1]
where (i, j) = (row, column) and I(i, j) = image value at (i, j):
I = [ I(0, 0)    I(0, 1)    ···  I(0, M-1)
      I(1, 0)    I(1, 1)    ···  I(1, M-1)
      ···
      I(N-1, 0)  I(N-1, 1)  ···  I(N-1, M-1) ]

Common Image Formats
• JPEG (Joint Photographic Experts Group) images are compressed with loss - see Module 7. All digital cameras today have the option to save images in JPEG format. File extension: image.jpg
• TIFF (Tagged Image File Format) images can be lossless (LZW compressed) or compressed with loss. Widely used in the printing industry and supported by many image processing programs. File extension: image.tif
• GIF (Graphic Interchange Format) is an old but still-common format, limited to 256 colors, using LZW compression. File extension: image.gif
• PNG (Portable Network Graphics) is the successor to GIF. Supports true color (16 million colors). Now widely supported. File extension: image.png
• BMP (bit-mapped) format is used internally by Microsoft Windows. Not compressed. Widely accepted. File extension: image.bmp

Perceptrons

The Perceptron
• The first "neural network."
• It is a binary classifier (it outputs '0' or '1'), i.e., it makes decisions.
• Decisions are computed in two steps:
- a linear combination of input values
- a nonlinear thresholding or 'activation' function

Perceptron: Linear Step
• Given an image I = {I(i, j); 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1}, define any subset i ⊆ I, which we will vectorize:
i = {i(p); 1 ≤ p ≤ P}
• This could be the entire image, or a block, or a segmented region, for example.

Perceptron: Linear Step
• Define a set of weights w = {w(p); 1 ≤ p ≤ P}; then the linear step is simply the inner product
wᵀi = Σ_{p=1..P} w(p)·i(p)
• We could have done this with 2-D weights W:
WᵀI = Σ_{m=0..M-1} Σ_{p=0..P-1} W(m, p)·I(m, p)
but the 1-D version takes less space, and the original ordering is implicit in the 1-D vector.

Perceptron: Nonlinear Step
• Given a bias value b, the nonlinear binary thresholding function
f(i) = sign(wᵀi + b) = +1 if wᵀi + b > 0; -1 if wᵀi + b ≤ 0
is an example of an activation function. The activation function sign(·) is called the signum function.
• This is a model of a single neuron, which is only activated by a strong enough input. (A sketch follows.)
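The perceptron's two steps are one inner product and one comparison. A minimal sketch (the name is mine, using the ±1 output convention above):

```python
import numpy as np

def perceptron(i_vec, w, b):
    """f(i) = sign(w.i + b): +1 if w.i + b > 0, else -1."""
    return 1 if np.dot(w, i_vec) + b > 0 else -1
```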
Perceptron: Training
• We need a set of Q training sample images with labels or targets:
T = {(i₁, t₁), (i₂, t₂), ..., (i_Q, t_Q)}
• The training (sub)images i_q are labeled with correct decisions/labels t_q.
• Given weights w, there is no training error if and only if
t_q·(wᵀi_q + b) > 0
for all q = 1, 2, ..., Q.

Perceptron: Optimization
• Goal: find weights that minimize the loss function
L(w, b) = -Σ_{q=1..Q} t_q·(wᵀi_q + b)
• It turns out that this can be driven to zero only if the training set is linearly separable:
[Figure: two labeled classes, one linearly separable (by a line or hyperplane), one linearly inseparable.]

Gradient of Cost
• The gradient is the vector of partial derivatives
∇L(w, b) = [∂L/∂w(1), ∂L/∂w(2), ..., ∂L/∂w(P)]ᵀ
• We will also use ∂L(w, b)/∂b.
• Basic idea: optimize the perceptron by iteratively following the gradient down to a (possible) minimum.

Gradient Descent
• The optimal weights w* must satisfy ∇L(w*, b) = 0, and the Hessian matrix of second partial derivatives
H_L(w, b) = [∂²L/∂w(m)∂w(n); 1 ≤ m, n ≤ P]
at w* must be positive definite:
dᵀ·H_L(w*, b)·d > 0 for any vector d ∈ Rᴾ

Gradient Descent
• Gradient descent is a simple algorithm that iteratively moves the current solution in the direction of the negative gradient:
w⁽ⁿ⁺¹⁾ = w⁽ⁿ⁾ - γ·∇L(w⁽ⁿ⁾)
b⁽ⁿ⁺¹⁾ = b⁽ⁿ⁾ - γ·∂L(w⁽ⁿ⁾, b)/∂b
or (exercise!)
w⁽ⁿ⁺¹⁾ = w⁽ⁿ⁾ + γ·t_q·i_q
b⁽ⁿ⁺¹⁾ = b⁽ⁿ⁾ + γ·t_q, for q = 1, 2, ..., Q
given an initial guess (say w⁽⁰⁾ = 0, or random).
• Here γ > 0 is the learning rate. If it is too big, the iteration might not converge; if too small, it might converge very slowly.

Problems with the Perceptron
• First, there is no lower bound to the loss function! This makes numerical solution hard.
• Rosenblatt's solution: only iterate on those training images whose solution condition is violated at each step:
w⁽ⁿ⁺¹⁾ = w⁽ⁿ⁾ + γ·t_q'·i_q'
b⁽ⁿ⁺¹⁾ = b⁽ⁿ⁾ + γ·t_q'
for those q' = 1, 2, ..., Q such that t_q'·(wᵀi_q' + b) ≤ 0.
• This does converge, but only if the training set is separable - which is rare, especially on hard problems!

Can Converge Anywhere in "the Margin"
• The solution converged to is not necessarily unique!
• It can lie on any line (hyperplane) separating the linearly separable classes.

Training Iteration
• Iterate by repeatedly applying every image in the training set (one "epoch"):
f⁽ⁿ⁾(i_q) = sign(w⁽ⁿ⁾ᵀi_q + b⁽ⁿ⁾)
• Update the weights w on each training sample.
• Standard gradient descent: iterate through i₁, i₂, ..., i_Q in indexed order each epoch.
• Stochastic gradient descent: randomly re-order the training set {i_q} after every epoch.
• Iterate a fixed number of epochs, or alternately until the error between f⁽ⁿ⁾(i_q) and t_q is small enough (for some norm such as MSE).

Perceptron Diagram
• Given a (vectorized) image or image piece i = {i(p); 1 ≤ p ≤ P}, a bias b, and a trained set of weights w = {w(p); 1 ≤ p ≤ P}:
[Figure: inputs i(1), ..., i(P) in a passive input layer, weights w(1), ..., w(P), a bias -b, and an output layer that sums and applies the activation function to produce the output y.]

Perceptron Diagram
• It is convenient to redraw the diagram with a modified unit bias:
[Figure: the input layer includes a constant (unit) bias input; the output layer performs the sum and the activation function.]
• This is very common notation for general neural networks.
• The -b can be absorbed into the other weights by normalization (more later).
• Sometimes the bias is drawn as another (unit) input. (A training-loop sketch follows.)
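Rosenblatt's update rule above is a few lines of numpy. A didactic sketch (the hyperparameters and names are mine, and convergence is only guaranteed for separable data):

```python
import numpy as np

def train_perceptron(samples, labels, gamma=0.1, epochs=100, seed=0):
    """Update only on samples that violate t_q * (w.i_q + b) > 0.

    samples: (Q, P) array of vectorized images; labels: array of +/-1.
    """
    rng = np.random.default_rng(seed)
    Q, P = samples.shape
    w, b = np.zeros(P), 0.0
    for _ in range(epochs):
        order = rng.permutation(Q)      # stochastic variant: reshuffle
        for q in order:
            if labels[q] * (samples[q] @ w + b) <= 0:   # violated
                w += gamma * labels[q] * samples[q]
                b += gamma * labels[q]
    return w, b
```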
Simplified Perceptron Diagram
• If the activation function is understood (and we will be considering other ones!), we can simply draw the input layer (which includes a constant bias), the weights, and the output layer, whose sum (including the bias) and activation function produce the output y.
• This is very common notation for vanilla neural networks.

Image Data
• Let's be clear: perceptrons were not used for image processing!
• There was certainly no feeding of pixels into perceptrons to do anything useful.
• There weren't even many digital images.
• Also, the computation would have been far beyond formidable back then.
• HOWEVER, at some point we will be feeding pixels into learners, so it is good to cast things that way.

Promise of Perceptrons?
• In 1958, pioneer Frank Rosenblatt said: "the perceptron is the embryo of an electronic computer that will ... be able to walk, talk, see, write, reproduce itself, and be conscious of its existence."
• There were (and are) significant problems with perceptrons.
• But today, it is starting to look like he was right!

Failure of Perceptrons
• In the 1960s it was found that perceptrons did not have much ability to represent even simple functions or data patterns.
• Marvin Minsky and Seymour Papert showed that a perceptron could not even represent the XOR function.
• Perceptrons were never really used for image analysis (there were no "digital" images back then, and the task was far too complex).
• Interest languished for a decade or more; thus ended the "first wave" of neural networks.

Comments on Perceptrons
• We will discuss concepts like validation sets and test sets as we study more complex networks.
• Perceptrons were the first neural networks. They were the first feed-forward neural networks (the only kind we'll study here).
• With modifications, they are the basis of today's modern deep convolutional networks ("ConvNets" or CNNs).
• Things changed when researchers began to layer the networks.

Comments
• With this broad overview of topics related to image processing in hand, and a start on neural networks, we can proceed... onward to Module 2.

Module 2: The Basics: Binary Images & Point Operations
Binary Images • Binary Morphology • Point Operations • Histogram Equalization • Multilayer Perceptrons

Binary Images
• A digital image is an array of numbers: sampled image intensities (e.g., a 10 x 10 array of rows and columns).
• Each gray level is quantized: assigned one of a finite set of numbers [0, ..., K-1].
• There are K = 2^B possible gray levels, each represented by B bits.
• Binary images have B = 1.

Binary Images
• In binary images, the (logical) values '0' and '1' often indicate the absence/presence of an image property in an associated gray-level image:
- high vs. low intensity (brightness)
- presence vs. absence of an object
- presence vs. absence of a property
• Example: presence of fingerprint ridges.

Gray-Level Thresholding
• Often gray-level images are converted to binary images.
• Advantages:
- a B-fold reduction in required storage
- a simple abstraction of information
- fast processing with logical operators

Binary Image Information
• Artists have long understood that binary images contain much information: form, structure, shape, etc. ("Don Quixote" by Pablo Picasso.)

Simple Thresholding
• The simplest image processing operation.
• An extreme form of quantization.
• It requires the definition of a threshold T (that falls in the gray-scale range).
• Every pixel intensity is compared to T, and a binary decision is rendered.
Simple Thresholding
• Suppose gray-level image I has K gray levels: 0, 1, ..., K-1.
• Select a threshold 0 < T < K-1.
• Compare every gray level in I to T.
• Define a new binary image J as follows:
J(i, j) = '0' if I(i, j) ≥ T
J(i, j) = '1' if I(i, j) < T

Threshold Selection
• The quality of the binary image J obtained by thresholding I depends heavily on the threshold T.
• Different thresholds may give different valuable abstractions of the image - or not!
• How does one decide if thresholding is possible? How does one decide on a threshold T?

Gray-Level Image Histogram
• The histogram H_I of image I is a graph of the gray-level frequency of occurrence.
• H_I is a one-dimensional function with domain 0, ..., K-1.
• H_I(k) = n if I contains exactly n occurrences of gray level k, for k = 0, ..., K-1.

Histogram Appearance
• The appearance of a histogram suggests much about the image: a histogram concentrated at low gray levels indicates a predominantly dark image; one concentrated at high gray levels indicates a predominantly light image.
• These could be the histograms of underexposed and overexposed images, respectively.
• A histogram spread across the range may show better use of the gray-scale range. (HISTOGRAM DEMO)

Bimodal Histogram
• Thresholding usually works best when there are dark objects on a light background, or light objects on a dark background.
• Images of this type tend to have histograms with distinct peaks or modes.
• If the peaks are well-separated, threshold selection is easier.
• Where to set the threshold when the bimodal histogram's peaks are poorly separated versus well separated?

Threshold Selection from Histogram
• Placing the threshold T between modes may yield acceptable results.
• Exactly where in between can be difficult to determine.

Multi-Modal Histogram
• The histogram may have multiple modes. Varying T will give very different results.

Flat Histogram
• The histogram may be "flat," making threshold selection difficult. (Thresholding DEMO)

Discussion of Histogram Types
• We'll use the histogram for gray-level processing. Some general observations:
- Bimodal histograms often imply objects and background of different average brightness, which are easier to threshold.
- The ideal result is a simple binary image showing object/background separation, e.g., printed type, blood cells in solution, machine parts.
• The most-used method is Otsu's, based on maximizing the separability of the object and background brightness classes. But it gives errors as much as other methods.

Histogram Types
• Multi-modal histograms often occur in images of multiple objects of different average brightness.
• "Flat" or level histograms imply more complex images having detail, non-uniform backgrounds, etc.
• Thresholding rarely gives perfect results. Usually, region correction must be applied. (A sketch of the histogram and thresholding operations follows.)
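The histogram and the thresholding rule above translate directly to numpy. This sketch (names mine) follows the convention above that '1' marks pixels below T:

```python
import numpy as np

def gray_histogram(I, K=256):
    """H_I(k) = number of occurrences of gray level k in I."""
    return np.bincount(I.ravel(), minlength=K)

def threshold(I, T):
    """J(i, j) = '1' if I(i, j) < T, '0' if I(i, j) >= T."""
    return (I < T).astype(np.uint8)
```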
Illusions of Shape, Size and Length
• The Kanizsa triangle. Which tabletop is bigger? Which lines are longer? Coming or going?

Binary Morphology
• A powerful but simple class of binary image operators.
• The general framework is called mathematical morphology (morphology = shape).
• These operators affect the shapes of objects and regions.
• All processing is done on a local basis.
• Morphological operators:
- expand (dilate) objects
- shrink (erode) objects
- smooth object boundaries and eliminate small regions or holes
- fill gaps and eliminate 'peninsulas'
• All of this is accomplished using local logical operations.

Structuring Elements or "Windows"
• A structuring element or window defines a geometric relationship between a pixel and its neighbors.

Windows
• Conceptually, a window is passed over the image and centered over each pixel along the way, usually row-by-row, column-by-column.
• A window is also called a structuring element.
• When it is centered over a pixel, a logical operation on the pixels in the window gives a binary output.
• Usually a window is approximately circular, so that object/image rotation won't affect processing.

Formal Definition of Window
• A window is a way of collecting a set of neighbors of a pixel according to a geometric rule.
• Some typical 1-D (row, column) windows: ROW(3), COL(3), ROW(5), COL(5); in general, ROW(2M+1) and COL(2M+1).
• Windows almost always cover an odd number of pixels, 2M+1: pairs of neighbors plus the center pixel. Then filtering operations are symmetric.
• Some 2-D windows: SQUARE(9), SQUARE(25), CROSS(5), CROSS(9), CIRC(13); in general, SQUARE(2P+1), CROSS(2P+1), CIRC(2P+1).

Window Notation
• Formally, a window B is a set of coordinate shifts B_i = (p_i, q_i) centered around (0, 0):
B = {B₁, ..., B_{2P+1}} = {(p₁, q₁), ..., (p_{2P+1}, q_{2P+1})}
• Examples - 1-D windows:
B = ROW(2P+1) = {(0, -P), ..., (0, P)}
B = COL(2P+1) = {(-P, 0), ..., (P, 0)}
For example, B = ROW(3) = {(0, -1), (0, 0), (0, 1)}.

2-D Window Notation
B = SQUARE(9) = {(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0), (0, 1), (1, -1), (1, 0), (1, 1)}
B = CROSS(2P+1) = ROW(2P+1) ∪ COL(2P+1)
For example, B = CROSS(5) = {(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)}.

The Windowed Set
• Given an image I and a window B, define the windowed set at (i, j) by
B_I(i, j) = {I(i-p, j-q); (p, q) ∈ B}
the pixels covered by B when centered at (i, j).
• This formal definition of a simple concept will enable us to make simple and flexible definitions of binary filters.
• B = ROW(3): B_I(i, j) = {I(i, j-1), I(i, j), I(i, j+1)}
• B = COL(3): B_I(i, j) = {I(i-1, j), I(i, j), I(i+1, j)}
• B = SQUARE(9): B_I(i, j) = {I(i-1, j-1), I(i-1, j), I(i-1, j+1), I(i, j-1), I(i, j), I(i, j+1), I(i+1, j-1), I(i+1, j), I(i+1, j+1)}
• B = CROSS(5): B_I(i, j) = {I(i-1, j), I(i, j-1), I(i, j), I(i, j+1), I(i+1, j)}

General Binary Filter
• Denote a binary operation G on the windowed set B_I(i, j) by
J(i, j) = G{B_I(i, j)} = G{I(i-p, j-q); (p, q) ∈ B}
• Perform this at every pixel in the image, giving the filtered image
J = G[I, B] = [J(i, j); 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1]

Edge-of-Image Processing
• What if a window overlaps "empty space"? Our convention: fill the "empty" window slots with the nearest image pixel. This is called replication.

Dilation and Erosion Filters
• Given a window B and a binary image I:
J = DILATE(I, B) if J(i, j) = OR{B_I(i, j)} = OR{I(i-p, j-q); (p, q) ∈ B}
• Given a window B and a binary image I:
J = ERODE(I, B) if J(i, j) = AND{B_I(i, j)} = AND{I(i-p, j-q); (p, q) ∈ B}

Dilation and Erosion
• DILATION increases the size of logical '1' (usually black) objects.
• EROSION decreases the size of logical '1' objects.
(A sketch of both filters follows.)
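For binary arrays, OR and AND over the windowed set are just max and min. A minimal sketch (names mine), with replication at the image borders per the convention above:

```python
import numpy as np

def window_stack(I, B):
    """Collect the windowed set B_I(i, j) for every pixel, replicating
    border pixels. B is a list of (p, q) coordinate shifts."""
    pad = max(max(abs(p), abs(q)) for p, q in B)
    Ip = np.pad(I, pad, mode="edge")
    N, M = I.shape
    return np.stack([Ip[pad - p : pad - p + N, pad - q : pad - q + M]
                     for p, q in B])

def dilate(I, B):
    return window_stack(I, B).max(axis=0)   # OR over the window

def erode(I, B):
    return window_stack(I, B).min(axis=0)   # AND over the window

CROSS5 = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]
```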
[Figure: examples of local DILATION (OR) and EROSION (AND) computations, showing which logical-'1' pixels are changed by each.]

Interpreting Dilation & Erosion
• Global interpretation of DILATION: it is useful to think of the structuring element as rolling along all of the boundaries of all BLACK objects in the image. The center point of the structuring element traces out a set of paths, which form the boundaries of the dilated image.
• Global interpretation of EROSION: it is useful to think of the structuring element as rolling inside the boundaries of all BLACK objects in the image. The center point of the structuring element traces out a set of paths, which form the boundaries of the eroded image. (EROSION & DILATION DEMO)

Qualitative Properties of Dilation & Erosion
• Dilation removes holes of too-small size and gaps or bays of too-narrow width.
• Erosion removes objects of too-small size and peninsulas of too-narrow width.

Majority or "Median" Filter
• Given a window B and a binary image I:
J = MAJORITY(I, B) if J(i, j) = MAJ{B_I(i, j)} = MAJ{I(i-p, j-q); (p, q) ∈ B}
• It has attributes of both dilation and erosion, but doesn't change the sizes of objects much.

Majority/Median Filter
[Figure: the majority filter removed a small object A and a small hole B, but did not change the boundary (size) of the larger region C.] (Majority Filter DEMO)

Qualitative Properties of Majority
• The majority (median) filter removes both objects and holes of too-small size, as well as both gaps (bays) and peninsulas of too-narrow width.

3-D Majority Filter Example
• The following example is a 3-D Laser Scanning Confocal Microscope (LSCM) image (binarized) of a pollen grain, at magnification 200x.
• Examples of 3-D windows: CUBE(125) and CROSS3-D(13).
[Images: the LSCM image of the pollen grain, and the result of filtering with a CUBE(125) binary majority filter.]
• The filtered grain could be 3-D printed... and we did! It was one of the first 3-D printing jobs ever, back in the late 1980s, when the process was called "selective laser sintering."

OPEN and CLOSE
• Define new morphological operations by performing the basic ones in sequence.
• Given an image I and window B, define
OPEN(I, B) = DILATE[ERODE(I, B), B]
CLOSE(I, B) = ERODE[DILATE(I, B), B]

OPEN and CLOSE
• In other words:
OPEN = erosion (by B) followed by dilation (by B)
CLOSE = dilation (by B) followed by erosion (by B)
• OPEN and CLOSE are very similar to MEDIAN:
- OPEN removes too-small objects/fingers (better than MEDIAN), but not holes, gaps, or bays.
- CLOSE removes too-small holes/gaps (better than MEDIAN), but not objects or peninsulas.
- OPEN and CLOSE generally do not affect object size.
- Thus OPEN and CLOSE are highly biased smoothers.
(OPEN and CLOSE DEMO)

OPEN-CLOSE and CLOSE-OPEN
• Effective smoothers are obtained by sequencing OPEN and CLOSE:
OPEN-CLOSE(I, B) = OPEN[CLOSE(I, B), B]
CLOSE-OPEN(I, B) = CLOSE[OPEN(I, B), B]
• These operations are similar (but not identical). They are only slightly biased towards '1' or '0'. (Sketches of these compositions follow.)
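Building on the dilate/erode sketch above, the sequenced operators and the majority filter are one-liners (again my own illustration):

```python
def open_(I, B):
    return dilate(erode(I, B), B)       # erosion, then dilation

def close_(I, B):
    return erode(dilate(I, B), B)       # dilation, then erosion

def close_open(I, B):
    return close_(open_(I, B), B)       # nearly unbiased smoother

def majority(I, B):
    S = window_stack(I, B)              # from the earlier sketch
    return (2 * S.sum(axis=0) > S.shape[0]).astype(np.uint8)
```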
Application Example: Measuring Cell Area
A simple task, using a conventional optical microscope:
(i) Find the general cell region by thresholding.
(ii) Apply region correction (CLOSE-OPEN).
(iii) Display the cell boundary for operator verification.
(iv) Compute the image cell area by counting pixels.
(v) Compute the true cell area via perspective projection.

Cell Area Measurement Examples #1 and #2
(a) Cellular mass (b) Thresholded (c) Region corrected (d) Computed boundary overlaid

Comments
• Many things can be accomplished with binarized images, since shape is often well-preserved.
• However, gray scales are also important. Next, we will deal with gray scales - but not shape.

Gray-Level Point Operations

Brightness Illusions
• Mach bands. Count the black dots. Afterimages.
• Squares A and B are identical! (Ed Adelson)
• Let's play... do you want the white or black pieces?

Simple Histogram Operations
• Recall: the gray-level histogram H_I of an image I is a graph of the frequency of occurrence of each gray level in I.
• H_I is a one-dimensional function with domain 0, ..., K-1:
H_I(k) = n if gray level k occurs (exactly) n times in I, for each k = 0, ..., K-1.
• H_I contains no spatial information - only the relative frequency of intensities.
• Much useful information is obtainable from H_I, such as the average brightness:
L_AVE(I) = (1/NM)·Σ_{i=0..N-1} Σ_{j=0..M-1} I(i, j) = (1/NM)·Σ_{k=0..K-1} k·H_I(k)
• Image quality is affected (enhanced, modified) by altering H_I.

Average Brightness
• Examining the histogram can reveal possible errors in the imaging process: underexposed (low L_AVE) or overexposed (high L_AVE).
• By operating on the histogram, such errors can be ameliorated.

Point Operations
• A point operation is a function f applied to single pixels of I:
J(i, j) = f[I(i, j)], 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1
• The same function f is applied at every (i, j).
• It does not use the neighbors of I(i, j), so point operations don't modify spatial relationships.
• They do change the histogram, and therefore the appearance of the image.

Linear Point Operations
• The simplest class of point operations. They offset and scale the image intensities.
• Suppose -(K-1) ≤ L ≤ K-1. An additive image offset is defined by
J(i, j) = I(i, j) + L
• Suppose P > 0. Image scaling is defined by
J(i, j) = P·I(i, j)

Image Offset
• If L > 0, then J is a brightened version of I. If L < 0, it is a dimmed version of I.
• The histogram is shifted by the amount L:
H_J(k) = H_I(k - L)
(DEMO)

Image Scaling
• J(i, j) = P·I(i, j)
• If P > 1, the intensity range is widened. If P < 1, the intensity range is narrowed.
• Multiplying by P stretches or compresses the image histogram by a factor P:
H_J(k) = H_I(k/P) (continuous)
H_J(k) = H_I[INT(k/P)] (discrete)
(DEMO)
• An image with a compressed gray-level range generally has reduced visibility - a washed-out appearance (and vice-versa).

Linear Point Operations: Offset & Scaling
• Given reals L and P, a linear point operation on I is a function
J(i, j) = P·I(i, j) + L
comprising both offset and scaling.
• If P < 0, the histogram is reversed, creating a negative image. Usually P = -1, L = K-1:
J(i, j) = (K-1) - I(i, j)
(Digital Negative DEMO; a sketch follows.)
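A sketch of the general linear point operation, clipped to the gray-scale range (the clipping is my addition; the notes assume the result stays in range):

```python
import numpy as np

def linear_point_op(I, P=1.0, L=0.0, K=256):
    """J = P*I + L, clipped to {0, ..., K-1}. P < 0 reverses the
    histogram; P = -1, L = K-1 gives the digital negative."""
    J = P * I.astype(float) + L
    return np.clip(J, 0, K - 1).astype(np.uint8)

def negative(I, K=256):
    return linear_point_op(I, P=-1.0, L=K - 1.0, K=K)
```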
Suppose I has a compressed histogram: 0A B K-1 • Let A and B be the min and max gray levels in I. Define J(i, j) = P·I(i, j) + L such that PA+L = 0 and PB + L = (K-1). 188 Full-Scale Contrast Stretch • Solving these 2 equations in 2 unknowns yields: or K-1 K-1 P = and L = -A B-A B-A J(i, j) = (K-1) [I(i, j) - A] / (B - A) • The result is an image J with a full-range histogram: FSCS (DEMO) 189 0 K-1 Nonlinear Point Operations • Now consider nonlinear point functions f J(i, j) = f[I(i, j)]. • A very broad class of functions! • Commonly used: J(i, j) = |I(i, j)| J(i, j) = [I(i, j)]2 J(i, j) = [I(i, j)]1/2 J(i, j) = log[1+I(i, j)] J(i, j) = exp[I(i, j)] = eI(i,j) (magnitude) (square-law) (square root) (logarithm) (exponential) • Most of these are special-purpose, for example… 190 Logarithmic Range Compression • Small groupings of very bright pixels may dominate the perception of an image at the expense of other rich information that is less bright and less visible. • Astronomical images of faint nebulae and galaxies with dominating stars are an excellent example. 191 The Rosette Nebula 192 Logarithmic Range Compression • Logarithmic transformation J(i, j) = log[1+I(i, j)] nonlinearly compresses and equalizes the gray-scales. • Bright intensities are compressed much more heavily - thus faint details emerge. 193 Logarithmic Range Compression • A full-scale contrast stretch then utilizes the full gray-scale range: 0 typical histogram K-1 0 K-1 logarithmic transformation 0 K-1 stretched contrast 194 Contrast Stretched Rosette 195 Rosette in color 196 Gamma Correction • Monitors that display images and videos often have a nonlinear response. • Commonly an exponential nonlinearity Display(i, j) = [I(i, j)] • Gamma correction is (digital) preprocessing to correct the nonlinearity: J(i, j) = [I(i, j)]1/ • Then Display(i, j) = [J(i, j)] = I(i, j) 197 Gamma Correction • For a CRT (e.g., analog NTSC TV), typically = 2.2 • This is accomplished by mapping all luminances (or chrominances) to a [0, 1] range. • Hence black (0) and white (1) are unaffected. • Plasma and LCD have linear characteristics, hence do not need gamma correction. But many devices that feed them still gamma correct, hence reverse nonlinearity often needed. 198 Gamma Correction 199 Histogram Distribution • An image with a flat histogram makes rich use of the available gray-scale range. This might be an image with - Smooth changes in intensity across many gray levels - Lots of texture covering many gray levels • We can obtain an image with an approximately flat histogram using nonlinear point operations. 200 Normalized Histogram • Define the normalized histogram: pI (k) = 1 H I (k) ; k = 0 ,..., K-1 MN • These values sum to one: K-1 pI (k) = 1 k=0 • Note that pI(k) is the probability that gray-level k will occur (at any given coordinate). 201 Cumulative Histogram • The cumulative histogram is PI (r) = r pI (k) ; r = 0 ,..., K-1 k=0 which is non-decreasing; also, PI(K-1) = 1. • Probabilistic interpretation: at any (i, j): PI(r) = Pr{I(i, j) ≤ r} pI(r) = PI(r) - PI(r-1) ; r = 0,..., K-1 202 Continuous Histograms • Suppose p(x) and P(x) are continuous: can regard as probability density (pdf) and cumulative distribution (cdf). • Then p(x) = dP(x)/dx. • We’ll describe histogram flattening for the continuous case, then extend to discrete case. 203 Continuous Equalization • Transform (continuous) I, p(x), P(x) into image K with flat or equalized histogram. 
Continuous Histograms
• Suppose p(x) and P(x) are continuous: we can regard them as a probability density function (pdf) and cumulative distribution function (cdf).
• Then p(x) = dP(x)/dx.
• We'll describe histogram flattening for the continuous case, then extend it to the discrete case.

Continuous Equalization
• Transform the (continuous) image I, with histogram p(x) and cumulative histogram P(x), into an image K with a flat or equalized histogram.
• The following image will have a flattened histogram with range [0, 1]:
J = P(I) (that is, J(i, j) = P[I(i, j)] for all (i, j))

Continuous Flattening
• Reason: the cumulative histogram Q of J is (at any pixel (i, j)):
Q(x) = Pr{J ≤ x} = Pr{P(I) ≤ x} = Pr{I ≤ P⁻¹(x)} = P[P⁻¹(x)] = x
hence q(x) = dQ(x)/dx = 1 for 0 < x < 1.
• Finally, K = FSCS(J).

Discrete Histogram Flattening
• To approximately flatten the histogram of the digital image I:
• Define the cumulative histogram image J = P_I(I), so that J(i, j) = P_I[I(i, j)].
• This is the cumulative histogram evaluated at the gray level of the pixel (i, j).
• Note that 0 ≤ J(i, j) ≤ 1.
• The elements of J are approximately linearly distributed between 0 and 1.
• Finally, let K = FSCS(J), yielding the histogram-flattened image.

Histogram Flattening Example
• Given a 4 x 4 image I with gray-level range {0, ..., 15} (K-1 = 15):
I = [ 1 2 8 4
      1 5 1 5
      3 3 8 3
      4 2 2 11 ]
• The histogram (zero elsewhere):
H(1) = 3, H(2) = 3, H(3) = 3, H(4) = 2, H(5) = 2, H(8) = 2, H(11) = 1
• The normalized histogram (zero elsewhere):
p(1) = 3/16, p(2) = 3/16, p(3) = 3/16, p(4) = 2/16, p(5) = 2/16, p(8) = 2/16, p(11) = 1/16
• The intermediate image J is computed, followed by the "flattened" image K (after rounding/FSCS):
J = [ 3/16  6/16  15/16 11/16
      3/16  13/16 3/16  13/16
      9/16  9/16  15/16 9/16
      11/16 6/16  6/16  16/16 ]
K = [ 0  3  14 9
      0  12 0  12
      7  7  14 7
      9  3  3  15 ]
• The new, flattened histogram (zero elsewhere):
H(0) = 3, H(3) = 3, H(7) = 3, H(9) = 2, H(12) = 2, H(14) = 2, H(15) = 1
(Histogram Flattening DEMO)
• The heights H(k) cannot be reduced - only stacked.
• Digital histogram flattening doesn't really "flatten" - it just spreads out the histogram, making it more flat.
• The spaces that appear are characteristic of "flattened" histograms, especially when the original histogram is highly compressed. (A sketch follows.)
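Putting the pieces together, a sketch of discrete flattening (it reuses cumulative_histogram from the earlier sketch; applied to the 4 x 4 example above with K = 16, it reproduces the image K):

```python
import numpy as np

def flatten_histogram(I, K=256):
    """J = P_I(I), followed by a full-scale contrast stretch."""
    P_I = cumulative_histogram(I, K)
    J = P_I[I]                        # J(i, j) = P_I[I(i, j)], in (0, 1]
    A, B = J.min(), J.max()
    return np.round((K - 1) * (J - A) / (B - A)).astype(np.uint8)
```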
Multilayer Perceptrons

Multilayer Perceptrons (MLPs)
• Neural networks began to fulfill their promise with MLPs, often called "vanilla" or "artificial" neural networks.
• They operate by layering perceptrons in a feed-forward manner.
• They can learn to represent data that is not linearly separable, hence also complex patterns, including those in images.
• However, not (yet) by feeding pixels directly into MLPs.
• They use more powerful and diverse activation functions.
• They can be efficiently trained via a method called backpropagation.

Multilayer Perceptrons
• Add a layer between the input layer and the output layer. It will also consist of a summation and an activation function.
• The new "hidden layer" (invisible to input and output) can:
- potentially receive all inputs
- potentially feed all output-layer nodes
• We will also use activation functions other than the signum function, having better properties, like differentiability.
• We can easily have more outputs.

Activation Functions
• The signum function f(x) = sign(x) is discontinuous. Other functions are possible that can limit or rectify neural responses.
• They are not binary, which affords greater freedom (for regression, instead of just classification).
• The two most popular activation functions used with ANNs (before the "deep" era) were the tanh and logistic functions, which are sigmoid (s-shaped) functions:
f(x) = tanh(x) = (e^x - e^-x)/(e^x + e^-x)
f(x) = logistic(x) = 1/(1 + e^-x)
• Both limit and are differentiable.

Multilayer Perceptron Diagram
• This example has two outputs and is fully connected: every node in a layer feeds every node in the next layer.
[Figure: inputs i(1), ..., i(P) feed a hidden layer with weights w1, which feeds an output layer with weights w2, producing the outputs y.]
• Each large node is of the form f(Σ_{r=1..R} w_q(r)·i_q(r)), where i_q are the R inputs of a qth-layer node with weights w_q and activation function f.

Still Too Much Computation
• ... to operate on pixels directly (back then).
• Huge computation:
- P² weights in w2
- only 2P in w1, but there can be as many outputs as pixels.
• A 1024 x 1024 image (P = 2^20) implies 2^40 weights in w2!*
• This is clearly untenable. The number of nodes, or connections (or both), must somehow be greatly limited.
• Using small images obviously helps. A 128 x 128 image (P = 2^14) implies P² = 2^28 weights in w2. But that's not the answer either...
*Note: 2^40 ≈ 1 trillion, while 2^28 ≈ 250 million.

Features
• Instead, highly informative features would be extracted from images, and the networks trained on many images' features.
• These are vastly fewer than the number of pixels: often just single digits, or dozens.
• They could be simple image statistics, Fourier features, wavelet/bandpass filter data, regional colors, "busyness," and so on. VAST varieties have been used.
• We'll talk about MANY later: SIFT, SURF, LBP, etc.
• These days they are typically called "handcrafted," as opposed to "learned." Indeed, "handcrafted" is now somewhat of a pejorative in the ML community!

Example Feature-Driven MLP: Day vs. Night Detector
• Task: determine whether an image was taken in the daytime or at night.
• The following intuitive features might be extracted from each training image:
- average luminance (brightness) Lave of the image
- standard deviation of luminance Ldev of the image
- color saturation Csat of the image (how widely distributed the color is)
- maximum luminance Lmax of the image
• Are these the "right" features? They make sense, but who knows?
• Normalization: usually each feature is normalized to [0, 1] (e.g.) so that no feature has an outsized effect (unless that's desired).

Day vs. Night Detector Network
• A small fully connected network: every node in a layer feeds every node in the next layer. The output can be continuous or thresholded.
[Figure: the four features Lave, Ldev, Csat, Lmax form the input vector (still called i), feeding weight layers of sizes [4, 5] (w3), [5, 3] (w2), and [3, 1] (w1), producing the output y: the "probability" of daytime.]
[Figure: plots of tanh(x), which ranges from -1 to 1, and logistic(x), which ranges from 0 to 1.]

Training by Backpropagation
• This was perhaps the greatest advance in the history of neural networks - except for the Perceptron, of course.
• "Backprop" is short for "backward propagation of errors."
• It means the weights w1, ..., wL of all layers are adjusted to minimize the training error w.r.t. the known training labels (day/night, or whatever).
• It is convenient if the activation function f is differentiable:
- If f(x) = tanh(x), then f'(x) = 1 - tanh²(x)
- If f(x) = logistic(x), then f'(x) = e^(-x)·logistic²(x)

Training by Backpropagation
• Given training labels T = {t_p; 1 ≤ p ≤ P} of a neural network with outputs Z = {z_p; 1 ≤ p ≤ P}, form the MSE loss function*
E = (1/2)·Σ_{p=1..P} (z_p - t_p)²
• Then consider an arbitrary neuron anywhere in the network, indexed k, with output
y_k = f(Σ_{j=1..J} w_k(j)·y_j)
• The goal is to minimize the loss function E over all weights w_k = {w_k(j); 1 ≤ j ≤ J} by gradient descent.
*The ½ is just to cancel a later term. (The activation functions and their derivatives are sketched below.)
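The two classic activations and the derivatives quoted above, as code (the names are mine):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def dtanh(x):
    return 1.0 - np.tanh(x) ** 2           # f'(x) = 1 - tanh^2(x)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def dlogistic(x):
    return np.exp(-x) * logistic(x) ** 2   # f'(x) = e^(-x) logistic^2(x)
```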
Double Chain Rule
• Differentiate E w.r.t. each weight:
∂E/∂w_k(j) = (∂E/∂y_k)·(∂y_k/∂x_k)·(∂x_k/∂w_k(j))
where x_k = Σ_{i=1..J} w_k(i)·y_i and y_k = f(x_k).
• Note that
∂x_k/∂w_k(j) = y_j and ∂y_k/∂x_k = f'(x_k)
• If neuron k is in the output layer (y_k = z_k), then since E = (1/2)·Σ_{p=1..P} (z_p - t_p)²:
∂E/∂y_k = z_k - t_k

Hidden Layer Neuron
• Now let neuron j be at an arbitrary location. Then E is a function of the inputs x_a, x_b, ..., x_g of all neurons V = {a, b, ..., g} receiving input from j (in the next layer).
• Take the total derivative:
∂E/∂y_j = Σ_{v∈V} (∂E/∂x_v)·(∂x_v/∂y_j) = Σ_{v∈V} (∂E/∂y_v)·(∂y_v/∂x_v)·w_v(j)
• This is a recursion! The derivative w.r.t. y_j can be computed from the derivatives w.r.t. the outputs {y_v} of the next layer.

Putting It Together
• So finally
∂E/∂w_k(j) = δ_k·y_j
where
δ_k = f'(x_k)·(∂E/∂y_k)
with
∂E/∂y_k = y_k - t_k (output layer)
∂E/∂y_k = Σ_{v∈V} w_v(k)·δ_v (hidden layer)
• Backprop proceeds by gradient descent. For a learning rate γ,*
w_k(j) ← w_k(j) - γ·∂E/∂w_k(j) = w_k(j) - γ·δ_k·y_j
• We will see other loss functions and activation functions later.
*Picking γ can be a trial-and-error process!

Training, Validation, and Testing
• MLPs/ANNs are tools to construct algorithms that learn from data (training) and make predictions (testing) from the learned model.
• There is a Universal Approximation Theorem (Cybenko 1989, Hornik 1991) which states that even a feed-forward MLP with a single hidden layer can approximate any continuous function defined on a compact* subset of Rⁿ arbitrarily closely, provided that the activation function is a bounded, continuous function.
• This suggests the potential of MLPs to be well-trained to make numerical predictions, but it is no guarantee that an MLP is a good predictor!
• MLPs must be trained on adequately sizable and representative data, and must be validated.
*Closed and bounded.

Training, Validation, and Testing
• Basic process: given training samples
T = {(i₁, t₁), (i₂, t₂), ..., (i_Q, t_Q)}
feed them sequentially to the MLP, optimizing by backprop using gradient descent (GD). Repeat epochs until a stopping criterion is reached (based on the loss), possibly randomly reordering the samples (stochastic GD).
• Once the model is learned, apply it to a separate validation set
V = {(j₁, v₁), (j₂, v₂), ..., (j_P, v_P)}
on which the network parameters can be tuned (e.g., node density, number of layers, etc.) and a stopping point decided (e.g., beyond which the loss starts to rise from overfitting - usually from too little data).
• Finally, a separate test set is used to measure the performance of the model. All of these can be drawn from the same large dataset, but they must be disjoint.
• Lastly, the input features are usually normalized to (say) [0, 1] so that the feature ranges will not affect the results too much. (A small end-to-end sketch follows.)
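To tie the delta rule and the training loop together, here is a didactic one-hidden-layer sketch (entirely my own illustration: tanh units, the MSE loss above, per-sample gradient descent, and no validation logic):

```python
import numpy as np

def train_mlp(X, T, hidden=5, gamma=0.1, epochs=2000, seed=0):
    """One-hidden-layer MLP trained by backprop on E = 0.5*(z - t)^2
    per sample. X: (Q, P) features in [0, 1]; T: (Q,) targets in [-1, 1].
    A didactic sketch, not a production trainer."""
    rng = np.random.default_rng(seed)
    P = X.shape[1]
    W1 = rng.normal(0.0, 0.5, (hidden, P)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, hidden);      b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, T):
            y1 = np.tanh(W1 @ x + b1)            # forward: hidden layer
            z = np.tanh(W2 @ y1 + b2)            # forward: output
            d2 = (1.0 - z ** 2) * (z - t)        # delta_k = f'(x_k)(z - t)
            d1 = (1.0 - y1 ** 2) * (W2 * d2)     # hidden deltas (recursion)
            W2 -= gamma * d2 * y1;  b2 -= gamma * d2      # w <- w - g*d*y
            W1 -= gamma * np.outer(d1, x); b1 -= gamma * d1
    return W1, b1, W2, b2
```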
All recommended for surgical biopsy. • In each training phase, MLP was trained on 85, tested on one. Repeated 86 times. Called “leave one out” training. • MLP output was thresholded to classify. • 26 of 26 malignant cases identified (100% sensitivity) • 11 of 28 benign cases identified (39% specificity) 233 Area under ROC Convergence Many epochs needed before converging. 234 General Comments • Studies like this were interesting, but required too much computation for larger data. • Training is a huge burden. • HENCE: MLPs / ANNs remained a topical interest that dwindled through the 90’s. • However, another method succeeded quite well and remains popular today. 235 Comments • We will next look at spatial frequency analysis and processing of images … onward to Module 3! 236 Module 3 Fourier Transform Sinusoidal Image Discrete Fourier Transform Meaning of Image Frequencies Sampling Theorem Radial Basis Functions and Support Vector Machines QUICK INDEX 237 Sinusoidal Images • An image with the simplest frequency content is a sinusoidal image. • A discrete sine image I has elements v u I(i, j) = sin [2p( N i + M j)] for 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1 and a discrete cosine image has elements v u I(i, j) = cos [2p( N i + M j)] where u,v are integer frequencies in the iand j-directions (cycles/image). 238 Spatial Frequencies 239 Radial Frequency • The radial frequency (how fast the image oscillates in its direction of propagation) is W = u2 + v2 • The angle of the wave (relative to i-axis) is = tan-1(v/u) (Discrete Sinusoid DEMO) 240 Digital Sinusoidal Example • Let N = 16, v = 0: I(i) = cos (2pui/16): a cosine wave oriented in i-direction with frequency u. One row: u=1 or u=15 1 1 0 0 -1 0 2 4 6 8 i 10 12 14 16 -1 u=2 or u=14 1 0 0 2 4 6 8 i 10 12 14 16 -1 u=8 u=4 or u=12 1 0 0 2 4 6 8 10 12 14 16 i -1 0 2 4 6 8 10 12 14 16 i • Note that I(i) = cos (2pui/16) = cos [2p(16-u)i/16]. • Thus the highest frequency wave occurs at u = N/2 (N is even here). This will be important later. 241 Complex Exponential Image • We’ll use complex exponential functions to define the Discrete Fourier Transform. • Define the 2-D complex exponential: v u exp -2π -1 i + j for 0 i N-1,0 j M-1 N M where -1 is the pure imaginary number. • The complex exponential allows convenient representation and manipulation of frequencies. 242 Properties of Complex Exponential • We will use the abbreviation 2p WN = exp - -1 N (N = image dimension). • Hence v u exp -2π -1 i + N M vj j = WNui WM 243 Complex Exponential Image • Euler's identity: and 2p WN =cos N ui WN =cos 2π 2p - 1 sin N u u i - 1 sin 2π i N N • The powers of WN index the frequencies of the component sinusoids. 244 Simple Properties v 1 u vj -vj cos 2π i + j = WNui WM + WN-ui WM N M 2 u v 1 vj -vj sin 2π i + j = -1 WNui WM - WN-ui WM 2 N M WNui 2 u u 2 = cos 2π i + sin 2π i = 1 N N ui 1 WN tan sin 2π u u u i / cos 2π i 2π i N N N 245 Comments • Using WNui WMvj to represent a frequency component oscillating at u (cy/im) and v (cy/im) in the i- and jdirections simplifies things considerably. • It is useful to think of WNui WMvj as a representation of a direction and frequency of oscillation. 246 Complex Exponential • The complex exponential WNui 2π = exp - -1 ui N is a frequency representation indexed by exponent ui. 
• Minimum physical frequencies: u = kN, k integer WN0i = WNkNi =1 for integer i • Maximum physical frequencies: u = (k+1/2)N (period 2) WN(kN+N/2)i =1 WN(N/2)i = ( 1)i (N even) 247 DISCRETE FOURIER TRANSFORM • Any N x M image I is uniquely expressed as the weighted sum of a finite number of complex exponential images: 1 N-1 M-1 -ui -vj I(i, j) = I(u, v)W N WM NM u=0 v=0 (IDFT) • The weights I are unique. • The above is the Inverse Discrete Fourier Transform or IDFT 248 Sum of Waves Concept • The representation of images by sinusoids can be thought of in terms of interference: 249 Forward DFT • The forward transform: N-1 M-1 I(u, v) = I(i, j) WNui WMvj (DFT) i=0 j=0 • Essentially the same form as the IDFT. I and I can be uniquely obtained from one another. • Remember that (i, j) are space indices, while (u,v) are spatial frequency indices. 250 DFT Matrix • The DFT has the same dimensions (N x M) as the image I: I = I(u,v); 0 u N-1,0 v M-1 • It is a linear transformation: DFT a1I1 + a 2 I2 a L I L a1I1 + a 2 I 2 a L I L 251 DFT Matrix Properties • The DFT is generally complex: where I I Real 1 I Imag I (u, v) Real N-1 M-1 i=0 v u I(i, j) cos 2π N i + M j=0 N-1 M-1 I Imag (u, v) i=0 j v u I(i, j)sin 2π N i + M j=0 j 252 DFT Phasor • Complex DFT has magnitude and phase: I I(u, v) ;0 u N-1, 0 v M-1 I I(u, v);0 u N-1, 0 v M-1 where I(u, v) hence 2 2 I Real (u, v) IImag (u, v) I(u, v) = tan -1 IImag (u,v)/I Real (u,v) I(u, v) = I(u, v) exp 1I(u, v) 253 The Importance of Phase • As in 1-D the magnitude of the DFT is displayed most often. • The DFT phase usually appears unrevealing • Yet the phase is at least as important Example of Importance of Phase 254 Symmetry of the DFT • The DFT is conjugate symmetric: I(N-u, M-v) = I* (u, v) ; 0 u N-1, 0 v M-1 since N-1 M-1 I(N-u, M-v) = (N-u)i (M-v)j I(i, j) W WM N i=0 j=0 N-1 M-1 = Ni -ui Mj -vj I(i, j) W N WN WM WM i=0 j=0 N-1 M-1 = I(i, j) i=0 j=0 * ui vj WN WM = I* (u, v) 255 More Symmetry Properties • The symmetry of the DFT matrix implies that it is redundant. • We also have and I Real (N-u, M-v) = I Real (u, v) I Imag (N-u, M-v) = IImag (u, v) I(N-u, M-v) = I(u, v) I(N-u, M-v) = I(u, v) for 0 < u < N-1, 0 < v < M-1 256 Displaying the DFT • The DFT of an image is usually displayed as images of magnitude and of phase. • The magnitude and phase values are given gray-scale values / intensities. • The phase is usually visually meaningless. • The magnitude matrix is usually logarithmically transformed (followed by a FSCS) prior to display: log 1+ I(u, v) 257 I I I log 1 I DC 258 • Note that the coefficients of the highest physical frequencies are located near the center of the DFT matrix: near (u, v) = N/2, M/2). v (0, 0) (0, M-1) low freqs low freqs high freqs u low freqs (N-1, 0) low freqs (N-1, M-1) 259 Periodicity of the DFT • The DFT matrix is finite (N x M): I = I(u,v); 0 u N-1,0 v M-1 • Yet if the indices are allowed to range outside of 0 < u < N-1, 0 < v < M-1, then the DFT is periodic with periods N and M: I(u+nN, v+mM) = I(u, v) for any integers n and m. 
260 Proof of DFT Periodicity N-1 M-1 I(u+nN, v+mM) = (u+nN)i (v+mM)j I(i, j) W WM N i=0 j=0 N-1 M-1 = ui vj nNi mMj I(i, j) W N WM WN WM i=0 j=0 N-1 M-1 = ui vj I(i, j) W N WM = I(u, v) i=0 j=0 • This is called the periodic extension of the DFT 261 (0,0) Periodic extension of DFT 262 Periodic Extension of Image • The IDFT equation 1 N-1 M-1 -ui -vj I(i, j) = I(u, v)W N WM NM u=0 v=0 implies the periodic extension of the image as well: I(i+nN, j+mM) = I(i, j) 263 Proof of Image Periodicity 1 N-1 M-1 -u(i+nN) -v(j+mM) I(i+nN, j+mM) = I(u, v)W WM N NM u=0 v=0 1 N-1 M-1 -ui -vj -unN -vmM = I(u, v)W WM N WM WN NM u=0 v=0 1 N-1 M-1 -ui -vj = I(u, v)W N WM = I(i, j) NM u=0 v=0 • When the DFT is used, it implies the periodicity of the image. This is important when the DFT is used – e.g. for convolution. 264 Periodic extension of image 265 Centering the DFT • Usually, the DFT is displayed with DC coordinate (u, v) = (0, 0) at the center. • Then low frequency info (which dominates most images) will cluster at the center of the display. • Centering is accomplished by taking the DFT of the alternating image: (-1)i+jI(i, j) • This is for display only! 266 Centering the DFT • Note that so -jM/2 (-1)i+j (-1)i (-1) j WN-iN/2 WM N-1 M-1 DFT (-1) I(i, j) = i+j i+j ui vj I(i, j) (-1) W N WM i=0 j=0 N-1 M-1 = I(i, i=0 j=0 N-1 M-1 = ui vj - N/2 i - M/2 j j) WN WM WN WM u- N/2 i I(i, j) WN i=0 j=0 = I u- N/2 , v- M/2 v- M/2 j WM 267 Shifted (centered) DFT from periodic extension 268 Centered DFT v (-N/2, -M/2) (-N/2, M/2) high high low u centered high (N/2, -M/2) high (N/2, M/2) • DFT Example DEMO 269 Computation of the DFT • Fast DFT algorithms collectively referred to as the Fast Fourier Transform (FFT). • We won’t study these – take a DSP class. • Available in any math software library. • Forward and inverse DFTs essentially identical. 270 THE MEANING OF IMAGE FREQUENCIES • Easy to lose the meaning of the DFT and frequency content in the math. • We may regard the DFT magnitude as an image of frequency content. • Bright regions in the DFT magnitude "image" correspond to frequencies having large magnitudes in the actual image. • DFT Examples: DEMO 271 IMAGE GRANULARITY • Large DFT coefficients near the origin suggest smooth image regions. • Images are positive, so DFTs usually have a large peak at (u, v) = (0, 0). • The distribution of DFT coefficients is related to the granularity / “busy-ness” of the image. 272 MASKING DFT GRANULARITY • Define toroidal zero-one masks (white = 1) low-frequency mask mid-frequency mask high-frequency mask • Masking (multiplying) a DFT with these will produce IDFT images with only low-, middle-, or high frequencies: DEMO 273 Image Directionality • Large DFT coefficients along certain orientations correspond to highly directional image patterns. • The distribution of DFT coefficients as a function of angle relative to the axes is related to the directionality of the image. 274 MASKING DFT DIRECTIONALITY • Define oriented, angular zero-one masks: • The frequency origin is at the center of each mask: DEMO 275 THE DISCRETE-SPACE FOURIER TRANSFORM (DSFT) • The DFT I of I is NOT the Fourier Transform of I! The Discrete-Space Fourier Transform: I D (ω, λ) = I(i, j) e i=- j=- I(i, j) = 1 (2π) π π 2 - -1 ωi+λj -π -π ID (ω, λ) e -1 ωi+λj (DSFT) dω dλ (IDSFT) 276 Discrete-Space Fourier Transform (DSFT) • 2pPeriodic in frequency along both axes. • No implied spatial periodicity. • Continuous in spatial frequency. • Transform is asymmetric (sum / integral). 
277 Relating DSFT to DFT • The DFT is obtained by sampling the DSFT: I(u, v) = I (ω, λ) 2π 2π D ω= u, λ= v N M ; u = 0,..., N-1, v = 0,..., M-1 • Although the DFT samples the DSFT, it is a complete description of the image I. 278 Sampling • Let’s study the relationships between the DFT/DSFT and the Fourier transform of the original, unsampled image. • Digital image I is a sampled version of a continuous distribution of image intensities I C (x, y) incident upon a sensor 279 Continuous Fourier Transform • The continuous image I C (x, y) has a Continuous Fourier Transform (CFT) IC (W, L ) where (x, y) L) are space are space coordinates and (W, frequencies: - -1 xW +yL IC (W, L ) = - - IC (x, y) e 1 I C (x, y) = I (W, L ) e 2 - - C 2π dx dy -1 xW +yL (CFT) dW dL (ICFT) 280 Continuous Fourier Transform (CFT) • Not periodic in frequency or space. • Continuous in spatial frequency. • Transform is symmetric (integral/integral). 281 Relating the CFT and DSFT/DFT • Assume IC (W, L ) is bandlimited, or zero outside a certain range of frequencies: IC (W, L ) = 0 for W W0 , L L 0 L0 L W0 W 282 On Bandlimitedness • Any real-world image is effectively bandlimited (its CFT becomes vanishingly small for large W, L). • If it were not so the image would contain infinite energy - - IC (x, y) 2 dx dy = 1 2π I (W, 2 - - C 2 L ) dW dL 283 Image Sampling • I(i, j) samples the continuous image with spacings X, Y in the x-, y-directions: I(i, j) = IC (iX, jY) ; i = 0 ,..., N-1, j = 0 ,..., M-1 • The DSFT and CFT are related by: 1 n m I D (ω, λ) = IC W - , L - XY n=- m=- X Y (Ω,Λ) = 1 ω λ ( , ) 2π X Y 1 1 1 = I ω-2πn , λ-2πm C XY n=- m=- 2πX 2πY 284 Relating DFT to CFT • Have: I(u, v) = I D (ω, λ) 2π 2π ω= u, λ= v N M 1 1 u 1 v = I -n , -m C XY n=- m=- X N Y M • A sum of shifted versions of the sampled CFT. It is periodic in the u- and v-directions with periods (1/X) and (1/Y), respectively. 285 1 X 1 Y u v 28 6 Unit-Period Case • We can always set X = Y = 1: I(u, v) = u IC N -n , n=- m=- v -m M 287 Relating the Transforms SPACE SPATIAL FREQUENCY CFT Not Periodic Not sampled Not periodic Not sampled DSFT Not Periodic Sampled Periodic Not sampled Periodic Sampled Periodic Sampled DFT 288 Sampling Theorem • If W0 > (1/2X) or L0 > (1/2Y), the replicas of the CFT will overlap (sum up), distorting them. This is called aliasing. • The digital image can be severely distorted. • To avoid aliasing, the sampling frequencies (1/X) and (1/Y) and must be at least twice the highest frequencies W0 and L0 in the continuous image. 289 Comments on Sampling • A mathematical reason why images must be sampled sufficiently densely. If violated, image distortion can be visually severe. • If Sampling Theorem is satisfied, then the DFT is periodic (sampled) replicas of the CFT. 290 Aliased Chirp Image • A chirp image I(x, y) = A 1+cos(f (x, y)) =A 1+cos(ax 2 + by2 ) has instantaneous spatial frequencies (Winst , L inst ) = φ x (x, y),φ y (x, y) = 2 ax, by which increase linearly away from the origin. 291 Aliased Image Sand Dune Image Centered DFT Showing Aliasing 292 Some Important Closed-Form DFTS • Only a few closed form DFTs. • Let I(i, j) = c for 0 ≤ i ≤ N-1, 0 ≤ j ≤ M-1 • Then I(u, v) = c NM δ(u, v) where 1; u=v= 0 δ(u, v) = = unit impulse 0 ; else 293 2-D Unit Pulse Image • Let I(i, j) = c·d(i, j) • Then N-1 M-1 I(u, v) = cδ(i, j) i=0 j=0 0 0 = cWN WM = c ui vj WN WM (constant DFT) • This is extremely important! 
294 Cosine Wave Image • Let c b I(i, j) = d cos 2π i + N M • Then d bi cj -cj j WN WM WN-bi WM 2 N-1 M-1 d I(u, v) = WNbi Wcj WN-bi W -cj WNui W vj M M M 2 i=0 j=0 d N-1 M-1 (u+b)i (v+c)j (v-c)j = WN WM WN(u-b)i WM 2 i=0 j=0 d = NM δ(u-b, v-c) + δ(u+b, v+c) 2 295 Sine Wave Image • Likewise, if c b I(i, j) = d sin 2π i + N M j then I(u, v) = d NM -1 δ(u-b, v-c) -δ(u+b, v+c) 2 • Sinusoids are concentrated single frequencies 296 DFT AS SAMPLED CFT • Now some CFT pairs that are either difficult to express or are too lengthy to do by hand as DFT (or even as DSFT) pairs. • If sampled adequately, these are good approximations to the DSFT/DFT (with appropriate scaling). 297 Rectangle Function • Let A B x y c ; x , y I C (x, y) = c rect rect = 2 2 A B 0 ; else • Then IC (W, L ) = c A B sinc AW sinc BL where 1 sin πx 1; x rect x = 2 and sinc(x) = πx 0 ; else 298 Sinc Function • Let IC (x, y) = c sinc ax sinc by then W L c ; W a, L b IC (W, L ) = c rect rect = 2a 2b 0 ; else 299 Gaussian Function • Let then IC (x, y) = exp x 2 y2 / σ 2 IC (W, L) = exp 2π 2σ 2 W 2 L 2 • The Fourier transform of a Gaussian is also Gaussian – an unusual property. • DEMO 300 Radial Basis Function Networks 301 RBF Nets • Shallow (3-layer, including the input) networks with a specific type of activation function. • Also uses a concept of a center vector. • Given an input I = {i(p); p = 1, …, P} each neuron indexed k of the hidden layer has a center vector ck. • The center vectors are initialized randomly or by clustering (e.g., using k-means). 302 RBF Nets K • The network output is then: f(i ) = w(k)ρ i -c k k 1 • Usually f is gaussian and the norm is Euclidean distance so ρ i -c k = exp β i -c 2 k • Usually a normalized version is used: K f(i) = w(k)ρ i-c k k 1 K ρ i-c k 1 k • RBF networks are also universal approximators, and perform well with a shallow architecture. 303 Support Vector Machines 304 Recall the Perceptron • It’s a basic binary classifier w(1) i(1) w(2) i(2) input i ··· i(3) i(P) ··· w(3) w(P) output y y sgn w T i b 305 Maximum Margin Solution • Assume linearly separable, normalized training data: 2 w • Then there will be a pair of parallel lines / planes / hyperplanes having maximum separation / distance d. • The distance d is the margin, with width given by the equation for a point to a line / plane / hyperplane. wTi b 1 w i b1 • Want separation as large as possible so ||w|| small support vectors T wTi b 0 306 Empty Margin • Can’t have points fall within the margin, so when yp = 1 need wTip – b > 1 yp = 1 need wTip – b < -1 or yp(wTip – b) > 1 for p = 1, …, P. • The optimization is then minimize ||w|| subject to yp(wTip – b) > 1 for p = 1, …, P. Note: This is completely defined by just the support vectors! • These are hard constraints so called hard margin classifier. 307 Hinge Function • For linearly separable classes the hard classifier is the SVM! • If nonlinearly separable define the hinge loss function: max(0, 1 - yp[wTip – b)] for p = 1, …, P. where yp is the pth known label and wTip – b is the pth output. • Zero when ip is on correct side of the “margin,” else wTip – b is distance from the “margin.” Minimize the soft margin problem: 1 P T w max 0, 1 y w i b p p P p 1 2 308 Kernel Trick • Brilliant idea due to Vladimir Vapnik: use a kernel to cast the data into a higher dimensional feature space, where the data classes may nicely separate! Nonlinear mapping to a highdimensional space. 
309 Kernel Trick • HOW: The details are beyond our class, but for a start, the minimization of the linear SVM functional can be solved by transforming it into a quadratic optimization problem. • This has the great advantage of avoiding local minima of the original functional. • Solving the quadratic problem requires transforming it into a dual problem which is expressed in terms of dot products between input samples ip and iq, p, q = 1, …, P. • The kernel trick is to replace these dot products by kernel functions of the data: ip iq k(ip, iq). • Most commonly k is gaussian: exp(l||ipiq||2). 310 SVM Summary • The “deep network” of its day. Extremely popular and used in numerous image analysis problems. We’ll see some later. • Can handle very high-dimensional problems with ease – such as image analysis! • Computationally easy since a shallow network architecture! • Overtraining much less of an issue than deep networks. • Still widely used when there isn’t enough data to train a deep neural network or when the problem isn’t too difficult. 311 Comments • We now have a basic understand of frequency-domain concepts • Let’s put them to use in linear filtering applications… onward to Module 4. 312 Module 4 Linear Image Filtering Wraparound and Linear Convolution Linear Image Filters Linear Image Denoising Linear Image Restoration (Deconvolution) Filter Banks QUICK INDEX 313 WRAPAROUND CONVOLUTION • Modifying the DFT of an image changes its appearance. For example, multiplying a DFT by a zero-one mask predictably modifies image appearance: 314 Multiplying DFTs • What if two arbitrary DFTs are (pointwise) multiplied or one (pointwise) divides the other? J = I1 I 2 or J = I1 I 2 • The answer has profound consequences in image processing. • Division is a special case which need special handling if I 2contains near-zero or zero values. 315 Multiplying DFTs • Consider the product J = I1 I 2 • This has inverse DFT 1 N-1 M-1 -ui -vj J(i, j) = J(u, v)W N WM NM u=0 v=0 1 N-1 M-1 -vj = I1 (u, v)I 2 (u, v)WN-ui WM NM u=0 v=0 1 N-1 M-1 N-1 M-1 N-1 M-1 -ui -vj um vn up vq = I (m, n)W W I (p, q)W W 1 N M 2 N M WN WM NM u=0 v=0 m=0 n=0 p=0 q=0 316 heck 317 N-1 M-1 N-1 M-1 1 N-1 M-1 u p+m-i v q+n-j = I1 (m, n) I 2 (p, q) WN WM NM m=0 n=0 p=0 q=0 u=0 v=0 N-1 M-1 1 N-1 M-1 = I1 (m, n) I 2 (p, q) NM δ(p+m-i, q+n-j) NM m=0 n=0 p=0 q=0 = N-1 M-1 I1 (m, n) I 2 i-m N , j-n M m=0 n=0 N-1 M-1 = I1 i-p N , j-q M I 2 (p, q) p=0 q=0 = I1 (i, j) I 2 (i, j) I1 I2 is the wraparound convolution of I1with I 2 Note: p N = p mod N 318 Wraparound Convolution • The summation J(i, j) = I1 (i, j) I 2 (i, j) = N-1 M-1 I1 (m, n) I2 i-m N , j-n M m=0 n=0 is also called cyclic convolution and circular convolution. • Like linear convolution, it is an inner product between one sequence and a (doubly) reversed, shifted version of the other – except with indices taken modulo-M,N. 319 Depicting Wraparound Convolution • Consider hypothetical images I1 and I 2 j (0,0) (0,0) i Image I 1 Image I 2 at which we wish to compute the cyclic convolution at (i, j) in the spatial domain (without DFTs). 320 • Without wraparound: (0,0) j i I 2 Doubly-reversed and shifted (N-1, M-1) • Modulo arithmetic defines the product for all 0 < i < N-1, 0 < j < M-1. 321 (0,0) i j If one 2D function is filtering the other, then it filters together the left/right and top/bottom sides of the image! 
(N-1, M-1) Overlay of periodic extension of shifted I 2 Summation occurs over 0 < i < N-1, 0 < j < M-1 322 LINEAR CONVOLUTION • Wraparound convolution is a consequence of the DFT, which is a sampled DSFS. • If two DSFTs are multiplied together: J D (ω, λ) = I D1 (ω, λ)I D2 (ω, λ) then useful linear convolution results: J(i, j) = I1 (i, j) I 2 (i, j) • Wraparound convolution is an artifact of sampling the DSFT – which causes spatial periodicity. 323 About Linear Convolution • Most of circuit theory, optics, and analog filter theory is based on linear convolution. • And … (linear) digital filter theory also requires the concept of digital linear convolution. • Fortunately, wraparound convolution can be used to compute linear convolution. 324 Linear Convolution by Zero Padding • Adapting wraparound convolution to do linear convolution is conceptually simple. • Accomplished by padding the two image arrays with zero values. • Typically, both image arrays are doubled in size: 325 0 Image I1 (zero padded) 0 Image I2 (zero padded) 2N x 2M zero padded images • Wraparound eliminated, since the "moving" image is weighted by zero values outside the image domain. • Can be seen by looking at the overlaps when computing the convolution at a point (i, j): 326 Wraparound Cancelling Visualized Linear convolution by zero padding • Remember, the summations take place only within the blue shaded square (0 ≤ i ≤ 2N-1, 0 ≤ j ≤ 2M-1). 327 DFT Computation of Linear Convolution • Let J , I1 , I2 be zero - padded 2N 2M versions of J I1 I 2 . Then if J = I1 I2 = IFFT2N 2M FFT2N 2M I1 FFT2N2M I2 then the NxM image with elements J(i, j) = J (i, j) ; N 2 1 i 3N M 2 , 2 1 j 3M 2 contains the linear convolution result. 328 On DFT-Based Linear Convolution • By multiplying zero-padded DFTs, then taking the IFFT, one obtains J = I I 1 • • • 2 The linear convolution is larger than NxM (in fact 2Nx2M) but the interesting part is contained in NxM J. To convolve an NxM image with a small filter (say PxQ), where P,Q < N,M: pad the filter with zeros to size NxM. If P,Q << N,M, it may be faster to perform the linear convolution in the space domain. 329 Direct Linear Convolution • Assume I1 and I 2 are not periodically extended (not using the DFT!), and assume that I1 (i, j) I 2 (i, j) = 0 whenever i < 0 or j < 0 or i > N-1 or j > M-1. • In this case J(i, j) = I1 (i, j) I 2 (i, j) = N-1 M-1 I1 (m, n) I2 i-m, j-n m=0 n=0 330 LINEAR IMAGE FILTERING • A process that transforms a signal or image I by linear convolution is a type of linear system. 
optical image I series of lenses with MTF transformed image J=I *H H electrical current MTF = modulation transfer function digital image I lumped electrical circuit with IR I H output current J = I*H IR = impulse response digital image filter output image H J=I *H Of interest to us 331 Goals of Linear Image Filtering • Process sampled, quantized images to transform them into - images of better quality (by some criteria) - images with certain features enhanced - images with certain features de-emphasized or eradicated 332 impulse noise gaussian white noise blur Albert JPEG compression Variety of Image Distortions 333 Characterizing Linear Filters • Any linear digital image filter can be characterized in one of two equivalent ways: (1) The filter impulse response H = H(i, j) = H(u, v) (2) The filter frequency response H • These are a DFT pair: = DFT H H H = IDFT H 334 Frequency Response • The frequency response describes how the system effects each frequency in an image that is passed through the system. • Since v) = H(u, v) exp H(u, v) 1H(u, an image frequency component at (u, v) = (a, b) is b) and amplified or attenuated by the amount H(a, b) shifted by the amount H(a, 335 Frequency Response Example • The input to a system H is a sine image: c b I(i, j) = cos 2π i + N M 1 cj -cj j = WNbi WM +WN-bi WM 2 • The output is 1 N-1 M-1 -b i-m -c j-n b i-m c j-n J(i, j)=H(i,j) I(i, j)= H(m, n) WN WM +WN WM 2 m=0 n=0 N-1 M-1 1 -bi -cj N-1 M-1 bm cn 1 bi cj -cn WN WM H(m, n)WN WM WN WM H(m, n) WN-bm WM 2 2 m=0 n=0 m=0 n=0 1 -cj cj c) cos 2π b i+ c j H(b, c) WN-bi WM H(b, c) WNbi WM H(-b, -c) = H(b, N M 2 336 Impulse Response • The response of system H to the unit impulse 1;i = j = 0 δ(i, j) = 0; else • An effective way to model responses since every input image is a weighted sum of unit pulses I(i, j) = N-1 M-1 N-1 M-1 m=0 n=0 m=0 n=0 I i-m, j-n δ(m, n)= I m, n δ(i-m, j-n) 337 Linear Filter Design • Often a filter is to be designed according to frequency-domain specifications. • Models of linear distortion in the continuous domain lead to linear digital solutions. 338 Sampled Analog Specification • Given an analog or continuous-space spec: H C (x, y) H C (W, L ) • Sampled in space (X=Y=1): H(i, j) = H C (i, j) ; - < i, j < (1) • DFT [by sampling DSFT of (1)]: v) = H(u, u v H -n , -m C N M n=- m=- (2) 339 Simple Design From Continuous Prototypes • Two simplest methods of designing linear discrete-space image filters from continuous prototypes: (1) Space-Sampled Approximation (2) Frequency-Sampled Approximation • Derive from formulae (1), (2) on previous slide. 340 Space-Sampled Approximation • Truncate (1): Htrunc (i, j) = HC (i, j) for 0 < |i| < (N/2)-1, 0 < |j| < (M/2)-1. • A truncation of the analog spec - Gibbs phenomena will occur at jump discontinuities of H C (W, L ) • The frequency response is IDFT H trunc (u, v) H trunc (i, j) 341 Frequency Sampled Approximation • Use m= n = 0 term in (2) – assuming negligible aliasing u v H fs (u, v) = H C , N M for 0 < |u| < (N/2)-1, 0 < |v| < (M/2)-1. • The DSFT is NOT specified between samples • CFT is centered and non-periodic. • The discrete impulse response is then DFT (u, v) H fs (i, j) H fs 342 Low-Pass, Band-Pass, and High-Pass Filters • The terms low-pass, band-pass, and high-pass are qualitative descriptions of a system's frequency response. • "Low-pass" - attenuates all but the "lower" frequencies. • "Band-pass" - attenuates all but an intermediate range of "middle" frequencies. • "High-pass" - attenuates all but the "higher" frequencies. 
• We have seen examples of these: the zero-one frequency masking results. 343 Generic Uses of Filter Types • Low-pass filters are typically used to - smooth noise - blur image details to emphasize gross features • High-pass filters are typically used to - enhance image details and contrast - remove image blur • Bandpass filters are usually special-purpose 344 Example Low-Pass Filter • The gaussian filter with frequency response hence 2 H C (W, L ) = exp -2 πσ W2 L 2 2 2 u 2 v 2 H(u, v) = exp -2π σ N M which quickly falls at larger frequencies. • The gaussian is an important low-pass filter. 345 Gaussian Filter Profile 1.0 1.0 0 0 u u N = 32, s = 1 N = 32, s = 1.5 Plots of one matrix row (v = 0) DEMO 346 Example Band-Pass Filter • Can define a BP filter as the difference of two LPFs identical except for a scaling factor. • A common choice in image processing is the difference-of-gaussians (DOG) filter: 2 2 H C (W, L ) = exp -2 πσ W2 L 2 - exp -2 Kπσ W2 L 2 hence u 2 v 2 u 2 v 2 2 2 H(u, v) = exp -2 πσ - exp -2 Kπσ N M N M • Typically, K ≈ 1.5. 347 DOG Filter Profile 1.0 N=32 0 u • DOG filters are very useful for image analysis – and in human visual modelling. • DEMO – Take K=1.5, s < 5 348 Example High-Pass Filter • The Laplacian filter is also important hence H C (W, L ) = A W2 L 2 u 2 v 2 H(u, v) = A N M although this is a severely truncated approximation! Best used in combination with another filter (later). • An approximation to the Fourier transform of the continuous Laplacian: 2 2 2 = 2 2 x y 349 Laplacian Profile 1.0 A = 4.5, N = 32 0 u • DEMO 350 LINEAR IMAGE DENOISING • Linear image denoising is a process to (try to) smooth noise without destroying the image information. • The noise is usually modeled as additive or multiplicative. • We consider additive noise now. • • Multiplicative noise is better handled by a homomorphic filtering that uses nonlinearity. 351 Additive White Noise Model • Model additive white noise as an image N with highly chaotic, unpredictable elements. • Can be thermal circuit noise, channel noise, sensor noise, etc. • Noise may effect the continuous image before sampling: J C (x,y) = IC (x,y) + N C (x,y) observed original white noise 352 Zero-Mean White Noise • The white noise is zero-mean if the limit of the average of P arbitrary noise image realizations vanishes as P → ∞: 1 P N C,p (x, y) 0 for all (x,y) as P P p=1 • On average, the noise falls around the value zero.* *Strictly speaking, the noise is also "mean-ergodic." 353 Spectrum of White Noise • The noise energy spectrum is N C (W, L ) = N C (x, y) • If the noise is white, then, on average, the energy spectrum will be flat (flat spectrum = ‘white’): 1 P N C,p (W, L ) η for all (W, L ) as P P p=1 • Note: η2 is called noise power. 354 White Noise Model • White noise is an approximate model of additive broadband noise: J C (x,y) = IC (x,y) + observed original I C(x) N C (x,y) broadband noise NC (x) + x x I C (W ) + W Just a depiction – Magnitudes aren’t added N C (W ) W 355 Linear Denoising • Objective: Remove as much of the high-frequency noise as possible while preserving as much of the image spectrum as possible. • Generally accomplished by a LPF of fairly wide bandwidth (images are fairly wideband): H C (W ) 0 W 356 Digital White Noise • We make a similar model for digital zero-mean additive white noise: J = I + N observed original noise • On average, the elements of N will be zero. 
• The DFT of the noisy image is the sum of the DFTs of the original image and the noise image: J = I + N observed original noise • On average the noise DFT will contain a broad band of frequencies. 357 White Noise Maker Demo Denoising - Average Filter • To smooth an image: replace each pixel in a noisy image by the average of its M x M neighbors: 1/M 2 1/M 2 0 3x3 window 1/M 2 2 1/M convolution template 358 Average Filter Rationale • Averaging elements reduces the noise mean towards zero. • The window size is usually an intermediate value to balance the tradeoff between noise smoothing and image smoothing. • Typical average filter window sizes: L x L = 3 x 3, 5 x 5 ,..., 15 x 15 (lots of smoothing), e.g. for a 512 x 512 image. 359 Average Filter Rationale • Linear filtering the image (with zero-padding assumed hereafter) K = H J = H I + H* N =H J = H I + H N K will affect image / noise spectra in the same way: H(u) H(u) 1 1 smaller L u larger L DEMO u 360 Denoising – Ideal Low-Pass Filter • Also possible to use an ideal low-pass filter by designing in the DFT domain: 1 ; if u 2 +v 2 U cutoff H(u, v) = 0 ; otherwise • Possibly useful if its possible to estimate the highest important radial frequency U cutoff in the original image. 361 Ideal LPF (0, 0) u v U cutoff (N-1, M-1) centered DFT DEMO 362 Denoising - Gaussian Filter • The isotropic Gaussian filter is an effective smoother: 2 2 u 2 +v 2 H(u, v) = exp -2π σ 2 N • It gives more weight to “ closer” neighbors. • DFT design: Set the half-peak bandwidth to U cutoff by solving for s: 2 U 1 exp -2π 2σ 2 cutoff 2 N 2 DEMO N N σ= log 2 0.19 πU cutoff U cutoff 363 Summary of Smoothing Filters Space • Average filter: Noise leakage through frequency ripple (spatial discontinuity) • Ideal LPF: Ringing from spatial ripple (frequency discontinuity) • Gaussian: No discontinuities. No leakage, no ringing. Frequency H(i) H(u) H(i) H(u) H(i) H(u) 364 Minimum Uncertainty • Amongst all real functions and in any dimension, the noripple Gaussian functions uniquely minimize the uncertainty principle: 2 xf (x, y) 2 dx uf (u, v) du 1 f (x, y) 2 dx f (u, v) 2 du 4 • Similar for y, v. • They have minimal simultaneous space-frequency durations. 365 LINEAR IMAGE DEBLURRING • Often an image that is obtained digitally has already been corrupted by a linear process. • This may be due to motion blur, blurring due to defocusing, etc. • We can model such an observed image as the result of a linear convolution: J C (x, y) = G C (x, y) I C (x, y) so observed linear distortion original J C (W, L ) = G C (W, L ) IC (W, L ) observed linear distortion original 366 Digital Blur Function • The sampled image will then be of the form (assuming sufficient sampling rate) hence J = G I I J = G • The distortion G is almost always low-pass (blurring). • Our goal is to use digital filtering to reduce blur – a VERY hard problem! famous example 367 Deblur - Inverse Filter • Often it is possible to make an estimate of the distortion G. • This may be possible by examining the physics of the situation. • For example, motion blur (relative camera movement) is usually along one direction. If this can be determined, then a filter can be designed. • The MTF of a camera can often be determined – and hence, a digital deblur filter designed. 368 Deconvolution • Reversing the linear blur G is deconvolution. It is done using the inverse filter of the distortion: G inverse (u, v) = 1 G(u, v) provided that G(u, v) 0 for any (u, v). • Then the restored image is: I I ! 
G K G inverse 369 Blur Estimation • An estimate of blur G might be obtainable. • The inverse of low-pass blur is high-pass: 1.0 80 60 40 20 u Gaussian distortion 0 u Inverse filter • At high frequencies the designer must be careful! • Note: The inverse takes value 1.0 at (u, v) = (0,0) DEMO 370 Deblur - Missing Frequencies • Unfortunately, things are not always so "ideal" in the real world. • Sometimes the blur frequency response takes zero value(s). • If , v ) = 0 for some (u , v ), then G G(u 0 0 0 0 inverse (u0 , v 0 ) = which is meaningless. 371 Zeroed Frequencies • The reality: any frequencies that are zeroed by a linear distortion are unrecoverable in practice (at least by linear means) - lost forever! • The best that can be done is to reverse the distortion at the non-zero values. • Sometimes much of the frequency plane is lost. Some optical systems remove a large angular spread of frequencies: unrecoverable "zeroed" frequencies (0, 0) Frequency Domain 372 Pseudo-Inverse Filter • The pseudo-inverse filter is defined 1/G(u, v) ; if G(u, v) 0 G p-inverse (u, v) = ; if G(u, v) 0 0 • Thus no attempt is made to recover lost frequencies. • The pseudo-inverse is set to zero in the known region of missing frequencies – a conservative approach. • In this way spurious (noise) frequencies will be eradicated. DEMO (Noisy case: use snoise = 1, sblur < 1) 373 Deblur in the Presence of Noise • A worse case is when the image I is distorted both by linear blur G and additive noise N: J = G I + N • This may occur, e.g., if an image is linearly distorted then sent over a noisy channel. • The DFT: I + N J = G 374 Filtering a Blurred, Noisy Image • Filtering with a linear filter H will produce the result or K = H J = H G I + H N I + H H J = H G N K • The problem is that neither a low-pass filter (to smooth noise, but won't correct the blur) nor a highpass filter (the inverse filter, which will amplify the noise) will work. 375 Failure of Inverse Filter • If the inverse filter were used, then or K = Ginverse J = I + Ginverse N G K inverse J = I + G inverse N • In this case the blur is corrected, but the restored image has horribly amplified high-frequency noise added to it. 376 Wiener Filter • The Wiener filter (after Norbert Wiener) or minimum-meansquare-error (MMSE) filter is a “best” linear approach. • The Wiener filter for blur G and white noise N is G Wiener (u, v) = (u, v) G 2 G(u, v) + η2 • Often the noise factor h is unknown or unobtainable. The designer will usually experiment with heuristic values for h. • In fact, better visual results may often be obtained by using values for h in the Wiener filter. 377 Wiener Filter Rationale • We won’t derive the Wiener filter here. But: • If h = 0 (no noise), the Wiener filter reduces to the inverse filter: (u, v) G 1 G Wiener (u, v) = G(u, v) 2 = G(u, v) which is highly desirable. 378 Wiener Filter Rationale v) = 1 for all (u, v) (no blur) the Wiener filter • If G(u, reduces to: G Wiener (u, v) = 1 1+ η2 which does nothing except scale the variance so that the MSE is minimized. • So, the Wiener filter is not useful unless there is blur. 379 Pseudo-Wiener Filter • Obviously, if there are frequencies zeroed by the linear distortion G then it is best to define a pseudo-Wiener filter: (u, v) G ; if G(u, v) 0 2 2 G (u, v) = G(u, v) + η Wiener ; if G(u, v) = 0 0 • Noise in the "missing region" of frequencies will be eradicated. 
• DEMO (sblur < 4, snoise < 10) 380 APPLICATION EXAMPLE: OPTICAL SERIAL SECTIONING • Optical systems often blur images: visible light blurred image scene optical system one possible solution: blurred image digitize "inverse blur" computer program deblurred image 381 Optical Sectioning Microscopy incremental vertical translation of microscope (step motor) region of best focus • A very narrow-depth of field microscope. One image taken at each focusing plane, giving a sequence of 2-D images - or 3-D image of optical density. 382 3-D Image of Optical Density DFT 3-D image of optical density Magnitude of 3-D DFT 383 3-D Optical System Analysis • • In this system three effects occur: (1) A linear low-pass distortion G. (2) A large biconic region of frequencies aligned along the optical axis is zeroed. (3) Approximately additive white noise. Items (1) and (2) shown using principles of geometric optics. Item (3) shown empirically. 384 3-D Biconic Spread of Lost Frequencies 3D Frequency Domain (0, 0, 0)) Region of zeroed 3-D frequencies • Note that DC (u, v, w) = (0, 0, 0) is zeroed also. Hence the background level (AOD) is lost. • There is no linear filtering way to recover this biconic region of frequencies. 385 3-D Restoration • So: the 3-D images are blurred, have a large 3-D region of missing frequencies, and are corrupted by low-level white noise added. • The processed results show the efficacy of - pseudo-inverse filtering - pseudo-Wiener filtering applied to two optical sectioned 3-D images: - a pollen grain - a pancreas Islet of Langerhans (collection of cells) • EXAMPLES 386 Filter Banks 387 Generic Filter Banks H0 D U G0 H1 D U G1 Image I S HP -1 D Bandpass Down“analysis” samplers filter bank U Image Processing Upsamplers Reconstructed Image I GP -1 Bandpass “synthesis” filter bank • By design of filters Hp the image can be recovered closely or exactly (a “discrete wavelet transform” or DWT) … if nothing is done in-between. • There may or may not be down- and up-samplers. 388 Two Views of Filter Banks • There are many ways of looking at filter banks. • We will look at two: (1) So-called perfect reconstruction filter banks. Typically maximally sampled for efficiency. (2) Filter banks, w/o perfect reconstruction and perhaps w/o sampling. Frequency analyzer filter banks. • The view (1) is important when complete signal integrity is required, e.g., compression, denoising, etc • The view (2) is useful/intuitive when doing image analysis. • Filter banks can fall in either or both categories 389 Perfect Reconstruction Filter Banks 390 Up- and Down-Sampling 1D Down-sampler I(i) D J(i) J(i) = I(iD) Throw away D-1 of every D samples) • • • • 1D Up-sampler I(i) U L(i) = L(i) I(i/U) ; i = kU ; else 0 Insert U-1 zeros after every sample Usually D = U NOT inverses of each other. Down-sampling throws away information Up-sampling does not add information 391 Sampling • Critical sampling (D = P). Same number of output points as input. Possibly no info lost. H0 D H1 D HP-1 D Image I • Oversampling (D < P). More points than input image. May provide resilience. • Undersampling (D > P). Fewer points than input image. Info lost. • Unsampled (D = 1). Highly redundant. 392 Analysis Filters • Typically have separable impulse responses: Hp(i, j) = Hp(i)Hp(j) so a filtered image is: J(i, j) = Hp(i, j)*I(i, j) = Hp(i)*[Hp(j)*I(i, j)] a 1-D convolution along columns then rows (or vice-versa). 
393 Analysis Filter Types • Idea: divide the frequency band along each axis: p=0 p=1 p=2 p=3 (u) H p P=4 0 u N-1 • Analysis/process image information (compress, feature extract, noise remove, etc) in each band. 394 Two Band Case p=1 p=0 (u) H p 0 “Low” band P=2 “High” band u N-1 • The two-band case is easy to analyze. • We can use it to efficiently build filter banks. 395 Two-band Decomposition “Low” band along rows Image I “Low” band H0 2 2 H1 2 2 “High” band along rows G0 G1 Reconstructed Image S I “High” band • Naturally subsampling causes aliasing. • However, by proper choice of filters Hp and Gp the aliasing can be canceled. • In fact they can be chosen so that I = I . • Filters Hi and Gi must be length N = a multiple of 2. • Repeat along columns 396 Dyadic Sampling 1D Down-sampler I(i) 2 J(i) J(i) = I(2i) Throw away every other sample = (1/2) I(u) I(u+N/2) J(u) for 0 u (N/2)-1 (length N/2) 1D Up-sampler I(i) 2 L(i) I(i/2) ; i = 2k; k an integer L(i) = 0 ; else Insert a zero after every sample = I(u) L(u) N for 0 u 2N-1 (length 2N) 397 Perfect Reconstruction Filters (1D) • If I = I then H0, H1 and G0, G1 are proper wavelet filters. • They obey the perfect reconstruction property (u) + H (u) = 2 for all 0 u N-1 (u)G (u)G H 0 0 1 1 and (u) + H (u) = 0 for all 0 u N-1 (u+N/2)G (u+N/2)G H 0 0 1 1 • Perfect reconstruction is important for many applications: when all the image information is needed. 398 PR Wavelet Filters (1D) (u) where H (N/2) = 0 • Given the LP analysis filter H0(i) H 0 0 (zero at highest frequency) • Define the LP synthesis filter G0(i) = H0(N-1-i) hence (u) = H (N - u) G 0 0 • Define the HP analysis filter H1(i) = (-1)iH0(N-1-i) hence (u) = H (N/2 - u) H 1 0 • And the HP synthesis filter G1(i) = (-1)iH0(i) hence (u) = H (u - N/2) G 1 0 • The synthesis filters are reversed versions of the analysis filters (hence mirror filters). The HP filters are frequencyshifted (quadrature) versions of the LP filters. 399 Perfect Reconstruction Condition • In this case the perfect reconstruction condition reduces to: 2 2 H 0 (u) + H1 (u) = 2 for all 0 u N-1 or 2 2 H 0 (u) + H 0 (N/2-u) = 2 for all 0 u N-1 400 Comments • Can show that the PR condition on H0 implies that the DWT expansion has an orthogonal basis. • Finite-length discrete PR filters exist: most notably the (lowpass) Daubechies filters, which are maximally smooth at w = p. • However, the wavelet filters in orthogonal DWT cannot be even symmetric about their spatial center (i.e., not linear or zero phase) • This advantage is regained using bi-orthogonal wavelets (later). 401 PR Wavelet Filters • There are an infinite number of wavelet filters. • We are interested in finite-length, PR, discrete wavelets. 
• Can start with any low-pass filter with (0) = 2 and H (N/2) = 0 H 0 0 • Example: (Haar 2-tap filter) 1/ 2 ; i = 0 h 0 (i) = 1/ 2 ; i =1 0; else 1/ 2 ; i = 0 h1 (i) = -1/ 2 ; i =1 0; else 402 Subband Filters • Daubechies 4-tap (D4) filter: 1+ 3+ 1 h 0 (n) = 34 2 10 3 ; i =0 H 0 (w) H1 (w) G 0 (w) G1 (w) 3 ; i =1 3 ; i=2 3 ; i=3 ; else DSFTs Ingrid Daubechies Daubechies orthonormal wavelets are popular owing to their nice function approximation properties (makes them good for compression, denoising, interpolation, etc) 403 Daubechies 4 Wavelet (D4) Filters Analysis LP Synthesis LP Analysis HP Synthesis HP (impulse responses) 404 Daubechies 8 Wavelet (D8) Filters Analysis LP Synthesis LP Analysis HP Synthesis HP (impulse responses) 405 Daubechies 16 Wavelet (D16) Filters Analysis LP Synthesis LP Analysis HP Synthesis HP (impulse responses) 406 Daubechies 32 Wavelet (D32) Filters Analysis LP Synthesis LP Analysis HP Synthesis HP (impulse responses) 407 Daubechies D40 Wavelet (D40) Filters Analysis LP Synthesis LP Analysis HP Synthesis HP (impulse responses) 408 2-D Wavelet Decompositions • 2-D analysis filters (implemented separably) decompose images into high- and low-frequency bands. • The information in each band can be analyzed separately. (0, 0) Low, Low High, Low Low, High High, High 409 Multi-Band 1D Discrete Wavelet Transform • Filter outputs are the wavelet coefficients. Low • Subsampling each filter output yields exactly NM nonredundant DWT coefficients. The image can be exactly reconstructed from them. • • • Sub-sampling heavier at lower frequencies. Multiple bands (> 2) created by iterated filtering on the lowfrequency bands. Apply to both the rows and columns for 2D NxM Image I Low High H0 H1 2 2 Low High H0 H1 2 2 High H0 H1 2 2 410 1D Wavelet Packet Transform Image I Low Low High H0 H1 2 2 Low High Low High H0 H1 H0 H1 2 2 2 2 High Low High Low High Low High H0 H1 H0 H1 H0 H1 H0 H1 2 2 2 2 2 2 2 2 411 1-D Hierarchies Discrete Wavelet Transform (octave bandwidths) Wavelet Packet Decomposition (linear bandwidths) 412 Linear vs Octave Bands • A bandpass filter with upper and lower bandlimits uhigh and ulow : u low uhigh • Linear bandwidth is Blinear = uhigh - ulow (cy/image) • Octave bandwidth (octaves) is Boctave = log2(uhigh) – log2(ulow) • A one-octave filter has uhigh = 2ulow. • The DWT hierarchy gives a filter bank with constant Boctave 413 2-D Wavelet Decompositions • Frequency division usually done iteratively on the lowfrequency bands. • Two examples and the DWPT: (0, 0) (0, 0) (0, 0) LLHL HL LLLH LLHH LH HH "Pyramid" Wavelet Transform "Tree-Structured" Wavelet Transform Wavelet Packet Transform 414 Pyramid Wavelet Decomposition Image Two-Level Wavelet Decomposition 415 1D Inverse DWT 2 2 G0 G1 S 2 2 G0 G1 S 2 2 G0 G1 S I 416 1D Inverse DWPT 2 2 2 2 2 2 2 2 G0 G1 G0 G1 G0 G1 G0 G1 S S S S 2 2 2 2 G0 G1 G0 G1 S S 2 2 G0 G1 S I 417 Frequency Analyzer Filter Banks 418 Frequency Analyzer Construction • Ignore sub-sampling / perfect reconstruction (can still do both) • Start with a lowpass filter H0(n) where (0) 1 and H (N/2) 0 H 0 0 • Form cosine-modulated versions: Gp(n) = H0(n)cos(2pupn/N) • Then 1 u + u G p (u) H 0 u - u p H 0 p 2 are bandpass filters. 419 Frequency Analyzers • There are many ways to choose the bandpass filters. 
• Things to decide include – – – – – Filter shapes Filter center frequencies Filter bandwidths Filter separation Filter orientations 420 Filter Shapes • Infinite possibilities • For image analysis the minimum-uncertainty gaussian shape is often desirable. • A 2-D bandpass filter with (shifted) gaussian shape: 2 2 2 2 2 2 2 ps u u v v 2 ps u u v v p p p p 1 (u, v) e G e p 2 G(i, j) G(u, v) and thus cosine-modulated in space: up vp 1 i2 j2 /2 s2 G p (i, j) e cos 2p i 2 2ps M N j • This is called a Gabor filter. We’ll see many of these later! 421 Filter Center Frequencies and Bandwidths • Option 1: – Constant (linear) filter separation – Constant linear bandwidths • Option 2: – Constant logarithmic (octave) filter separation – Constant logarithmic (octave) bandwidths 422 Octave Filter Bank • Given a 1st bandpass filter G1(i) = H(i)cos(u1i) with c.f. u1 = (u1,hi + u1,lo)/2 • For p > 1, define Gp+1(i) = H(i/2) cos(2upi). Its center frequency is 2up and lower / upper bandlimits up+1,hi = 2up,hi and up+1,lo = 2up,lo • The (octave) bandwidths are the same: log2(2up,high/2up,low) = log2(up,high /up,low). 2up,lo up,lo up up,hi 2up,hi up+1 = 2up 423 Filter Separation • The basic idea is to cover frequency space. • With filter responses that are large enough – e.g., at half-peak amplitude* or greater. • Adjacent filters can be defined to intersect at half-peak. This ensures high SNR. (Note special handling of “baseband” filter) *Half-peak bandwidth is a common image processing bandwidth measure (as opposed to half-power / 3dB BW). 424 Half-Peak Intersection Example • Consider a Gabor bandpass filter: 2 2 G p (u) exp 2 ps u u p and adjacent filter 1 octave away: 2 2 G p 1 (u) exp 2 ps / 2 u 2u p • Exercise: Show that these filters intersect at half-peak if up 3 ln 2 2 ps 2 • Also show that these filters have 1 octave (half peak) BW 425 Filter Orientations • The tesselation of the 2-D frequency plane involves sets of filters having equal radial c.f.s and equal orientations. • The filters typically have equal radial and orientation BWs. • Adjacent filters can again intersect at half-peak. 426 Constant Octave Gabor Filterbank • • • • • Real-valued filters 1 octave filters Half-peak intersection 9 orientations 4 scales • • • • • • Complex-valued filters 1 octave filters Half-peak intersection 8 orientations 5 scales “Extra” filters at corners 427 Convolutional Neural Networks or ConvNets or CNNs 428 ConvNets • Actually an old idea from the 1990s generally attributed to Yann Lecun. • The simple idea is to limit the number of inputs from prior layers that each neuron receives. • This creates layers that are only partially connected instead of fully connected. 429 Revisit the Multilayer Perceptron • Recall the basic MLP from earlier with a single hidden layer: i(1) i(2) input i i(3) ··· ··· outputs y i(P) input layer weights w2 output weights layer R f w q (r)i q (r) r 1 w1 • But let’s feed it image pixels, instead of “handcrafted” features. • For a small 1024 x 1024 (P = 220 pixels) image, the number of weights w2 in the hidden layer is PQ, where Q is the number of nodes in the hidden layer. If P = Q, that’s 240 = 1.1 x 1012 1.1 Trillion weights to optimize via backprop!!! 430 Convolutional Layer Idea • Suppose we only feed a few image pixels to each node: 0 i(1) i(2) input i i(3) 0 i(P) input layer weights w2 ··· ··· outputs y output weights layer • Notice the endpoint zero padding. • The nodes are identical. • Same number of nodes as inputs. • The weight set used for the hidden layer is called the filter. 
w1 • In fact suppose we only feed 3 pixel values to each hidden node. Then if P = Q there are only 3P weights to optimize. • For the same image that’s 3 x 220 3.1 Million … lots better but still many! • However, in a convolution layer, each group of weights are constrained to be identical. So in this example there are only 3 different weights (+1 bias = 4). 431 Stride Length • Don’t have to have as many filters as inputs. Can skip by a fixed stride: 0 i(1) i(2) input i i(3) 0 i(P) input layer weights w2 ··· ··· outputs y output weights layer w1 • Effectively subsampling the full output. Without subsampling, stride = 1. In the above, stride = 2. Stride can be larger to reduce computation. • As before each node includes a nonlinear activation function. 432 2-D Convolutional “Receptive Field” • Every node in hidden layer receives inputs from a local neighborhood. • These are weighted and summed. • The weight set (filter) is identical for each node. • If stride = 1 identical to 2D convolution with the filter. Input could be - image pixel values Similar to feed-forward from - retinal sensors - neurons in cortex to other neurons. CNNs often called “biologically inspired” 433 Stride of 2 (Most Common) • In a 2-D image, if stride = S, then computation is divided by S2. • Usually best to retain full resolution in the early layers (we will soon stack these!) to extract as much low level feature information as possible. 434 Simpler Diagram M m n N Image Filter Feature Map Sum & Activation Output (next layer) • Output called a feature map. Fed to the next layer. • When stacked, early layers extract low-level features. • Zero padding the input ensures the feature map will also be N x M (stride = 1) 435 Feature Maps • A convolutional layer produced by a fixed filter only contains one feature type – say oriented in some direction, at some scale. • The network may “decide” to learn a specific 2D bandpass filter (and in fact, they do!). • One feature map not enough! • Instead generate multiple (Q) feature maps! • Here Q = 6 feature maps. • Each generated with a different learned filter. • This multiplies the computation. • Also multiplies the data volume. • But a lot more info is obtained! M m N n Q filters image sums & activations Q feature maps 436 Stacking Layers • For multiple layers simplify the notation, omitting flow lines, filters, and sum/activation nodes. • Example Network: We stacked a multiple convolutional layers (each with multiple filters and activations). We have also added some things. Let’s explain. N N 4Q 16 16 N N 4Q 4 4 N N 2Q 2 2 image NxN(x3) N N 4Q 8 8 convolution + activation N N Q pooling 1 1 2 1 1 K N Q 64 N N 4Q 32 32 K outputs fully connected + activation softmax 437 Multiple Channels • The input is color, so has 3 color channels (such as RGB). • Hence each filter does also: if a filter is P x P, then it is really a 3 x P x P filter. • The diagram remains the same, but the computation is threefold. N N 4Q 16 16 N N 4Q 4 4 N N 2Q 2 2 image NxN(x3) N N 4Q 8 8 convolution + activation 1 1 2 1 1 K N Q 64 N N 4Q 32 32 K outputs fully connected + activation N N Q pooling softmax 438 Stacked Convolutions • Stacking convolutions allows for small filter kernel (weights). • Since repeated layers “reach” further to become less local. • Similar to repeated linear filtering – except each layer has an activation function. image NxNx3 N N Q N N Q 439 Activation Functions • Classic sigmoid functions can be used – but for modern Convnets / CNNs, other activation functions are more effective. 
• Training multi-layered ConvNets requires backprop, which uses gradient descent. • Convergence problems arise, like the vanishing gradient problem. If the gradient approaches zero, convergence slows or halts! 440 Activation Functions • The most popular is “ReLu,” or Rectified Linear Unit: 0 ; x 0 f(x) x ; x 0 0 ; x 0 f(x) 1 ; x 0 • These A.F.’s can speed up training many-fold. • Vanishing gradient reduced by not limiting the range (e.g., to [0, 1]) • Another popular one is the Exponential Linear Unit (ELU): a(e x 1) ; x 0 f(x) x;x0 f(x) a ; x 0 f(x) 1; x 0 441 More Activation Functions 442 Pooling Layers • Another way to reduce network complexity that is often used. • Another way to expand the influence of each filter’s receptive field across the image. • Appropriate after the earliest layers when the network focus is no longer on extracting local low-level features. • Instead, the network begins to build higher-level abstractions. 443 Max and Average Pooling • Simple: Partition the most recent feature map(s) into non-overlapping P x P blocks (Stride = P). Usually P = 2. • Max-Pool: Replace each P x P block with a single value max{block}: 12 20 30 4 8 4 12 2 2x2 Max-Pool 20 30 96 37 34 70 37 14 Tends to propagate sharp, sparse features, like edges and contours. 88 96 25 12 • Ave-Pool: Replace each P x P block with a single value ave{block}: 12 20 30 4 8 4 12 2 34 70 37 14 88 96 25 12 2x2 Ave-Pool 13 10 72 22 Tends to propagate a similar, smooth version of the prior layer. • Both help “over-fitting” by providing an abstracted form of the representation, and reducing positional dependence of the representation (translation invariance). 444 Flattening and Fully Connected Layers • The final stages of a CNN are usually fully connected networks or FCNs (the same MLP’s as earlier) are applied. • Since the data has been down-sampled and abstracted, it is manageable with today’s hardware. 1 1 N N 4Q 32 32 2 1 1 K N Q 64 K outputs • The output of the previous layer is first flattened into a 1-D vector, to format it as input to the FCN. Everything is connected to everything, so nothing (such as spatiality) is lost. • The image features have been converted into higher-level abstractions, on which the FC network performs the final inferencing. 445 Softmax • The final FC network generates K outputs y = {y(k); k = 1, …, K}, corresponding to K classes in the problem to be solved. • The softmax (or softargmax*) function converts these into probabilities by mapping K non-normalized outputs of the last FC network to a probability distribution over the K predicted output classes. 1 1 K K outputs softmax • If for image i(j) (train or test) the outputs are yk(j), k = 1, …, K, the softmax outputs are e yk ( j) p k ( j) M m 1 e y m ( j) for k 1, ..., K. • These sum to one and may be thought of as probabilities. *Suppose y(k0) = max{y(k); k = 1, …, K}. Then arg max{y(k); k = 1, …, K} = {dk-k0); k = 1, …, K} = {0, 0, 0, …., 0, 1, 0, …., 0}. 446 Cross Entropy Loss Function • Assume multi-class classification: Given a set of known ground truth training labels T = {tj; 1 < j < J} of input training images I = {i(j); 1 < j < J}. • K = 1 is the simplest, e.g., face / no face. • Each label takes one of K values tj {1, …, K} (multi-class). For example, in images these could index {dog, cat, bird …}. • Let the output (predicted class probabilities) of the J training images i(j) be given by pk(j), k = 1, …, K. 
• Then the cross-entropy loss or log-loss between ground truth and predictions is given by 1 J K CE δ( j, k) log p k ( j) J j1 k 1 • A good way to generate the probabilities pk(j) is softmax. 447 VGG-16 • You have learned all of the elements of a very famous deep learning network or DLN called VGG-16. • Designed for 1000-object classification, has done well in many image analysis contests and remains very popular. • Many variations and deeper ones, but these are the original “hyperparameters”: 1 11000 14 14 512 1 1 4096 56 56 256 image 224 224 64 224x224(x3) • 7 7 512 28 28 512 112 112 128 In VVG nets, the filter size is 3x3 1000 outputs convolution + activation fully connected + activation softmax pooling 448 Comments • Later we will use filter banks and ConvNets extensively … onward to Module 5. 449 Module 5 Image Denoising & Deep Learning Median, Bilateral, Non-local Mean Filters Wavelet Soft Thresholding BM3D Deep Learning, Transfer Learning Autoencoders, Denoising Networks QUICK INDEX 450 MEDIAN FILTER • The median filter is a nonlinear filtering device that is related to the binary majority filter (Module 2). • Despite being “automatic” (no design), it is very effective for image denoising. • Despite its simplicity, it has an interesting theory that justifies its use. 451 Median Filter • Given image I and window B, the median filtered image is: J = MED[I, B] • Each output is the median of the windowed set: J(i, j) = MED{BI(i, j)} = MED{I(i-m, j-n); (m, n) B} 452 Properties of the Median Filter • The median filter smooths additive white noise. • The median filter tends not to degrade edges. • The median filter is particularly effective for removing large-amplitude noise impulses. DEMO 453 BILATERAL FILTER 454 BILATERAL FILTER • Another principled approach to image denoising. • It modifies linear filtering. • Like the median filter, it seeks to smooth while retaining (edge) structures. 455 Bilateral Filter Concept • Observation: 2-D linear filtering is a method of weighting by spatial distances [i = (i, j), m = (m, n)]: J G (i ) = I(m) G m i m • If linear filter G is isotropic, e.g., a 2-D gaussian LPF* G(i, j) K G exp i 2 j2 / sG2 then m i m-i + n-j 2 2 is Euclidean distance. *The value of KG is such that G has unit volume. 456 Bilateral Filter Concept • New idea: Instead of weighting by spatial distance, weight by luminance similarity J H (i ) = I(m) H I m I i m • NOT a linear filter operation. H could be a 1-D gaussian weighting function* H(i) K H exp i 2 / sH2 • Not very useful by itself. *The value of KH is such that H has unit area. 457 Bilateral Filter Definition • Combine spatial distance weighting with luminance similarity weighting: J(i ) = I(m) G m i H I m I i m • Computation is straightforward and about twice as expensive as just spatial linear filtering. • However, this can be sped up by pre-computing all values of H for every pair of luminances, and storing them in a look-up table (LUT). 458 Example • Noisy edge filtered by bilateral filter (Tomasi, ICCV 1998). • Uses gaussian LPF and gaussian similarity Before Bilateral weighting 2 pixels to right of edge After 459 Color Example • Can extend to color using 3-D similarity weighting. • If I = [R, G, B], then use H[R(m)-R(i), G(m)-G(i), B(m)-B(i)] where H(i, j, k) K exp i 2 j2 k 2 / s2 H Before H After 460 Comments on Bilateral Filter • Similar to median filter, emphasizes luminances close in value to current pixel. • No design basis, so best used with care, e.g., retouching or low-level noise smoothing. 
NON-LOCAL (NL) MEANS 462

NL Means Concept
• Idea: Estimate the current pixel luminance as a weighted average of all pixel luminances.
• The weights are decided by neighborhood luminance similarity.
• The weight changes from pixel to pixel.
• In all the following, i = (i, j), m = (m, n), p = (p, q).
• The method is expensive but effective. 463

NL Means Concept
• Given a window B, compute the luminance similarity of every windowed set $B_I(\mathbf{i})$ with every other windowed set $B_I(\mathbf{m})$ in the image:
  $W(\mathbf{m}, \mathbf{i}) = K_W \exp\left[-\|B_I(\mathbf{i}) - B_I(\mathbf{m})\|^2 / s_W^2\right]$
  where
  $\|B_I(\mathbf{i}) - B_I(\mathbf{m})\|^2 = \sum_{\mathbf{p} \in B} [I(\mathbf{i} - \mathbf{p}) - I(\mathbf{m} - \mathbf{p})]^2 = \sum_{(p,q) \in B} [I(i-p, j-q) - I(m-p, n-q)]^2$ 464

[Figure: a reference window $B_I(\mathbf{i})$ compared across the image; some windows are highly similar, some moderately similar, one highly dissimilar.] 465

NL Means Definition
• Given a window B, and the weighting function W just defined, the NL-means filtered image is (see the code sketch below)
  $J(\mathbf{i}) = \sum_{\mathbf{m}} W(\mathbf{m}, \mathbf{i})\, I(\mathbf{m})$
• Thus each pixel value is replaced by a weighted sum of all luminances, where the weight increases with neighborhood similarity. 466

Example [Figure: before, after, difference.] 467
Example [Figure: before, after, difference.] 468

Comments
• NL-means is very similar in spirit to frame averaging.
• The main drawback is the large computation / search.
• It can be modified to compare/search less of the image.
• It works best when the image contains a lot of redundancy (periodicity, large smooth regions, textures).
• It can fail at unique image patches. 469

Comparison [Figure: AWGN; gaussian filtered; bilateral filtered; NL-means.] 470

BM3D 471

Overview
• A complicated algorithm – too much to cover! Uses many of the ideas we have covered.
• Uses the concepts of NL-means, sparse coding, and more.
• The image is broken into blocks; the blocks are then matched by similarity into groups.
• Estimates are collaboratively formed for each block within each group (collaborative filtering).
• This is done by a transformation of the group and a shrinkage process.
• Perhaps the best denoising algorithm extant.
Dabov et al., "Image denoising by sparse 3D transform-domain collaborative filtering," IEEE Trans. on Image Processing, Aug. 2007. 472

Examples Video 473
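Returning to the NL-means filter defined a few slides back, here is a tiny brute-force NumPy sketch (our own illustration; real implementations restrict the search window rather than comparing all pixel pairs):

```python
import numpy as np

def nl_means(I, patch=3, sW=40.0):
    # J(i) = sum_m W(m, i) I(m), with
    # W(m, i) proportional to exp(-||B_I(i) - B_I(m)||^2 / sW^2),
    # normalized so the weights at each pixel sum to 1 (the K_W factor).
    # Brute force O(N^2) over all pixel pairs -- fine only for tiny images.
    r = patch // 2
    Ip = np.pad(I.astype(float), r, mode='reflect')
    H_, W_ = I.shape
    # Collect the patch (windowed set) around every pixel
    patches = np.array([Ip[i:i + patch, j:j + patch].ravel()
                        for i in range(H_) for j in range(W_)])
    # Pairwise squared patch distances ||B_I(i) - B_I(m)||^2
    d2 = ((patches[:, None, :] - patches[None, :, :])**2).sum(-1)
    Wgt = np.exp(-d2 / sW**2)
    Wgt /= Wgt.sum(axis=1, keepdims=True)
    return (Wgt @ I.ravel().astype(float)).reshape(H_, W_)

rng = np.random.default_rng(1)
clean = np.tile(np.linspace(50, 200, 16), (16, 1))   # tiny redundant image
noisy = clean + rng.normal(0, 8, clean.shape)
# Denoised error should fall below the noisy error, since many patches repeat
print(np.abs(nl_means(noisy) - clean).mean(), np.abs(noisy - clean).mean())
```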
Deep Learning Networks 474

Deep Learning
• We have already arrived at Deep Learning! VGG-16 is a classical Deep Learning neural net.
• Once it was regarded as "very deep," but not so now.
• Notice in the diagram that there are no "handcrafted" features extracted to feed the network. It is an "end-to-end" design, meaning just pixels are fed in at one end, and the result comes out at the other.
• Deep Learning Networks (DLNs) are just ConvNets, which have been known about since the 1980s. So why the "hubbub"? 475

ImageNet Contest
• In the early 2000s, neural nets were still shallow (too much computation) and still fed just a few extracted features.
• This changed when Geoffrey Hinton and his students Alex Krizhevsky and Ilya Sutskever published a paper entitled "ImageNet Classification with Deep Convolutional Neural Networks."
• In 2012 they showed that an end-to-end 8-layer convnet (5 CNN and 3 FC layers) significantly outperformed all computer vision image classifiers on the standard ultimate testbed, "ImageNet."
• ImageNet (at the time) contained 15 million human-labeled* images having >20,000 class labels. They trained and tested on a subset of 1.2 million images having 1000 class labels.
• They accomplished this by adopting highly efficient Graphics Processing Units (GPUs), originally designed for graphics/image manipulation.
*Labeled using the Amazon Mechanical Turk crowdsource platform. 476

AlexNet
• Here is the DLN they used (from their paper):
[Figure: AlexNet — input 224x224(x3); conv feature maps 55x55x96, 27x27x256, 13x13x384, 13x13x384, 13x13x256; then FC layers of 4096 and 4096, and the 1000-way output.]
• Details:
  – 650,000 neurons and 60 million parameters
  – ReLU activations (much faster training)
  – Stride of 4 at the input (input subsampling)
  – Max-pooling after layers 1, 2 and 5
  – Response normalization to [0, 1] after layers 1, 2, and 5
  – A new method called "Drop-Out" to avoid overfitting
  – 1000-way softmax at the output
  – "Data augmented" by also training on randomly shifted & horizontally flipped images 477

Overfitting
• A great problem is the lack of data needed to train large networks while avoiding overfitting.
• Overfitting occurs when a network learns a model that fits too closely to the training data, and therefore does not generalize well to new data.
• Usually from too little data (vs. the # of network parameters), or from training for too long.
• Some methods of handling it:
  – Data augmentation
  – Early stopping
  – Dropout
  – Batch normalization
  – Regularization
[Figure: blue and red are training data in two classes. The green curve is overfitted; the black curve is "regularized" to avoid that.] 478

Handling Overfitting
• There is no simple rule of thumb regarding how much data is needed to train a network with P parameters (although it is sometimes said to be at least 1/10th the # of parameters).
• Data augmentation: Increasing the amount of data artificially (when enough real data is not available). This involves reusing images by flipping, rotating, scaling, and adding noise to create "new" images. Care is needed, but it works!
• Early stopping: A lot of theory, but usually empirical / ad hoc in application. Typically, simply stop the training when some measure of convergence is observed.
• Dropout: Idea: Neighboring neurons rely on each other too much, forming complex co-adaptations that can overfit, especially in parameter-heavy FC layers. Method: For each input, randomly fix P% of the neurons to zero. In AlexNet, P = 50. 479

Batch Normalization
• Batch normalization extends the idea of input normalization, which we discussed briefly in the context of MLPs. (A code sketch follows a few slides below.)
• Suppose the input is $i_0(1), i_0(2), …, i_0(P)$. Then let
  $\mu_0 = \dfrac{1}{P}\sum_{p=1}^{P} i_0(p)$   and   $\sigma_0^2 = \dfrac{1}{P}\sum_{p=1}^{P} [i_0(p) - \mu_0]^2$
• Input normalization is training and testing on
  $\hat{i}_0(p) = \dfrac{i_0(p) - \mu_0}{\sqrt{\sigma_0^2 + \varepsilon}}$
• In batch normalization, this is instead applied
  – At other (possibly all) network layer inputs
  – Normalizing over all the data (batch) or over subsets (mini-batches)
• Many benefits:
  – Improves convergence
  – Reduces overfitting / improves generalization
  – Greatly reduces the vanishing gradient problem
  – Often dropout is not needed 480

Global Average Pooling (GAP)
• Much of overfitting occurs in the parameter-heavy FC layers. GAP replaces them in a simple way: by globally averaging each final feature map.
• These averaged responses are then input to softmax.
• Greatly reduces overfitting while also greatly reducing model size.
[Figure: a VGG-like network with the FC layers replaced by GAP feeding softmax.] 481

Deep Image Regression Networks 482

Classification vs Regression
• Computer vision models (a big early application of DLNs) often classify (what class of object does this image contain?).
• Image processing models often create image-sized results like denoised images, compressed images, quality maps, and so on.
• Many of these are "image-to-image" transformations.
• This generally involves regression instead of classification.
• In regression, the goal is to predict numerical values (better pixels, distances/ranges, quality, etc.) from an image input (RGB, luminance, YUV, etc.). 483
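A minimal sketch of batch normalization and global average pooling as defined above (our own illustration; the learnable scale-and-shift parameters that trained BN layers add are omitted for simplicity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize a batch of layer inputs (batch dimension first) to zero
    # mean and unit variance per feature: (x - mu) / sqrt(var + eps)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def global_average_pool(fmaps):
    # GAP: collapse each of the C final H x W feature maps to one number
    # by averaging over the spatial dimensions; these C values feed softmax
    return fmaps.mean(axis=(1, 2))

rng = np.random.default_rng(0)
batch = rng.normal(5.0, 3.0, size=(64, 10))    # mini-batch of activations
xn = batch_norm(batch)
print(xn.mean(axis=0).round(6), xn.std(axis=0).round(3))   # ~0 and ~1
print(global_average_pool(rng.random((512, 7, 7))).shape)  # (512,)
```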
General Image Regression
• Assume a large collection of "before" and "after" images. Call these I = {I₁, I₂, …, I_N} and J = {J₁, J₂, …, J_N}. N could be in the millions.
• Most often each "after" image $J_n$ is a changed version of a "before" image $I_n$; the "before" images are assumed to have a desirable appearance.
• The "after" images are the result of processing (intentional or otherwise), such as:
  – Noise added
  – Compression by an algorithm
  – Blur occurring
  – Smoke, fog, or dust appearing
  – Pieces of the image being lost or cropped
  – The image being dark, or night falling
  – Color being lost or distorted
• The "after" images can be other things like distance, quality, or style, but this requires other inputs also (later). 484

Prediction
• Basic idea: Predict the original image (remove noise, blur, smoke, compression) using a DLN.
• How: Train the DLN on many examples of imperfect images, using the original images as labels. Simplest: pixelwise labels.
• Result: The DLN (hopefully) learns to predict the labels (the desired changed images). Not much thinking, but also not so easy!
• Since the result is of the same size as the original images, little or no downsampling is used.
• Example: [Figure] 485

Loss Functions
• Given "before" and "after" (distortion or change) images I = {I₁, I₂, …, I_P} and J = {J₁, J₂, …, J_P} as before.
• The training network produces predictions $F(J_n)$ of $I_n$. Optimizing the network requires minimizing a loss like the MSE (or L2 loss):
  $\varepsilon_2[F(J_n), I_n] = \|F(J_n) - I_n\|_2^2 = \sum_{i=1}^{N}\sum_{j=1}^{M} [F(J_n)(i, j) - I_n(i, j)]^2$
  or the MAE (L1 loss):
  $\varepsilon_1[F(J_n), I_n] = \|F(J_n) - I_n\|_1 = \sum_{i=1}^{N}\sum_{j=1}^{M} |F(J_n)(i, j) - I_n(i, j)|$
• Both are common.
  – L2 is more sensitive to "outliers" (wrong labels), while L1 is robust against them.
  – L2 optimization is more stable and yields unique solutions. L1 is less stable, and non-unique.
  – L1 optimization often is sparse, yielding weights or "codes" that are mostly zeros.
• When the appearance of the result matters, the Structural Similarity Index (SSIM), which is differentiable, is often used: $SSIM[F(J_n), I_n]$. 486

Residual Nets?
• Again assume "before" and "after" images I = {I₁, …, I_N} and J = {J₁, …, J_N}. A "residual network" is a simple and intuitive idea.
• Suppose: Instead of learning predictions $F(J_n)$ of $I_n$, instead learn to predict the residuals $R(J_n) \approx J_n - I_n$ between the "after" and "before" images. The training residuals are the labels.
• The loss would be
  $\text{RES-MSE:}\quad \varepsilon_R[R(J_n)] = \sum_{i=1}^{N}\sum_{j=1}^{M} \{R(J_n)(i, j) - [J_n(i, j) - I_n(i, j)]\}^2$
• Residuals are simpler, lower entropy (clustered around 0), and easier to predict.
• Application Phase: Oops! The network is trained to predict residuals, which we do not have! 487

ResNet
• There is a way to apply the idea! Let the network be trained on the same input ($J_n$) and output ($I_n$) training data. The goal is still to produce a prediction $F(J_n) \approx I_n$, using a loss $\|F(J_n) - I_n\|$.
• This network learns the same mapping:
[Figure adapted from the famous paper "Deep Residual Learning for Image Recognition" by He, Zhang, Ren, and Sun.]
• However, the weighting layers instead actually learn the residuals
  $G(J_n) = F(J_n) - J_n \approx I_n - J_n$
• The secret: a "skip connection" whereby $J_n$ bypasses one or more layers, then is added back before the activation.
• This can be done every few layers. 488

34-layer networks (standard and ResNet) [Figure] 489

ResNet
• ResNet is an extremely powerful and popular idea! (A sketch of a residual block follows below.)
• One of the most-cited papers ever!
• However, in practice it is used differently, for even better results.
• The residual idea can be applied throughout the network (every few layers) to powerfully reduce overfitting and improve convergence.
• The ResNet idea is used in other network designs that use different connectivities, scales, and skip connections, such as DenseNet and GoogLeNet/Inception. These keep evolving. 490
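A toy sketch of the skip-connection idea (our own illustration, using dense layers instead of convolutions; the structure — add the input back before the final activation — is the point):

```python
import numpy as np

def layer(x, W, b):
    # One "weight layer": linear map followed by ReLU activation
    return np.maximum(0.0, W @ x + b)

def residual_block(x, W1, b1, W2, b2):
    # Two weight layers compute G(x); the skip connection adds x back
    # before the final activation, so the block outputs relu(x + G(x)).
    # The layers therefore only need to learn the residual G(x), which is
    # near zero whenever the identity mapping is already a good answer.
    g = W2 @ layer(x, W1, b1) + b2
    return np.maximum(0.0, x + g)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
# Near-zero weights: a plain block would destroy x, but the residual
# block passes it through almost unchanged -- easy gradient flow
W = rng.normal(scale=1e-3, size=(d, d))
b = np.zeros(d)
print(np.abs(residual_block(x, W, b, W, b) - np.maximum(0, x)).max())
```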
ResNet
• ResNet has made possible very, very deep networks (hundreds of layers deep), which were not possible before because of the degradation problem.
• The performance of standard networks decreases at great depth, because of vanishing gradients, poor convergence, and worse accuracy.
• Developers now routinely use "ResNet-50," "ResNet-101," and "ResNet-152," for example.
• ResNet models have won nearly every image detection, recognition, segmentation, classification, and regression contest! 491

Transfer Learning 492

Transfer Learning
• The biggest problem in deep learning is obtaining enough data. For very deep networks, even parameter-efficient ones, very large datasets are needed, which are often not available.
• As we know, DLNs can be trained end-to-end without feature computation. Only pixels need be fed, if the network is large enough and the data volume is large enough. However, often there is not nearly enough data.
• However, a property of DLNs trained for image processing is their remarkable generality. This is very useful when there is not enough data.
• The learned features can be remarkably generic! One can often use a trained network (with its learned feature outputs) to conduct another visual task.
• This might involve "fine tuning" the "pre-trained" network and/or adding additional layers (or an SVC/SVR) that are subsequently trained.
• Called TRANSFER LEARNING. 493

Transfer Learning (Fine Tuning)
[Figure: a DLN is pre-trained on a source task with very large data (e.g., ImageNet) + labels; the pretrained DLN (knowledge) is then fine-tuned on a target task with much less data + new labels, producing predictions.] 494

Transfer Learning (New Layers)
[Figure: as above, but the pretrained DLN is extended with new layers / an SVR, which are trained on the target task's new labels, producing predictions.] 495

Transfer Learning
• Transfer learning is one of the most common ways to address most image processing / analysis tasks.
• Most of these tasks have much more limited data than ImageNet.
• Typically a network designer will begin with an AlexNet, VGG-16, a ResNet-20/-50/-101/-152, Inception/GoogLeNet, or another large network.
• Some of these now have 1000+ layers!
• We will often see this in application examples. 496

DLN Training Environments 497

DLN Training Environments
• There are many programming environments for training DLNs.
• Some of these have become quite popular. These include:
  – Tensorflow
  – Caffe
  – Keras
  – Pytorch
  – Matlab
• Currently, Facebook's Pytorch is favored amongst DLN R&D engineers. 498

Convolutional Autoencoders 499

Autoencoders
• An autoencoder is a special variety of neural network (or DLN).
• It is generally unsupervised!
• Autoencoders have the same architecture as ordinary MLPs or DLNs, but they can be divided into two parts.
• The first part is like an ordinary network: multiple layers of down-sampling to achieve an efficient representation requiring much less data / dimensionality.
• The second part follows with multiple layers of up-sampling – generally mirroring the front stage of layers.
• The point of transition where efficiency is achieved is the "bottleneck."
• In our world, the output is an image that is intended to be close to, or look like, the original. Hence the loss function measures this (MSE, SSIM, etc.). 500

Convolutional Autoencoder
[Figure: input image → encoder (convolution + activation, pooling) → bottleneck: the code or latent representation ("Transformer") → decoder (convolution + activation, upsampling) → reconstructed image.] 501

Convolutional Autoencoders
• The basic idea is to learn an efficient, compact representation of the structure of an image, from which it can be "reconstructed."
• It's not just the spatial dimensions. In fact, down/up-sampling might be limited or omitted. The number of filters also matters, i.e., the size of the whole feature space at the bottleneck.
• Autoencoders can be:
  – Overcomplete: the number of features in the code exceeds the input data size
  – Complete: the number of features in the code is the same as the input data size
  – Undercomplete: the number of features in the code is less than the input data size
• All three are very useful for different tasks! However, without further constraints the first two might learn the identity function (not so useful).
• All can be useful with additional constraints on the code.
• For efficiency, undercomplete autoencoders are of greater interest. 502

Up Sampling
• Methods of up-sampling include (see the code sketch below):
  – Nearest neighbor (pixel replication)
  – Max unpooling (requires recalling where the corresponding max-pool value came from)
  – Interpolation (bilinear, bicubic, learned)
• Nearest-neighbor example:

  1 2      Nearest Neighbor      1 1 2 2
  3 4      Unpooling             1 1 2 2
           ------------>         3 3 4 4
                                 3 3 4 4    503
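A minimal sketch of the first two up-sampling methods (our own illustration; the unpooling mask below is a toy stand-in for the argmax locations a real max-pool layer would remember):

```python
import numpy as np

def nn_upsample(x, P=2):
    # Nearest-neighbor upsampling: replicate each pixel into a P x P block
    return np.repeat(np.repeat(x, P, axis=0), P, axis=1)

def max_unpool(pooled, argmax_mask, P=2):
    # Max unpooling: place each pooled value back where its max came from
    # (recorded in argmax_mask during the forward max-pool); zeros elsewhere
    return nn_upsample(pooled, P) * argmax_mask

x = np.array([[1, 2],
              [3, 4]])
print(nn_upsample(x))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]   -- the slide's nearest-neighbor example

mask = np.array([[1, 0, 0, 0],       # toy "where the max was" indicator,
                 [0, 0, 0, 1],       # one 1 per 2x2 block
                 [0, 1, 0, 0],
                 [0, 0, 1, 0]])
print(max_unpool(x, mask))
```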
Regularized Autoencoders
• Regularization is a way of constraining the code at the bottleneck, for various purposes.
• This includes avoiding the identity function, making the representation efficient (a small code), creating high-information features, etc.
• Given a network loss function
  $\varepsilon[F(J_n), I_n] = \sum_{i=1}^{N}\sum_{j=1}^{M} [F(J_n)(i, j) - I_n(i, j)]^2$
  where $I_n$ are input images and $F(J_n)$ are the corresponding output images (predictions), modify the loss by appending a penalty on the bottleneck code (activations) $H_n$:
  $\varepsilon[F(J_n), I_n] + \lambda\, R(H_n)$ 504

Sparse Autoencoders
• Sparse regularization at the bottleneck is very commonly used:
  $R(H_n) = \sum_{k=1}^{K} |H_n(k)|$
  which is just the L1 norm of the bottleneck activations.
• L1-norm optimization causes the code to be sparse, in the sense that many zero activations are created, leaving a very sparse code of a few non-zero activations.
• The concept is very closely related to "lasso" optimization in statistics (L1-L2 optimization).
• Actually, this sparsity constraint can be applied at any layer of any DLN to enforce sparse representations (and it is).
• An obvious application of these ideas is image compression. 505

Typical Sparse Weights Visualized
• These look a lot like the Gabor functions and derivatives of gaussians, which are good approximations to the receptive fields of human cortical neurons. [Figure] 506

Case Study: Denoising Autoencoder (DAE) 507

Denoising Autoencoder (DAE)
• A simple task is to adapt an autoencoder to perform denoising.
• Instead of optimizing the output to be equal or similar to the input, optimize it to equal a noise-free version of the input.
• It's about training: Given many noise-free images {I_n}, make noisy ones, e.g., add gaussian noise
  $\hat{I}_n = I_n + N_n$
  or another kind of noise: multiplicative, signal-dependent, simulated, etc.
• The autoencoder can be sparse, often producing better results. 508

Denoising Autoencoder
• Training Phase:
[Figure: pristine training images → (noise added) → noisy training images → autoencoder → loss L against the pristine images.]
• The network doesn't have to be an autoencoder, of course. But if it is, it is a denoising autoencoder. 509

Denoising Autoencoder
• Application Phase:
[Figure: noisy images → autoencoder → "denoised" images (simulated idea).]
• Since denoising does not seek an efficient (small) representation, it turns out that overcomplete autoencoders do a better job of denoising than undercomplete ones. 510

Simple Autoencoder Denoising Example
• A small, simple example of denoising images with an autoencoder.
• It operates on gray-scale 28x28 images.
• Architecture:
  – Encoder: 2 conv layers, ReLU, 2x2 max pooling, 32 3x3 filters per layer (image: 28² = 784 pixels; code size: 7x7x32 = 1568)
  – Decoder: basically the reverse (same-size corresponding filters, 2x2 upsampling)
• Training:
  – Images (28x28x1) of handprinted numerals 0–9, using 55,000 training samples with added noise, and 10,000 test samples.
  – Trained over 25 epochs using cross-entropy loss. 511

[Figure: training and validation loss curves.] Note the nice convergence of both the training study and the validation study. This is very good, since it means that the DAE is generalizing and not overfitting. 512

• Pretty good results for such a simple method, but simple DAEs struggle on "harder" images. More advanced DAEs ("stacked" autoencoders, etc.) do better, and the field is changing very fast!
• This simple implementation and the two figures are from S. Malik's teaching website (with more details). 513

Case Study: Deep Residual Denoiser 514

Deep Residual Denoiser
• This is a very competitive method.
• Basic architecture (Denoising CNN, or DnCNN): [Figure]
• It uses a similar concept as ResNet: learn a residual image (the noise). 515

DnCNN Denoiser
• Remember the RES-MSE, which we could not use (no residuals available)?
  $\varepsilon_R[R(J_n)] = \sum_{i=1}^{N}\sum_{j=1}^{M} \{R(J_n)(i, j) - [J_n(i, j) - I_n(i, j)]\}^2$
• This time we can have residuals!
• $I_n$ are clean images, and $J_n$ are noisy images. Train to estimate the residuals (noise) $J_n - I_n$. (A code sketch follows below.) 516

DnCNN Architecture
• Has D layers. In their design the filter size is 3x3. Observe that at layer d the effective span is (2d+1)x(2d+1).
• They use D = 17 (35x35 effective largest filters) to be comparable with leading algorithms.
• They use ReLU but no pooling / downsampling.
• Filters:
  – Layer 1: 64 filters (3x3xc, c = # of colors)
  – Layers 2 – (D−1): 64 filters (3x3x64, with batch normalization)
  – Layer D: c filters (3x3x64) 517

Numerical Predictive Performance
• In terms of PSNR. Four plots:
  – With/without batch normalization (BN)
  – With/without the residual learning (RL) idea (versus just estimating $I_n$)
• Clearly shows the great benefit of both, especially combined. 518

Some Examples
• Trained and tested on various datasets. See the paper.*
[Figure: Original, Noisy, Color BM3D, Color DnCNN.]
*Zhang et al., "Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, Feb. 2017. 519

Some Examples
• They also tried it on JPEG blocking "noise," and compared against the state-of-the-art networks "AR-CNN" and "TNRD" (see the paper):
[Figure: Original, AR-CNN, TNRD, DnCNN.]
• The topic of "de-blocking" is also a hot research area. They used this to show the generality of their method. 520
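To make the residual-learning setup concrete, here is a small sketch of the data and loss plumbing around the (omitted) network (our own illustration; a perfect residual predictor is substituted for the trained DnCNN to show the bookkeeping):

```python
import numpy as np

def make_training_pair(I, sigma, rng):
    # DAE / DnCNN-style training data: noisy input J = I + N and,
    # for residual learning, the residual label R = J - I (the noise itself)
    J = I + rng.normal(0.0, sigma, I.shape)
    return J, J - I

def residual_mse(R_pred, J, I):
    # RES-MSE: compare the predicted residual against J - I
    return ((R_pred - (J - I))**2).mean()

def denoise(J, R_pred):
    # Application phase: subtract the predicted residual from the noisy input
    return J - R_pred

rng = np.random.default_rng(0)
I = np.tile(np.linspace(0, 255, 32), (32, 1))   # stand-in "clean" image
J, R = make_training_pair(I, sigma=15.0, rng=rng)
# A perfect residual predictor achieves zero loss and exact recovery:
print(residual_mse(R, J, I), np.abs(denoise(J, R) - I).max())
```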
Comments
• We now move to a big topic – image compression … onward to Module 6. 521

Module 6
Image Compression
Lossless Image Coding
Lossy Image Coding
JPEG
Image Compression Standards
Deep Compression
QUICK INDEX 522

OBJECTIVES OF IMAGE COMPRESSION
• Create a compressed image that "looks the same" (when decompressed) but can be stored in a fraction of the space.
• Lossless compression means the image can be exactly reconstructed.
• Lossy compression means that visually redundant information is removed. The decompressed image "looks" unchanged, but mathematically has lost information.
• Image compression is important for:
  - Reducing image storage space
  - Reducing image transmission bandwidth 523

Significance for Storage & Transmission
[Figure: a workstation transmits a sequence of HD images over a communication channel (wire, airwaves, optical fiber, etc.) to a receiving workstation; with compression, the same channel carries many more images.]
• Compressed images can be sent at an increased rate: more images per second! 524

[Figure: video phones and video watches as conceived in the 1930s.] 525

Screens are Getting BIGGER
[Figure: the "Flexpai" roll-up TV.] 526
[Figure.] 527

Lossless Image Compression 528

Image Compression Measures
• Bits Per Pixel (BPP) is the average number of bits required to store the gray level of each pixel in I.
• Uncompressed image (K = gray scale range):
  BPP(I) = log₂(K) = B
• Usually B = log₂(256) = 8. 529

Variable-Bit Codes
• The number of bits used to code pixels may spatially vary.
• Suppose that Î is a compressed version of I.
• Let B(i, j) = the # of bits used to code pixel I(i, j). Then
  $BPP(\hat{I}) = \dfrac{1}{NM}\sum_{i=0}^{N-1}\sum_{j=0}^{M-1} B(i, j)$
• If the total number of bits in Î is $B_{total}$, then
  $BPP(\hat{I}) = \dfrac{B_{total}}{NM}$ 530

Compression Ratio
• The Compression Ratio (CR) is
  $CR = \dfrac{BPP(I)}{BPP(\hat{I})}$
• Both BPP and CR are used frequently. 531

LOSSLESS IMAGE COMPRESSION
• Lossless techniques achieve compression with no loss of information.
• The true image can be reconstructed exactly from the coded image.
• Lossless coding doesn't usually achieve high compression, but it has definite applications:
  - In combination with lossy compression, multiplying the gains
  - In applications where information loss is unacceptable.
• Lossless compression ratios are usually in the range 2:1 ≤ CR ≤ 3:1, but this may vary from image to image. 532

Methods for Lossless Coding
• Basically amounts to clever arrangement of the data.
• This can be done in many ways and in many domains (DFT, DCT, wavelet, etc.).
• The most popular methods use variable wordlength coding. 533

Variable Wordlength Coding (Huffman Code) 534

Variable Wordlength Coding (VWC)
• Idea: Use variable wordlengths to code gray levels.
• Assign short wordlengths to gray levels that occur frequently (redundant gray levels).
• Assign long wordlengths to gray levels that occur infrequently.
• On the average, the BPP will be reduced. 535

Image Histogram and VWC
• Recall the image histogram $H_I$. Let L(k) = the # of bits (wordlength) used to code gray level k. Then
  $BPP(\hat{I}) = \dfrac{1}{NM}\sum_{k=0}^{K-1} L(k)\, H_I(k)$
• This is the common measure of BPP for a VWC. 536

Image Entropy
• Recall the normalized histogram values
  $p_I(k) = \dfrac{1}{NM} H_I(k);\ k = 0, …, K-1$
  so $p_I(k)$ = the probability of gray level k.
• The entropy of image I is then
  $E[I] = -\sum_{k=0}^{K-1} p_I(k)\, \log_2 p_I(k)$ 537

Meaning of Entropy
• Entropy is a measure of information with nice properties for designing VWC algorithms. (A code sketch follows below.)
• Entropy E[I] measures:
  – Complexity
  – Information content
  – Randomness 538
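A minimal sketch of computing image entropy from the histogram (our own illustration), verifying the maximum- and minimum-entropy cases discussed on the next slides:

```python
import numpy as np

def image_entropy(I, K=256):
    # E[I] = -sum_k p(k) log2 p(k), with p(k) = H_I(k) / NM
    hist = np.bincount(I.ravel(), minlength=K)
    p = hist / I.size
    p = p[p > 0]                 # 0*log(0) terms contribute nothing
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
flat = rng.integers(0, 256, (128, 128))    # ~flat histogram
const = np.full((128, 128), 97)            # constant image
print(image_entropy(flat))    # close to B = 8 bits (maximum entropy)
print(image_entropy(const))   # 0 bits (minimum entropy)
```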
Maximum Entropy Image
• The entropy is maximized when
  $p_I(k) = \dfrac{1}{K};\ k = 0, …, K-1$
  corresponding to a flat histogram. Then (K = 2^B)
  $E[I] = -\sum_{k=0}^{K-1} \dfrac{1}{K} \log_2 \dfrac{1}{K} = \sum_{k=0}^{K-1} \dfrac{B}{K} = B$
• Generally, image entropy increases as the histogram spreads out. 539

Minimum Entropy Image
• The entropy is minimized when
  $p_I(k_n) = 1$ for some $k_n$, $0 \le k_n \le K-1$,
  hence $p_I(k_m) = 0$ for m ≠ n.
• This is a constant image, and E[I] = 0.
• In fact it is always true that 0 ≤ E[I] ≤ B. 540

Significance of Entropy
• An important theorem limits how much an image can be compressed by a lossless VWC:
  $BPP(\hat{I}) \ge E[I]$
• The VWC is assumed uniquely decodable.
• The $p_I(k)$ are assumed known at both the transmitter and the receiver.
• Thus an image with a flat histogram (E[I] = B) cannot be compressed by a VWC alone. Fortunately, we can often fix this situation by entropy reduction.
• Moreover, a constant image need not be sent: E[I] = 0!
• The entropy provides a compression lower bound as a target. 541

Image Entropy Reduction
• Idea: Compute a new image D with a more compressed histogram than I, but with no loss of information.
[Figure: a lossless invertible transformation maps a high-entropy image (broad histogram $H_I(k)$) to a reduced-entropy image (peaked histogram $H_D(k)$).]
• Must be able to recover I exactly from D – no loss of information.
• Compressing the histogram (Module 3) won't work, since information is lost when two gray levels k₁ and k₂ are mapped to the same new gray level k₃. 542

Entropy Reduction by Differencing (Simple DPCM)
• Differential pulse-code modulation (DPCM) is effective for lossless image entropy reduction. (A code sketch follows below.)
• Images are mostly smooth: neighboring pixels often have similar values.
• Define a difference image D using either 1-D or 2-D differencing:
  (1-D) D(i, j) = I(i, j) − I(i, j−1)
  (2-D) D(i, j) = I(i, j) − I(i−1, j) − I(i, j−1) + I(i−1, j−1)
  for 0 ≤ i ≤ N−1, 1 ≤ j ≤ M−1.
• The new histogram $H_D$ usually will be more compressed than $H_I$, so that E[D] < E[I].
• This is a rule of thumb for images, not a mathematical result. 543

Reversing DPCM
• If (1-D) is used, then I(i, j) = D(i, j) + I(i, j−1).
• The first column of I must also be transmitted.
• If (2-D) is used, then I(i, j) = D(i, j) + I(i−1, j) + I(i, j−1) − I(i−1, j−1),
  where the first row and first column of I must also be transmitted.
• The overhead of the first row and column is small, and they can be separately compressed.
• Hereafter we will assume an image I has either been histogram compressed by DPCM or doesn't need to be. DEMO 544
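A minimal sketch of 1-D DPCM and its exact inverse (our own illustration), showing the entropy reduction on a smooth synthetic image:

```python
import numpy as np

def image_entropy(I, K=512):
    # Entropy from the histogram (K large enough for shifted differences)
    hist = np.bincount(I.ravel(), minlength=K)
    p = hist[hist > 0] / I.size
    return -(p * np.log2(p)).sum()

def dpcm_1d(I):
    # D(i, j) = I(i, j) - I(i, j-1); the first column is kept separately
    # so the transform is exactly invertible
    return np.diff(I.astype(int), axis=1)

def inverse_dpcm_1d(first_col, D):
    # I(i, j) = D(i, j) + I(i, j-1), rebuilt column by column via cumsum
    return np.cumsum(np.hstack([first_col[:, None], D]), axis=1)

# A smooth synthetic image: a ramp plus mild noise
rng = np.random.default_rng(0)
I = (np.tile(np.linspace(0, 255, 64), (64, 1)) +
     rng.normal(0, 2, (64, 64))).astype(int).clip(0, 255)
D = dpcm_1d(I)
# Shift differences by +256 so they index a non-negative histogram
print(image_entropy(I), image_entropy(D + 256))   # entropy drops after DPCM
print(np.array_equal(inverse_dpcm_1d(I[:, 0], D), I))  # lossless: True
```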
Optimal Variable Wordlength Code
• Recall the theoretical lower bound using a VWC:
  $BPP(\hat{I}) \ge E[I] = -\sum_{k=0}^{K-1} p_I(k) \log_2 p_I(k)$
• Observe: In any VWC, coding each gray level k using a wordlength of L(k) bits gives an average wordlength
  $BPP(\hat{I}) = \sum_{k=0}^{K-1} p_I(k)\, L(k)$
• Compare with the above. If L(k) = −log₂ p_I(k), then an optimum code has been found – the lower bound is attained! 545

Optimal VWC
• IF we can find a VWC such that L(k) = −log₂ p_I(k) for k = 0, …, K−1, then the code is optimal.
• This is impossible if −log₂ p_I(k) ≠ an integer for some k.
• So: define an optimal code Î as one that satisfies:
  - BPP(Î) ≤ BPP of any other code
  - BPP(Î) = E[I] if the −log₂ p_I(k) are all integers 546

The Huffman Code
• The Huffman algorithm yields an optimum code. (A code sketch follows below.)
• For a set of gray levels {0, …, K−1} it gives a set of code words c(k); 0 ≤ k ≤ K−1, such that
  $BPP(\hat{I}) = \sum_{k=0}^{K-1} p_I(k)\, L[c(k)]$
  is the smallest possible. 547

Huffman Algorithm
• Form a binary tree with branches labeled by the gray levels $k_m$ and their probabilities $p_I(k_m)$:
  (0) Eliminate any $k_m$ where $p_I(k_m) = 0$.
  (1) Find the 2 smallest probabilities $p_m = p_I(k_m)$, $p_n = p_I(k_n)$.
  (2) Replace them by $p_{mn} = p_m + p_n$ to form a node; the list shrinks by 1.
  (3) Label the branch for $k_m$ with (e.g.) '1' and for $k_n$ with '0'.
  (4) Until the list has only 1 element (the root is reached), return to (1).
• In step (3), the values '1' and '0' are assigned to element pairs $(k_m, k_n)$, element triples, etc., as the process progresses. 548

Huffman Tree Example 1
• There are K = 8 values {0, …, 7} to be assigned codewords:
  p_I(0) = 1/2, p_I(1) = 1/8, p_I(2) = 1/8, p_I(3) = 1/8, p_I(4) = 1/16, p_I(5) = 1/32, p_I(6) = 1/32, p_I(7) = 0
• The process creates a binary tree, with the values '1' and '0' placed (e.g.) on the right and left branches at each stage. 549

Huffman Tree Example 1
  k:        0     1     2     3     4      5       6
  p_I(k):  1/2   1/8   1/8   1/8   1/16   1/32    1/32
  c(k):     1    011   010   001   0001   00001   00000
  L[c(k)]:  1     3     3     3     4      5       5
• BPP = Entropy = 2.1875 bits; CR = 1.37 : 1 550

Huffman Tree Example 2
• There are K = 8 values {0, …, 7} to be assigned codewords:
  p_I(0) = 0.4, p_I(1) = 0.08, p_I(2) = 0.08, p_I(3) = 0.2, p_I(4) = 0.12, p_I(5) = 0.08, p_I(6) = 0.04, p_I(7) = 0.00 551

Huffman Tree Example 2
  k:        0     1      2      3     4     5      6
  p_I(k):  0.4   0.08   0.08   0.2   0.12  0.08   0.04
  c(k):     1    0111   0110   010   001   0001   0000
  L[c(k)]:  1     4      4      3     3     4      4
• BPP = 2.48 bits; Entropy = 2.42 bits; CR = 1.2 : 1 552

Huffman Decoding
• The Huffman code is a uniquely decodable code. There is only one interpretation for a series of codewords (series of bits).
• Decoding progresses by traversing the tree. 553

Huffman Decoding Example
• In the second example, this sequence is received:
  00010110101110000010000100010110111010
• It is sequentially examined until a codeword is identified. This continues until all are identified:
  0001 0110 1 0111 0000 010 0001 0001 0110 1 1 1 010
• The decoded sequence is: 5 2 0 1 6 3 5 5 2 0 0 0 3 554

Comments on Huffman Coding
• Huffman image compression usually achieves 2:1 ≤ CR ≤ 3:1.
• Huffman codes are quite noise-sensitive (much more so than an uncompressed image).
• Error-correction coding can improve this, but it increases the coding rate somewhat.
• Some Huffman codes aren't computed from the image statistics – instead they assume "typical" frequencies of occurrence of values (gray-level or transform values) – as in JPEG. 555

Exercise
• Work through the two examples just given.
• But re-order the probabilities in various ways so that they are not "descending." 556
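A minimal sketch of the Huffman construction using a min-heap (our own implementation of the merge-the-two-smallest procedure described on the algorithm slide), checked against Example 1:

```python
import heapq
import itertools

def huffman_code(probs):
    # Repeatedly merge the two least-probable nodes; prepend '1' to every
    # codeword in one merged group and '0' to the other.
    # probs: {symbol: probability}; zero-probability symbols are dropped.
    code = {k: '' for k, p in probs.items() if p > 0}
    counter = itertools.count()           # tie-breaker for equal probabilities
    heap = [(p, next(counter), [k]) for k, p in probs.items() if p > 0]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, g1 = heapq.heappop(heap)   # the two smallest probabilities
        p2, _, g2 = heapq.heappop(heap)
        for k in g1:
            code[k] = '1' + code[k]
        for k in g2:
            code[k] = '0' + code[k]
        heapq.heappush(heap, (p1 + p2, next(counter), g1 + g2))
    return code

# Example 1 from the slides: dyadic probabilities, so BPP attains the entropy
p = {0: 1/2, 1: 1/8, 2: 1/8, 3: 1/8, 4: 1/16, 5: 1/32, 6: 1/32, 7: 0}
c = huffman_code(p)
print(c)
print(sum(p[k] * len(c[k]) for k in c))   # 2.1875 bits, so CR = 3/2.1875
```

(The exact bit patterns depend on how ties are broken, but the codeword lengths, and hence the BPP, match the slide's tree.)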
Lossy Image Compression 557

LOSSY IMAGE CODING
• Many approaches have been proposed. We will review a few popular ones:
  - Block Truncation Coding (BTC) – simple, fast
  - Discrete Cosine Transform (DCT) Coding and the JPEG Standard
  - Deep Image Compression 558

Goals of Lossy Coding
• To optimize and balance the three C's:
  – Compression achieved by coding
  – Computation required by coding and decoding
  – Cuality (quality) of the decompressed image 559

Broad Methodology of Lossy Compression
• There have been many proposed lossy compression methods.
• The successful ones broadly (and loosely) follow three steps:
  (1) Transform the image to another domain and/or extract specific features. (Make the data more compressible, like DPCM.)
  (2) Quantize in this domain or those features. (Loss)
  (3) Efficiently organize and/or entropy code the quantized data. (Lossless) 560

Block Coding of Images
• Most lossy methods begin by partitioning the image into M x M sub-blocks that are individually coded. Wavelet methods are an exception.
[Figure: an image partitioned into a grid of M x M blocks.] 561

Why Block Coding?
• Reason: images are highly nonstationary: different areas of an image may have different properties, e.g., more high or low frequencies, more or less detail, etc.
• Thus, local coding is more efficient. Wavelet methods provide localization without blocks.
• Typical block sizes: 4 x 4, 8 x 8, 16 x 16. 562

Block Truncation Coding (BTC) 563

Block Truncation Coding (BTC)
• Fast, but limited compression. (A code sketch follows below.)
• Uses 4 x 4 blocks, each containing 16 · 8 = 128 bits.
• Each 4 x 4 block is coded identically, so there is no need to index them.
• Consider the coding of a single block, which we will denote {I(1), …, I(16)}. 564

BTC Coding Algorithm
(1) Quantize the block sample mean
  $\bar{I} = \dfrac{1}{16}\sum_{p=1}^{16} I(p)$
  and transmit/store it with B₁ bits.
(2) Quantize the block sample standard deviation
  $\sigma_I = \sqrt{\dfrac{1}{16}\sum_{p=1}^{16} [I(p) - \bar{I}]^2}$
  and transmit/store it with B₂ bits. 565

BTC Coding Algorithm
• Compute a 16-bit binary block:
  b(p) = 1 if I(p) ≥ Ī; b(p) = 0 if I(p) < Ī

  121 114  56  47        1 1 0 0
   37 200 247 255   →    0 1 1 1      (Ī = 98.75)
   16   0  12 169        0 0 0 1
   43   5   7 251        0 0 0 1

  which requires 16 bits to transmit/store. 566

BTC Quantization
• The quality/compression of BTC-coded images depends on the quantization of the mean and standard deviation.
• If B₁ = B₂ = 8, then 32 bits / block are transmitted: CR = 128 / 32 = 4:1.
• OK quality if B₁ = 6, B₂ = 4 (26 bits total): CR = 128 / 26 ≈ 5:1.
• Using B₁ > B₂ is OK: the eye is quite sensitive to the presence of variation, but not to the magnitude of variation: the Mach Band illusion. 567

BTC Block Decoding
• To form the "decoded" pixels J(1), …, J(16):
  (1) Let Q = the number of '1's and P = the number of '0's in the binary block.
  (2) If b(p) = 1, set $J(p) = \bar{I} + \sigma_I / A$; if b(p) = 0, set $J(p) = \bar{I} - \sigma_I A$, where $A = \sqrt{Q/P}$.
• It is possible to show that this forces $\bar{J} \approx \bar{I}$ and $\sigma_J \approx \sigma_I$. 568

BTC Block Decoding Example
• In our example: Q = 7, P = 9, A = 0.8819
  Ī = INT[98.75 + 0.5] = 99; σ_I = INT[92.95 + 0.5] = 93
• If b(p) = 1, set J(p) = 99 + INT[93/0.882 + 0.5] = 204; if b(p) = 0, set J(p) = 99 − INT[93 · 0.882 + 0.5] = 17.

  204 204  17  17
   17 204 204 204        Here J̄ ≈ 98.8, σ_J ≈ 77.3
   17  17  17 204
   17  17  17 204        DEMO 569

Comments on BTC
• Attainable compressions by simple BTC are in the range 4:1 – 5:1.
• Combined with entropy coding of the quantized data: in the range of 10:1.
• Popular in low-bandwidth, low-complexity applications. Used by the Mars Pathfinder! 570
[Figure.] 571
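A minimal BTC encode/decode sketch (our own code; the B₁/B₂-bit quantization of the mean and standard deviation is simplified to integer rounding), run on the slide's 4x4 example block:

```python
import numpy as np

def btc_encode(block):
    # Code a 4x4 block by its (rounded) mean, standard deviation, and a
    # 16-bit binary mask b(p) = 1 where I(p) >= mean
    mean = block.mean()
    sigma = block.std()                 # population standard deviation
    bits = (block >= mean).astype(int)
    return int(round(mean)), int(round(sigma)), bits

def btc_decode(mean, sigma, bits):
    # Two reconstruction levels chosen to preserve the block mean and
    # (approximately) the spread: A = sqrt(Q/P);
    # b=1 -> mean + sigma/A, b=0 -> mean - sigma*A
    Q = bits.sum()
    P = bits.size - Q
    A = np.sqrt(Q / P)
    hi = int(round(mean + sigma / A))
    lo = int(round(mean - sigma * A))
    return np.where(bits == 1, hi, lo)

block = np.array([[121, 114,  56,  47],
                  [ 37, 200, 247, 255],
                  [ 16,   0,  12, 169],
                  [ 43,   5,   7, 251]], dtype=float)
m, s, b = btc_encode(block)
J = btc_decode(m, s, b)
print(m, s)               # 99, 93 -- as in the slides
print(J)                  # two-level block with levels 204 and 17
print(J.mean(), J.std())  # mean ~98.8 preserved; spread close to sigma
```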
JPEG 572

JPEG
• The JPEG Standard remains the most widely-used method of compressing images.
• JPEG = Joint Photographic Experts Group, an international standardization committee.
• Standardization allows device and software manufacturers to have an agreed-upon format for compressed images.
• In this way, everyone can "talk" and "share" images.
• The overall JPEG Standard is quite complex, but the core algorithm is based on the Discrete Cosine Transform (DCT). 573

Discrete Cosine Transform (DCT)
• The DCT of an N x M image or sub-image:
  $\tilde{I}(u, v) = \dfrac{4\, C_N(u)\, C_M(v)}{NM} \sum_{i=0}^{N-1}\sum_{j=0}^{M-1} I(i, j)\, \cos\!\left[\dfrac{(2i+1)u\pi}{2N}\right] \cos\!\left[\dfrac{(2j+1)v\pi}{2M}\right]$
• The IDCT:
  $I(i, j) = \sum_{u=0}^{N-1}\sum_{v=0}^{M-1} C_N(u)\, C_M(v)\, \tilde{I}(u, v)\, \cos\!\left[\dfrac{(2i+1)u\pi}{2N}\right] \cos\!\left[\dfrac{(2j+1)v\pi}{2M}\right]$
  where
  $C_N(u) = \begin{cases} 1/\sqrt{2}, & u = 0 \\ 1, & u = 1, …, N-1 \end{cases}$ 574

DCT Basis Functions
• Displayed as 8 x 8 images. [Figure] 575

DCT vs. DFT
• JPEG is based on quantizing and re-organizing block DCT coefficients.
• The DCT is very similar to the DFT. Why use the DCT?
• First, O(N² log N) algorithms exist for the DCT – slightly faster than the DFT: all real, integer-only arithmetic.
• The DCT yields better-quality compressed images than the DFT, which suffers from more serious block artifacts.
• Reason: The DFT implies the image is N-periodic, whereas the DCT implies the once-reflected image is 2N-periodic. 576

Periodicity Implied by DFT
• A length-N (1-D) signal, and the periodic extension (period N) implied by the DFT. [Figure] 577

Periodicity Implied by DCT
• The periodic extension (period 2N) of the reflected signal implied by the DCT. [Figure]
• In fact (a good exercise): The DFT of the length-2N reflected signal yields (two periods of) the DCT of the length-N signal. 578

Quality Advantage of DCT Over DFT
• The periodic extension of the DFT contains high-frequency discontinuities.
• These high frequencies must be well-represented in the code, or the edges of the blocks will degrade, creating visible blocking artifacts.
• The DCT does NOT have meaningless discontinuities to represent – so there is less to encode, and hence higher quality for a given CR. 579
[Figure.] 580

Overview of JPEG
• The commercial industry standard – formulated by the CCITT Joint Photographic Experts Group (JPEG).
• Uses the DCT as the central transform.
• The overall JPEG algorithm is quite complex. Three components:
  - JPEG baseline system (basic lossy coding)
  - JPEG extended features (12-bit; progressive; arithmetic)
  - JPEG lossless coder 581

JPEG Standard Documentation
• The written standard is gigantic. The best resource: Pennebaker and Mitchell. 582

JPEG Baseline Flow Diagram
[Figure: image I partitioned into 8x8 blocks → DCT → quantization (using quantization tables) → rearrange data (zig-zag) → DC coefficients coded by DPCM, AC coefficients by RLC → entropy coding (using Huffman coding tables) → compressed data.] 583

JPEG Baseline Algorithm
(1) Partition the image into 8 x 8 blocks; transform each block using the DCT. Denote these DCT blocks by $\tilde{I}_k(u, v)$.
(2) Pointwise divide each block by an 8 x 8 user-defined normalization array Q(u, v). This is stored / transmitted as part of the code. It is usually designed using sensitivity properties of human vision.
    (The JPEG committee performed a human study on the visibility of DCT basis functions under a specific viewing model.) 584

JPEG Baseline Algorithm
(3) Uniformly quantize the result:
  $\tilde{I}_k^Q(u, v) = INT\!\left[\dfrac{\tilde{I}_k(u, v)}{Q(u, v)} + 0.5\right]$
  yielding an integer array with many zeros. 585

JPEG Quantization Example
• A block DCT (via an integer-only algorithm):

  1260   -1  -12   -5   -3    2   -3    0
   -23  -17   -6    1    2    0   -1    0
   -11   -9   -2   -1   -2    2    1    1
    -7   -2    0   -3    0    1    1   -1
    -1   -1    1    0   -1    0    1   -1
     2    0    2    2   -2    0    0    0
    -1    0    0   -1    0   -1    1   -1
    -3    2   -4    1    2    1   -1    0

• Typical JPEG normalization array:

  Q = 16  11  10  16  24  40  51  61
      12  12  14  19  26  58  60  55
      14  13  16  24  40  57  69  56
      14  17  22  29  51  87  80  62
      18  22  37  56  68 109 103  77
      24  35  55  64  81 104 113  92
      49  64  78  87 103 121 120 101
      72  92  95  98 112 100 103  99   586

JPEG Quantization Example
• The resulting quantized DCT array:

  79   0  -1   0   0   0   0   0
  -2  -1   0   0   0   0   0   0
  -1  -1   0   0   0   0   0   0
   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0

• Notice all the zeros. The DC value $\tilde{I}_k^Q(0, 0)$ is in the upper left corner.
• This works because the DCT has good energy compaction, leading to a sparse image representation. 587

Data Re-Arrangement
• Rearrange the quantized AC values $\tilde{I}_k^Q(u, v)$; (u, v) ≠ (0, 0).
• This array contains mostly zeros, especially at high frequencies. So, rearrange the array into a 1-D vector using zig-zag ordering (see the code sketch below):

   0   1   5   6  14  15  27  28
   2   4   7  13  16  26  29  42
   3   8  12  17  25  30  41  43
   9  11  18  24  31  40  44  53
  10  19  23  32  39  45  52  54
  20  22  33  38  46  51  55  60
  21  34  37  47  50  56  59  61
  35  36  48  49  57  58  62  63   588
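A minimal sketch of the DCT-quantize-zigzag pipeline (our own illustration; it uses SciPy's orthonormal `dctn`, which differs from the slide's 4/NM normalization only by fixed scale factors, and applies JPEG's customary level shift of 128):

```python
import numpy as np
from scipy.fft import dctn, idctn

# The standard JPEG luminance normalization array (from the slide)
Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
              [12, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68,109,103, 77],
              [24, 35, 55, 64, 81,104,113, 92],
              [49, 64, 78, 87,103,121,120,101],
              [72, 92, 95, 98,112,100,103, 99]])

def zigzag_indices(n=8):
    # Walk the anti-diagonals s = i + j, alternating direction:
    # even s runs toward larger j, odd s toward smaller j
    key = lambda ij: (ij[0] + ij[1],
                      ij[1] if (ij[0] + ij[1]) % 2 == 0 else -ij[1])
    return sorted(((i, j) for i in range(n) for j in range(n)), key=key)

def encode_block(block):
    # 2-D DCT, pointwise divide by Q, round to integers, zig-zag scan
    d = dctn(block.astype(float) - 128.0, norm='ortho')
    q = np.round(d / Q).astype(int)
    return [q[i, j] for i, j in zigzag_indices()]

def decode_block(zz):
    # Reverse: un-scan, multiply by Q, inverse DCT, undo the level shift
    q = np.zeros((8, 8))
    for val, (i, j) in zip(zz, zigzag_indices()):
        q[i, j] = val
    return idctn(q * Q, norm='ortho') + 128.0

block = np.full((8, 8), 128.0) + np.outer(np.arange(8), np.ones(8)) * 4
zz = encode_block(block)
print(zz[:10], '... mostly zeros after the first few coefficients')
print(np.abs(decode_block(zz) - block).max())   # small quantization error
```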
Data Re-Arrangement Example
• The reordered quantized block from the previous example:
  [ 79 0 -2 -1 -1 -1 0 0 -1 (55 0's) ]
• Many zeros. 589

Handling of DC Coefficients
• Apply simple DPCM to the DC values $\tilde{I}_k^Q(0, 0)$ between adjacent blocks to reduce the DC entropy.
• The difference between the current-block DC value and the left-adjacent-block DC value is found:
  $e(k) = \tilde{I}_k^Q(0, 0) - \tilde{I}_{k-1}^Q(0, 0)$
• The differences e(k) are losslessly coded by a lossless JPEG Huffman coder (with an agreed-upon table).
• The first column of DC values must be retained to allow reconstruction. 590

Huffman Coding of DC Differences
• A Huffman code is generated and stored/sent indicating which category (SSSS) the DC value falls in (see Pennebaker and Mitchell for the Huffman codes).
• SSSS is also the # of bits allocated to code the DPCM differences and sign. 591

Run-Length Coding of AC Coefficients
• The AC vector contains many zeros.
• By using RLC, considerable compression is attained.
• The AC vector is converted into 2-tuples (Skip, Value), where
  Skip = the number of zeros preceding a non-zero value
  Value = the following non-zero value.
• When the final non-zero value is encountered, send (0, 0) as end-of-block. 592

Huffman Coding of AC Values
• The AC pairs (Skip, Value) are coded using another JPEG Huffman coder.
• A Huffman code represents which category (SSSS) the AC pair falls in.
• RRRR bits are used to represent the runlength of zeros; SSSS additional bits are used to represent the AC magnitude and sign.
• See Pennebaker and Mitchell for details! 593

JPEG Decoding
• Decoding is accomplished by reversing the Huffman coding, RLC, and DPCM coding to recreate $\tilde{I}_k^Q$.
• Then multiply by the normalization array to create the lossy DCT:
  $\tilde{I}_k^{lossy} = Q \odot \tilde{I}_k^Q$ (pointwise product)
• The decoded image block is the IDCT of the result:
  $I_k^{lossy} = IDCT[\tilde{I}_k^{lossy}]$ 594

JPEG Decoding
• The overall compressed image is recreated by putting together the compressed 8 x 8 pieces:
  $I^{lossy} = \bigcup_k I_k^{lossy}$
• The compressions that can be attained range over:
  8 : 1 (very high quality)
  16 : 1 (good quality)
  32 : 1 (poor quality for most applications) 595

JPEG Examples
[Figures: original 512 x 512 Barbara, compressed at 16:1, 32:1, 64:1; original 512 x 512 Boats, compressed at 16:1, 32:1, 64:1. DEMO] 596 597 598 599 600 601 602 603

COLOR JPEG
• Basically, the RGB image is converted to a YCrCb image (where Y = intensity, Cr/Cb = chrominances).
• Each channel Y, Cr, Cb is separately JPEG-coded (no cross-channel coding).
• The chrominance images are subsampled first. This is a strong form of "pre-compression"!
• A different quantization / normalization table is used. 604

Digital Color Images
• YCrCb is the modern color space used for digital images and videos.
• A similar but simpler definition:
  Y = 0.299 R + 0.587 G + 0.114 B
  Cr = R − Y
  Cb = B − Y
• Used in modern image and video codecs like JPEG and H.264.
• Why use YCrCb? Reduced bandwidth. Chrominance info can be sent in a fraction of the bandwidth of the luminance info.
• In addition to the color information being "lower bandwidth," the chrominance components are entropy-reduced. DEMO 605

Chroma Sampling 606

Chroma Sampling
• Since color is a regional attribute of space, it can be sampled more sparsely (subsampled more heavily) than luminance.
• Image and video codecs almost invariably subsample the chromatic components prior to compression.
• While not always referred to in this way, this is really the first step of compression.
• Taken together, color differencing to reduce entropy and color subsampling both significantly improve image compression performance. (A code sketch follows below.) 607
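A minimal sketch of the color-difference transform and 4:2:0 chroma subsampling (our own illustration, using the slide's simple Y/Cr/Cb definition; here each 2x2 chroma block is replaced by its average, and nearest-neighbor interpolation restores full resolution for display):

```python
import numpy as np

def rgb_to_ycrcb(rgb):
    # The slide's simple color-difference definition:
    # Y = 0.299 R + 0.587 G + 0.114 B, Cr = R - Y, Cb = B - Y
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    Y = 0.299 * R + 0.587 * G + 0.114 * B
    return Y, R - Y, B - Y

def subsample_420(C):
    # 4:2:0 -- keep one chroma sample per 2x2 luma block (their average)
    h, w = C.shape
    return C.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(C):
    # Interpolate back toward 4:4:4 for display (nearest neighbor here)
    return np.repeat(np.repeat(C, 2, axis=0), 2, axis=1)

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, (16, 16, 3)).astype(float)
Y, Cr, Cb = rgb_to_ycrcb(rgb)
Cr420, Cb420 = subsample_420(Cr), subsample_420(Cb)
# Chroma now carries 1/4 the samples: a strong "pre-compression"
print(Y.size, Cr420.size, Cb420.size)        # 256 64 64
print(np.abs(upsample(Cr420) - Cr).mean())   # interpolation error
```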
Chroma Sampling Formats
• Modern chroma sampling formats:
  4:4:4 – the sampling rates of Y, Cr, Cb are the same
  4:2:2 – the sampling rates of Cr, Cb are ½ that of Y
  4:1:1 – the sampling rates of Cr, Cb are ¼ that of Y
  4:2:0 – the sampling rates of Cr, Cb are also ¼ that of Y 608

Y-Cb-Cr Sampling
• Used in JPEG and MPEG.
[Figure: lattice diagrams of Y and Cr/Cb sample positions for 4:4:4, 4:2:2, 4:1:1, and 4:2:0 (JPEG).]
• Note: The chroma 4:2:0 samples are just calculated for storage or transport. For display, 4:2:2, 4:1:1, and 4:2:0 samples are interpolated back to 4:4:4. 609

RGB ↔ YCrCb Examples
[Figures (four slides): RGB images, their Y, Cr, and Cb channels, and the YCbCr 4:2:0 versions.] 610 611 612 613

JPEG Color Quantization Table

  Q = 17  18  24  47  99  99  99  99
      18  21  26  66  99  99  99  99
      24  26  56  99  99  99  99  99
      47  66  99  99  99  99  99  99
      99  99  99  99  99  99  99  99
      99  99  99  99  99  99  99  99
      99  99  99  99  99  99  99  99
      99  99  99  99  99  99  99  99

• Sparser sampling and more severe quantization of the chrominance values is acceptable.
• Color is primarily a regional attribute and contains less detail information. 614

Color JPEG Examples
[Figures at compression ratios 2.6:1, 15:1, 23:1, 46:1, 144:1.] 615

Wavelet Image Compression 616

Wavelet Image Coding
• Idea: Compress images in the wavelet domain.
• Image blocks are not used. Hence: no blocking artifacts!
• Forms the basis of the JPEG2000 standard.
• Also requires perfect reconstruction filter banks. 617

Compression Via DWT
[Figure: the image I passes through an analysis filter bank H₀, H₁, …, H_{P−1}, each followed by downsampling; the coefficients are quantized; upsampling, synthesis filters G₀, …, G_{P−1}, and summation reconstruct the image Î.]
• The filters $H_p$ form an orthogonal or bi-orthogonal basis with reconstruction filters $G_p$. 618

Wavelet Decompositions
• Recall that wavelet filters separate or decompose the image into high- and low-frequency bands, called subbands. Each frequency band can be coded separately.
[Figure: the 2-D frequency plane split into (Low, Low), (High, Low), (Low, High), and (High, High) subbands, with (0, 0) at the corner.]
• These are called quadrature filters. 619

Wavelet Decompositions
• Typically the frequency division is done iteratively on the lowest-frequency bands:
[Figure: the "Pyramid" wavelet transform, in which the LL band is split recursively (LLHL, LLLH, LLHH, …), versus the "Tree-Structured" wavelet transform.] 620

Wavelet Filter Hierarchy
• The filter outputs are wavelet coefficients.
• Subsampling yields NM nonredundant coefficients. The image can be exactly reconstructed from them.
• Idea: The wavelet coefficients are quantized (like JPEG): higher-frequency coefficients are quantized more severely.
• The JPEG2000 Standard is wavelet-based.
[Figure: the analysis cascade — at each level, the low band is split by H₀ and H₁, each followed by downsampling by 2.] 621

Bi-Orthogonal Wavelets
• Recall the perfect reconstruction conditions:
  $H_0(\omega)G_0(\omega) + H_1(\omega)G_1(\omega) = 2$ for all $0 \le \omega \le \pi$ (DSFT)
  hence also $\tilde{H}_0(u)\tilde{G}_0(u) + \tilde{H}_1(u)\tilde{G}_1(u) = 2$ for all $0 \le u \le N-1$ (DFT),
  and
  $H_0(\omega + \pi)G_0(\omega) + H_1(\omega + \pi)G_1(\omega) = 0$ for all $0 \le \omega \le \pi$ (DSFT)
  hence also $\tilde{H}_0(u + N/2)\tilde{G}_0(u) + \tilde{H}_1(u + N/2)\tilde{G}_1(u) = 0$ for all $0 \le u \le N-1$ (DFT).
• The filters H₀, H₁, G₀, G₁ can be orthogonal quadrature mirror filters (one filter H₀ specifies all the filters).
• Or H₀ and H₁ can be specified jointly, which implies a bi-orthogonal DWT basis expansion. 622

Bi-Orthogonal Wavelets
• Bi-orthogonal wavelets are useful since they can be made even-symmetric.
• This is important because much better results can be obtained in image compression.
• Even though they do not have as good energy compaction as orthogonal wavelets. 623
Bi-Orthogonal Wavelets
• Wavelet-based image compression usually operates on large image blocks (e.g., 64x64 in JPEG2000).
• It is much more efficient to realize block DWTs using cyclic convolutions (shorter).
• As with the DCT, symmetrically extended signals (without discontinuities) give better results than periodically extended ones.
• Symmetric wavelets give better approximations to symmetrically extended signals.
• There are standard bi-orthogonal wavelets. 624

Daubechies 9/7 Bi-Orthogonal Wavelets

  Analysis LP              Analysis HP             Synthesis LP            Synthesis HP
  H0(0) =  0.026749        H1(0) =  0              G0(0) =  0              G1(0) =  0.026749
  H0(1) = -0.016864        H1(1) =  0.091272       G0(1) = -0.091272       G1(1) =  0.016864
  H0(2) = -0.078223        H1(2) = -0.057540       G0(2) = -0.057540       G1(2) = -0.078223
  H0(3) =  0.266864        H1(3) = -0.591270       G0(3) =  0.591270       G1(3) = -0.266864
  H0(4) =  0.602949        H1(4) =  1.115087       G0(4) =  1.115087       G1(4) =  0.602949
  H0(5) =  0.266864        H1(5) = -0.591270       G0(5) =  0.591270       G1(5) = -0.266864
  H0(6) = -0.078223        H1(6) = -0.057540       G0(6) = -0.057540       G1(6) = -0.078223
  H0(7) = -0.016864        H1(7) =  0.091272       G0(7) = -0.091272       G1(7) =  0.016864
  H0(8) =  0.026749        H1(8) =  0              G0(8) =  0              G1(8) =  0.026749

• These are used in the JPEG2000 standard. 625

D9/7 DWT of an Image [Figure] 626

Zero-Tree Encoding Concept
• There are many proposed wavelet-based image compression techniques. Many are complex.
• A simple concept is zero-tree encoding: Similar to JPEG, scan from lower to higher frequencies and run-length code the zero coefficients.
[Figure: zig-zag scanning across the subbands (LL…, HL, LH, HH) to find zeroed coefficients.] 627

Wavelet Decoding
• Decoding requires:
  – Decompression of the wavelet coefficients
  – Reconstructing the image from the decoded coefficients:
[Figure: the synthesis bank — at each level, upsampling by 2, filtering by G₀ and G₁, and summation, cascaded back up to the decoded image.] 628

JPEG2000 Wavelet Compression Example
[Figures: original 512 x 512 Barbara, compressed at 16:1, 32:1, 64:1, 128:1; original 512 x 512 Boats, compressed at 16:1, 32:1, 64:1, 128:1.] 629 630 631 632 633 634 635 636 637 638

Comments on Wavelet Compression
• This has been a watered-down exposition of wavelet coding!
• Reason: there was never much adoption of JPEG2000.
• Why? Much more complex than "old" JPEG – a consideration in an era of cheap memory.
• Questions of propriety (patents and licensing).
• Its performance is primarily better only in the high-compression regime. 639

Case Study: Deep Image Compression 640

Deep Perceptual Image Compression
• From Ballé et al.; find it here.*
• An autoencoder approach with some perceptual twists.
• Main differences:
  – Uses a perceptually-motivated "divisive normalization transform" instead of ReLU or another usual activation function.
  – Uniformly quantizes the code at the bottleneck (simulated during training).
  – Adds an entropy-minimization term to the loss function.
*Ballé et al., "End-to-end optimized image compression," arXiv 1611.01704v3, Mar. 2017. 641

Basic Compression Architecture
[Figure: image → 256 9x9 convolutions + DNT + 4x4 downsample → 192 5x5 convolutions + DNT + 2x2 downsample → 192 5x5 convolutions + DNT + 2x2 downsample → uniform quantizer Q → entropy code C.]
• The uniform quantizer is implemented in two ways. During testing, it simply rounds to integer values. No weighting function.
• During training, since quantization is highly discontinuous, an additive model is used.
• Quantization is modeled as additive uniform white noise U[-0.5, 0.5].
• The final code C is produced by the entropy coder CABAC (more complex than Huffman). 642

Divisive Normalization Transform (DNT)
• A basic vision-science model of retinal (and cortical) neural normalization, for decades.
• Basically, it is a spatial normalization of neural outputs by neighboring neural outputs.
• There are many variations of this model.
• It occurs throughout the sensory nervous system, as a way of adaptive gain control (AGC).
• A good model of the outputs of retinal ganglion neurons. 643

Divisive Normalization Transform (DNT)
• Suppose that $w_{k,i}(m, n)$ are the outputs of filter i in neural layer k (downsampled here).
• Normalize $w_{k,i}(m, n)$ by dividing by neighboring values from the other channels j. 644

Decompression and Loss
[Figure: entropy decode C → 192 5x5 convolutions + IDNT + upsampling stages → 256 9x9 convolutions → decoded image.]
• The "inverse" DNT is an approximation (see the Appendix of the paper).
• The loss function is defined as a weighted sum of the MSE (between the original image $I_n$ and the compressed image $J_n$) and the entropy of the code $C_n$:
  $\varepsilon(J_n, I_n) + \lambda\, E(C_n)$ 645

Compression Example
• Many others are in the paper. [Figure] 646

Example: JPEG [Figure] 647
Example: Perceptual Autoencoder [Figure] 648
Example: JPEG 2000 [Figure] 649

Example: Rate-Distortion Curves
• Very promising. [Figure]
• We might see a new "JPEG" based on DLNs sometime!
• There are many other competing deep compressors.
• A hot research topic. 650

Comments
• So far we have been processing images – now we will try to understand them – image analysis … onward to Module 7. 651

Module 7
Image Analysis I
• Reference Image Quality Prediction
• No-Reference Image Quality Prediction
• Deep Image Quality
• Edge Detection
QUICK INDEX 652

IMAGE QUALITY ASSESSMENT 653

IMAGE QUALITY ASSESSMENT
• What determines the quality of an image?
• The ultimate receiver is the human eye – subjective judgement is all that matters.
• But how can this be determined by an algorithm? How can it be quantified? 654

One Method of Subjective Quality Assessment ….. 655

An Important Problem! 656

Overall Natural Image Communication System
[Figure: the Natural Image Transmitter (natural image → sensing & signal digitizing → front-end digital processing) → the Image Channel → the Natural Image Receiver (back-end digital processing → mapping & image display → perceptual signal).] 657

Sources of Image Distortion
[Figure: the same system diagram, with distortions arising at every stage.] 658

Image Distortions
• Many distortions commonly occur – often in combination:
  – Blocking artifacts (compression)
  – Ringing (compression)
  – Mosaicking (block mismatches)
  – False contouring (quantization)
  – Blur (acquisition or compression)
  – Additive noise (acquisition or channel)
  – Etc. 659

Reference IQA 660

"Reference" Image QA
• An "original" high-quality image is presumed to be known.
• Very useful for assessing the effectiveness of image compression and communication algorithms.
• There exist effective algorithms. 661

MSE / PSNR
• The mean-squared error (MSE) is the long-standing traditional image QA method. (A code sketch follows below.)
• Given an original image I and an observed image J:
  $MSE(I, J) = \dfrac{1}{NM}\sum_{i=0}^{N-1}\sum_{j=0}^{M-1} [I(i, j) - J(i, j)]^2$
• The Peak Signal-to-Noise Ratio (PSNR) is
  $PSNR(I, J) = 10 \log_{10}\!\left[\dfrac{L^2}{MSE(I, J)}\right]$
  where L is the range of allowable gray levels (255 for 8 bits). 662

Advantages of MSE/PSNR
• Computationally simple.
• Analytically easy to work with.
• Easy to optimize algorithms w.r.t. MSE.
• Effective in high-SNR situations. 663

Disadvantages of MSE/PSNR
• Very poor correlation with human visual perception! 664
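A minimal MSE/PSNR sketch (our own illustration). It also previews the failure mode on the next slide: a barely-visible mean luminance shift can score the same PSNR as clearly-visible noise:

```python
import numpy as np

def mse(I, J):
    # MSE(I, J) = (1/NM) * sum of squared pixel differences
    return ((I.astype(float) - J.astype(float))**2).mean()

def psnr(I, J, L=255.0):
    # PSNR = 10 log10(L^2 / MSE); L = peak gray level (255 for 8 bits)
    return 10.0 * np.log10(L**2 / mse(I, J))

rng = np.random.default_rng(0)
I = rng.integers(0, 256, (64, 64))
noisy   = np.clip(I + rng.normal(0, 10, I.shape), 0, 255)
shifted = np.clip(I + 10.0, 0, 255)          # constant luminance shift
# Similar PSNRs, yet a mean shift is far less visible than noise --
# exactly the perceptual failure of MSE discussed here
print(psnr(I, noisy), psnr(I, shifted))
```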
Einstein w/different distortions
[Figure: (a) original image; (b) mean luminance shift, MSE = 309; (c) contrast stretch, MSE = 306; (d) impulse noise, MSE = 313; (e) Gaussian noise, MSE = 309; (f) blur, MSE = 308; (g) JPEG compression, MSE = 309; (h) spatial shift (to the left), MSE = 871; (i) spatial scaling (zoom out), MSE = 694; (j) rotation (CCW), MSE = 590. Images (b)–(g) have nearly identical MSEs but very different visual quality.] 665

Human Subjectivity
• Human opinion is the ultimate gauge of image quality.
• Measuring true image quality requires asking many subjects to view images under calibrated test conditions.
• The resulting Mean Opinion Scores (MOS) are then correlated with QA algorithm performance.
• For decades, MSE had little competition despite 40 years of research! 666

Structural Similarity 667

Structural Similarity-Based Models
• A modern, successful approach: Measure the loss of structure in a distorted image.
• Basic idea: Combine local measurements of the similarity of luminance, contrast, and structure into a local measure of quality.
• Perform a weighted average of the local measure across the image. 668

An Aside: Weber's Law
• Weber's Law: The noticeability of a change in a perceptual stimulus S (weight, brightness, loudness, taste, odor, pain) depends on the percent or ratio of change.
• It may not be noticeable at all unless it exceeds some threshold:
  $\dfrac{\Delta S}{S} > \varepsilon$
  called a just-noticeable difference (JND). 669

Structural Similarity Index (SSIM)
• The SSIM Index expresses the similarity of I and J at a point:
  $SSIM_{I,J}(i, j) = L_{I,J}(i, j)\, C_{I,J}(i, j)\, S_{I,J}(i, j)$
  where
  $L_{I,J}(i, j)$ is a measure of local luminance similarity,
  $C_{I,J}(i, j)$ is a measure of local contrast similarity,
  $S_{I,J}(i, j)$ is a measure of local structure similarity. 670

Comparing Two Numbers
• The DICE index: Given two positive numbers A and B:
  $DICE(A, B) = \dfrac{2AB}{A^2 + B^2}$
  then 0 < DICE(A, B) ≤ 1, with the maximum attained exactly when A = B.
• Avoid divide-by-zero:
  $DICE(A, B) = \dfrac{2AB + \varepsilon}{A^2 + B^2 + \varepsilon}$ 671

[Figure: DICE(A, B) as a function of A and B; the ridge along A = B attains the maximum value 1.] 672

Luminance Similarity
• Luminance similarity:
  $L_{I,J}(i, j) = \dfrac{2\mu_I(i, j)\mu_J(i, j) + C_1}{\mu_I^2(i, j) + \mu_J^2(i, j) + C_1} = \dfrac{2\mu_I \mu_J + C_1}{\mu_I^2 + \mu_J^2 + C_1}$
  where
  $\mu_I(i, j) = \sum_{p=-P}^{P}\sum_{q=-Q}^{Q} w(p, q)\, I(i+p, j+q)$   with   $\sum_{p=-P}^{P}\sum_{q=-Q}^{Q} w(p, q) = 1$;
  w(p, q) is an isotropic, unit-area weighting function and $C_1$ is a stabilizing constant (later). 673

[Figure: $L_{I,J}$ plotted against the local means $\mu_I$ and $\mu_J$.] 674

Luminance Masking
• Weber's perceptual law (applied to the local average luminance):
  $\dfrac{\Delta L}{L_{AVE}} > \tau$
• The luminance term embodies Weber masking: let $C_1 = 0$; then in a patch
  $L_{I,J} = \dfrac{2\mu_I \mu_J}{\mu_I^2 + \mu_J^2}$
  But if a patch luminance of the test image differs only by $\Delta L$ from the corresponding reference image patch, $\mu_J = \mu_I + \Delta L$, then taking $L_{AVE} = \mu_I$ and $\lambda = \Delta L / L_{AVE}$ yields
  $L_{I,J} = \dfrac{2(\lambda + 1)}{(\lambda + 1)^2 + 1}$
• Thus SSIM's luminance term depends approximately just on the Weber ratio $\lambda$; as long as $\Delta L \ll L_{AVE}$, $L_{I,J} \approx 1$. 675

Contrast Similarity
• Contrast similarity:
  $C_{I,J}(i, j) = \dfrac{2\sigma_I(i, j)\sigma_J(i, j) + C_2}{\sigma_I^2(i, j) + \sigma_J^2(i, j) + C_2} = \dfrac{2\sigma_I \sigma_J + C_2}{\sigma_I^2 + \sigma_J^2 + C_2}$
  where
  $\sigma_I(i, j) = \sqrt{\sum_{p=-P}^{P}\sum_{q=-Q}^{Q} w(p, q)\, [I(i+p, j+q) - \mu_I(i, j)]^2}$
  and $C_2$ is a stabilizing constant. 676

[Figure: $C_{I,J}$ plotted against the local standard deviations $\sigma_I$ and $\sigma_J$.] 677

Contrast Masking
• The contrast term includes divisive normalization by the local image energy (of both test and reference).
• Suppose a local patch contrast changes in the test image relative to the corresponding reference patch: $\sigma_J = \sigma_I + \Delta\sigma$.
• Then, with $\theta = \Delta\sigma / \sigma_I$:
  $C_{I,J} = \dfrac{2(\theta + 1)}{(\theta + 1)^2 + 1}$
  which is a Weber view of contrast masking.
678

Contrast Masking [Figure] 679

Structural Similarity
• Structural similarity:
  $S_{I,J}(i, j) = \dfrac{\sigma_{IJ}(i, j) + C_3}{\sigma_I(i, j)\,\sigma_J(i, j) + C_3} = \dfrac{\sigma_{IJ} + C_3}{\sigma_I \sigma_J + C_3}$
  where
  $\sigma_{IJ}(i, j) = \sum_{p=-P}^{P}\sum_{q=-Q}^{Q} w(p, q)\, [I(i+p, j+q) - \mu_I(i, j)][J(i+p, j+q) - \mu_J(i, j)]$
  and $C_3$ is a stabilizing constant. 680

[Figure: $S_{I,J}$ plotted against $\sigma_{I,J}$, $\sigma_I$, and $\sigma_J$.] 681

SSIM Index Flow Diagram
[Figure: the reference and test images pass through luminance, contrast, and structure comparisons, which are combined into the SSIM index.] 682

Properties of SSIM
• Important properties:
  (1) Symmetry: $SSIM_{I,J}(i, j) = SSIM_{J,I}(i, j)$
  (2) Boundedness: $SSIM_{I,J}(i, j) \le 1$
  (3) Unique maximum: $SSIM_{I,J}(i, j) = 1$ if and only if the images are locally identical:
      I(i+p, j+q) = J(i+p, j+q) for −P ≤ p ≤ P, −Q ≤ q ≤ Q. 683

Stabilizing Constants
• $C_1$, $C_2$ and $C_3$ are stabilizers – in case the local means or contrasts are very small.
• For the gray-scale range 0–255, $C_1 = (0.01 \cdot 255)^2$, $C_2 = (0.03 \cdot 255)^2$, $C_3 = C_2/2$ work well and robustly. 684

Structural Similarity Index
• When $C_3 = C_2/2$, the SSIM index simplifies to (Exercise)
  $SSIM_{I,J} = \dfrac{(2\mu_I\mu_J + C_1)(2\sigma_{IJ} + C_2)}{(\mu_I^2 + \mu_J^2 + C_1)(\sigma_I^2 + \sigma_J^2 + C_2)}$
  which is the most common form of the SSIM index. (A code sketch follows below.)
• If $C_1 = C_2 = 0$, then it is the "Universal Quality Index" (UQI):
  $UQI_{I,J} = \dfrac{(2\mu_I\mu_J)(2\sigma_{IJ})}{(\mu_I^2 + \mu_J^2)(\sigma_I^2 + \sigma_J^2)}$
  which does not predict quality as well as SSIM and is more unstable. 685

SSIM Map
• Displaying $SSIM_{I,J}(i, j)$ as an image is called a SSIM Map. It is an effective way of visualizing where the images I, J differ.
• The SSIM map depicts where the quality of one image is flawed relative to the other. 686

[Figure: (a) reference image; (b) JPEG compressed; (c) absolute difference; (d) SSIM Map.] 687

[Figure: (a) original image; (b) additive white Gaussian noise; (c) absolute difference; (d) SSIM Map.] 688

Mean SSIM
• If $SSIM_{I,J}(i, j)$ is averaged over the image, a single scalar metric is arrived at. The Mean SSIM is
  $SSIM(I, J) = \dfrac{1}{NM}\sum_{i=0}^{N-1}\sum_{j=0}^{M-1} SSIM_{I,J}(i, j)$
• The Mean SSIM correlates extraordinarily well with human response, as measured in large human studies as Mean Opinion Score (MOS). 689

SSIM vs. MOS
[Figure: scatter plot of SSIM against MOS on a broad database of images distorted by JPEG, JPEG2000, white noise, gaussian blur, and fast-fading noise. The curve is the best-fitting logistic function. What is important is that the data cluster closely about the curve.] 690

SSIM vs. PSNR Scatter Plots [Figure] 691

Mean SSIM Examples
[Figure: Einstein altered by different distortions. (a) reference image; (b) impulse noise, MSE = 313, SSIM = 0.730; (c) Gaussian noise, MSE = 309, SSIM = 0.576; (d) blur, MSE = 308, SSIM = 0.641; (e) JPEG compression, MSE = 309, SSIM = 0.580. DEMO] 692

Multi-Scale SSIM (MS-SSIM)
• The best existing algorithm. 693

Multi-Scale SSIM
• All terms are combined in a product, possibly with exponent weights (Q = coarsest scale):
  $\text{MS-SSIM}_{I,J} = [L_{I,J}]^{\alpha_Q} \prod_{q=1}^{Q} [C_{I,J}]^{\beta_q} [S_{I,J}]^{\gamma_q}$
• The exponents used vary, but a common choice is*
  β₁ = γ₁ = 0.0448, β₂ = γ₂ = 0.2856, β₃ = γ₃ = 0.3001, β₄ = γ₄ = 0.2363, α₅ = β₅ = γ₅ = 0.1333
*Based on a small human study of distorted image comparisons. Note the shape of the exponent 'curve': bad scores are penalized more at mid-scales (frequencies), similar to the shape of the CSF. 694
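A minimal single-scale SSIM-map sketch (our own illustration; it uses a uniform local window instead of the usual gaussian w(p, q) to keep the code short):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_map(I, J, C1=(0.01*255)**2, C2=(0.03*255)**2, P=5):
    # Local means, variances, and covariance over P x P windows
    wI = sliding_window_view(I.astype(float), (P, P))
    wJ = sliding_window_view(J.astype(float), (P, P))
    muI, muJ = wI.mean(axis=(2, 3)), wJ.mean(axis=(2, 3))
    vI = wI.var(axis=(2, 3))
    vJ = wJ.var(axis=(2, 3))
    cov = (wI * wJ).mean(axis=(2, 3)) - muI * muJ
    # The common simplified form:
    # [(2 muI muJ + C1)(2 cov + C2)] /
    # [(muI^2 + muJ^2 + C1)(vI + vJ + C2)]
    return ((2*muI*muJ + C1) * (2*cov + C2) /
            ((muI**2 + muJ**2 + C1) * (vI + vJ + C2)))

rng = np.random.default_rng(0)
I = rng.integers(0, 256, (64, 64)).astype(float)
print(ssim_map(I, I).mean())                               # 1.0: identical
print(ssim_map(I, I + rng.normal(0, 20, I.shape)).mean())  # < 1: degraded
```

Averaging the map gives the Mean SSIM; applying it across a gaussian pyramid with the exponents above gives MS-SSIM.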
694 European Study of Image Quality Assessment Algorithms
• A three-university European study, conducted across three countries (Finland, Italy, and Ukraine), gathered and analyzed more than 250,000 human judgments (MOS) across 17 distortion types.
• Known as the Tampere Image Database (TID): http://www.ponomarenko.info/tid2008.htm
695 TID Study
696 MS-SSIM / SSIM Usage
• Cable/Internet: Streaming TV providers Netflix, AT&T, Comcast, Discovery, Starz, NBC, FOX, Showtime, Turner, PBS, etc. etc. use SSIM to control streaming video quality from the cable head-end and the Cloud.
• Broadcast: Huge international operators like British Telecom and Comcast continuously run MS-SSIM on many dozens of live HD channels in real-time to control broadcast (through the air) encoding.
• Satellite: Used worldwide by the Sky companies (Sky Brazil, Sky Italy, BSkyB UK, Tata Sky India), Nine and Telstra in Australia, Oi and TV Globo in Brazil (and many more) to monitor and control picture quality on their channel line-ups.
• Discs: Used to control the encoding of video onto DVDs and Blu-Rays (Technicolor, others).
• Equipment: Broadcast encoders made by Cisco, Motorola, Ericsson, Harmonic, Envivio, Intel, TI, etc. etc. … all rely on SSIM for product control (best possible picture quality).
697 Streaming Video Pipeline
[Diagram: (1) video source → perceptual encoder (RAW VIDEO → H.264/HEVC), with SSIM-based encode inspection feeding back to ADJUST the encoder; (2) video transmission over cable, satellite, 4G/5G, or WiFi; (3) video receiver.]
2015 Primetime Emmy Award video on SSIM
700 No-Reference IQA
701 Reference vs. No-Reference
• "Reference" IQA algorithms require that an "original," presumably high quality image be available for comparison. Hence they are really "perceptual video fidelity" algorithms. [Diagram: reference image + distorted image → "reference" QA algorithm.]
• "No-reference" IQA models assume no "original image" is available to compare. Hence they are pure "perceptual video quality" algorithms. [Diagram: distorted image alone → "no-reference" QA algorithm.]
702 Blind or No-Reference IQA
• No-Reference (or "Blind") QA - wherein there is no reference image, nor other information available. The image is assessed based on its individual appearance.
703 Does this image have predictable distortion artifacts?
704 Blind (NR) Image QA (advanced topic)
• A very hard problem. An algorithm must answer "What is the quality of this image?" with only the image as input.
• Complicated by image content, the spatial distribution of distortion, aesthetics, etc.
• There has been significant recent progress on this problem, using:
- statistical image models
- machine learning methods
• The first of these we can cover. The other is beyond this course.
705 Is this image of good quality? Why or why not?
706 How about these?
707 And this one?
708 Which image is of better quality?
709 Case Study: No-Reference IQA
710 BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator)
• A machine learning approach built on a natural image statistics model that holds with great reliability for natural images that are not distorted.
• Given image I, the mean-subtracted contrast-normalized (MSCN) image J is:
$J(i,j) = \dfrac{I(i,j) - m(i,j)}{s(i,j) + 1}$
where
$m(i,j) = \sum_{k=-K}^{K}\sum_{l=-L}^{L} w(k,l)\,I(i-k,\,j-l)$
$s(i,j) = \sqrt{\sum_{k=-K}^{K}\sum_{l=-L}^{L} w(k,l)\,[I(i-k,\,j-l) - m(i-k,\,j-l)]^2}$
and w(k, l) is a unit-volume gaussian-like weighting function.
711 Natural Image Model
• For natural images, the MSCN image is invariably very close to unit-normal gaussian and highly decorrelated!
$p_J(a) = \dfrac{1}{\sqrt{2\pi}}\exp(-a^2/2)$
• Why is this important?
Because the MSCN coefficients of distorted images are usually not very gaussian, and can have significant spatial correlation.
712 [Figure: an image I, its MSCN transform (I − m)/(s + 1) (note the boundary effects), and the normalized histogram of the MSCN values.]
713 Image Scatter Plot
• Plots of adjacent pixels against each other: [Figure: scatter plots of adjacent values of I and of (I − m)/(s + 1).]
714 Distortion Statistics
• Common distortions change gaussianity.
• But distorted MSCN values can be well-modeled as following a generalized gaussian distribution (GGD):
$p_{J_{\mathrm{distorted}}}(a) = D_1\exp\left(-\left|a/s\right|^{\gamma}\right)$
715 Product Model
• Products of adjacent MSCN values should have a certain shape if they are uncorrelated:
$p_{J(i,j)\,J(i-1,j+1)}(a) = D_2\,K_0(|a|)$
where $K_0$ is the modified Bessel function of the second kind.
• This density happens to be infinite at the origin.
• It does not fit distorted image data (adjacent products).
716 Distortions Introduce Correlation
• General model for both distorted and undistorted MSCN products: the asymmetric generalized gaussian distribution (AGGD):
$p_{J(i,j)\,J(i-1,j+1)}(a) = D_3\begin{cases}\exp\left[-\left(|a|/s_L\right)^{\gamma}\right] & a < 0\\ \exp\left[-\left(|a|/s_R\right)^{\gamma}\right] & a \ge 0\end{cases}$
• When there is no distortion, expect $s_L = s_R$.
• Correlation from distortion causes asymmetry, since adjacent MSCN values are then more likely to be of the same sign.
717 Product Histograms
718 BRISQUE Features
• Given an image, compute the histogram of the MSCN coefficients. Fit with a GGD; estimate $\gamma$, s (2 features).
• Compute products of adjacent pixels along four orientations.
• Compute histograms of all four types of products, and for each estimate m, $\gamma$, $s_L$, $s_R$ (16 features).
→ 18 features.
719 Multiscale
• BRISQUE works well using features computed from just two scales. [Diagram: image I → BRISQUE feature extraction; image I → low-pass filter and downsample by 2 → BRISQUE feature extraction; 36 TOTAL features.]
720 [Diagram: training data - 36 features per image plus the target feature (MOS) - fed to a learning machine (Support Vector Regression).]
721 Concept of Machine Learning
• In our context, plot the BRISQUE features and MOS in a high-dimensional space (36 + 1).
• Find a good separation between the clusters of data that are formed. Or simply regress on (curve fit) them. The machine learns what feature values to associate with MOS.
• Sometimes the number of clusters (number of distortion types) is known, when training on data with distortion labels.
• SVR trick: plot the features in a much higher-dimensional space (spreading the clusters out), then perform classification/regression.
722 Support Vector Regression
[Figure: a nonlinear mapping to a high-dimensional space. Most classifiers would have a hard time separating these clusters, even in a low 2-D space.]
723 Application
[Diagram: new image I (distorted or not) → trained learning machine (SVR with gaussian distance weight*) → predicted human opinion (MOS).] [Table: linear correlation coefficients over 1000 random train-test divisions of the LIVE Image Quality Database.]
*Also called a Radial Basis Function in the ML literature.
724 The Algorithm
• Runs in < 1 second in a simple Matlab implementation (once loaded).
• This variation of BRISQUE also predicts the distortion type. DEMO
725 Applications of QA Algorithms
• Assessing the quality of algorithm results, such as denoising, deblurring, compression, enhancement, watermarking, etc. A huge field of applications.
• Controlling the quality of images and videos online (Netflix, YouTube, Facebook, Flickr, Google, etc. etc.), on cable TV, digital cameras, cell phones/tablets, etc. etc.
• Designing algorithms using QA models such as SSIM or BRISQUE instead of the old MSE, in the hope that the image processing algorithms will perform much better! This is also a huge field.
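A sketch of the MSCN computation at the heart of BRISQUE (Python/NumPy, not from the original notes). The unit-volume weighting w is realized as a Gaussian; the width used here is an assumption.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def mscn(I, sigma=7/6):
        # J = (I - m) / (s + 1): local mean removal + divisive normalization
        I = I.astype(np.float64)
        m = gaussian_filter(I, sigma)                            # m(i, j)
        s = np.sqrt(np.abs(gaussian_filter(I*I, sigma) - m*m))   # s(i, j)
        return (I - m) / (s + 1.0)        # the +1 stabilizes flat patches

    # For an undistorted natural image, mscn(I).ravel() histograms as nearly
    # unit-normal; BRISQUE fits GGD/AGGD parameters to these values and to
    # products of adjacent values.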
726 Case Study: Deep Image Quality Prediction
727 Deep Image Quality Prediction
• Oftentimes the innovation is to collect enough sufficiently representative data. This can be very hard.
• This is true in the field of image quality, where the labels are human scores of perceived image quality.
• These are judgments rather than binary decisions (as in ImageNet). Collecting the scores on a single image takes about 10 seconds.
• Getting enough human subjects (thousands are needed) is especially challenging. Laboratory studies cannot accomplish this.
728 IQA Database History
• First public subjective picture quality database: the LIVE Image Quality Database ("LIVE IQA," 2003).
• LIVE IQA: about 800 distorted images in five distortion categories (noise, blur, JPEG, JPEG2000, packet loss).
• Rated by a pool of subjects, yielding about 25,000 human scores.
• Still heavily used. Most leading IQA models (SSIM, VIF, FSIM) have been developed and tested on LIVE IQA.
• Many others have followed LIVE IQA using similar ideas.
• All are far too small and too synthetic (the distortions are not real).
729 Drawbacks of Legacy Databases
[Diagram: ~25 pristine images → singly distorted images → single synthetic distortions.]
• Limited variety of image content (< 30 pristine images).
• They don't model the real-world distortions in billions of mobile camera/social media images. [Figure: synthetic distortions vs. authentic distortions.]
730 LIVE "In the Wild" Image Quality Challenge Database
731 Where are the Images From?
• More than 1100 images collected by hundreds of people in the US and Korea. All are authentically distorted. Wide variety of image content: pictures of people, objects, indoor and outdoor scenes. Authentic distortion types, mixtures, and severities. Includes day/night and over-/under-exposed images.
732 Old Way vs. New Way
• Old way - controlled laboratory settings: fixed display device; fixed display resolution; fixed viewing distance; controlled ambient illumination conditions; skewed subject sample (typically undergraduate and graduate students).
• New way - the LIVE In the Wild Challenge Database:
733 Crowdsourcing
734 State-of-the-Art Blind IQA Models on LIVE Challenge
[Plots: model performance on the LIVE IQA DB vs. the LIVE Challenge DB.]
735 New vs. Old
• Unlike all prior benchmark databases, the In the Wild Mobile Picture Quality Database offers authentic, real-world distortions; highly varied illumination artifacts; diverse source capture devices; mixtures of distortions; no reference images; and uncontrolled viewing conditions for the subjective study.
• Paper here: Ghadiyaram et al., "Massive online crowdsourced study of subjective and objective picture quality," IEEE Transactions on Image Processing, January 2016.
736 Deep NR IQA Models
• Even these databases are still not big enough for end-to-end design of really big networks, like ResNet-50. These types of networks require millions of training samples, or they overfit.
• Such studies are planned / in progress, but are very expensive!
• However, pretty good results are attained using fine-tuning.
• In a broad study it was found that pre-trained deep learners can achieve better results than existing shallow perceptual predictors.
• But not near human performance - a very hard problem, like object recognition.
737 Tested Deep NR IQA Models
• AlexNet + SVR: pretrained on ImageNet; the outputs of the 6th FC layer (4096) are fed to an SVR, fine-tuned on LIVE Challenge. AlexNet has 62M parameters.
• ResNet-50 + SVR: pretrained on ImageNet; average-pooled features (2048) are fed to an SVR, fine-tuned on LIVE Challenge. ResNet-50 has 26M parameters.
• AlexNet + fine-tuning: pretrained on ImageNet; dropout added before the last FC layer; 8 epochs of training on LIVE Challenge.
• ResNet-50 + fine-tuning: pretrained on ImageNet; dropout added before the last FC layer; 6 epochs of training on LIVE Challenge.
• Imagewise CNN: used as a baseline. 8 conv layers with ReLU, 3 FC layers.
738 Baseline Model
Refer to the paper here.* [Architecture diagram: 112x112 image patches pass through stacked convolution + activation layers with 2x2 pooling (channel widths on the order of 48-128, spatial sizes 112 down to 28), then fully connected + activation layers, to a single quality output.]
*Kim et al., "Deep convolutional neural models for picture quality prediction," IEEE Signal Processing Magazine, 2017.
739 Outcomes
• Deep models can (but do not always) beat shallow perceptual models, like BRISQUE. (CORNIA and FRIQUEE are much more complex than BRISQUE.)
• ResNet-50 shows clear superiority with respect to the other models.
• Human performance is around 0.95+ … a long way to go.
741 EDGE DETECTION
742 EDGE DETECTION
• Probably no topic in the image analysis area has been studied as exhaustively as edge detection.
• Edges are sudden, sustained changes in AVERAGE image intensity that extend along a contour.
• Edge detection is used as a precursor to most practical image analysis tasks. Many computer vision algorithms use detected edges as fundamental processing primitives.
• Some reasons for this:
- An enormous amount of information is contained in image edges. An image can be largely recognized from its edges alone.
- An edge map requires much less storage space than the image itself. It is a binary contour plot.
743 Edges?
744 Edges? … Shapes? Neon Color Spreading Illusions
745 Overview of Edge Detection Methods
• Methods for edge detection have been studied since the mid-1960's.
• It has easily been the most-studied problem in the image analysis area.
• Hundreds of different approaches have been devised - based on just about any math technique imaginable for deciding whether one group of numbers is larger than another.
• The problem became (at last) fairly well-understood in the 1980's.
• We will study three fundamental approaches:
- Gradient edge detectors
- Laplacian edge detectors (multi-scale)
- Diffusion-based edge detectors
746 Edge detection before computers
747 GRADIENT EDGE DETECTORS
• Oldest (Roberts 1965) but still valuable class of edge detectors.
• For a continuous 2-D function f(x, y), the gradient is a two-element vector:
$\nabla f(x,y) = [f_x(x,y),\; f_y(x,y)]^T$
where
$f_x(x,y) = \dfrac{\partial f(x,y)}{\partial x}$ and $f_y(x,y) = \dfrac{\partial f(x,y)}{\partial y}$
are the directional derivatives of f(x, y) in the x- and y-directions.
748 Gradient Measurements
• Fact: the direction of the fastest rate of change of f(x, y) at a point (x, y) is the gradient orientation
$\theta_f(x,y) = \tan^{-1}\left[\dfrac{f_y(x,y)}{f_x(x,y)}\right]$
• and that rate of change is the gradient magnitude
$M_f(x,y) = \sqrt{f_x^2(x,y) + f_y^2(x,y)}$
• The gradient is appealing for defining an edge detector, since we expect an edge to locally exhibit the greatest rate of change in image intensity.
749 Isotropic Gradient Magnitude
• Fact: take directional derivatives in any two perpendicular directions (say x′ and y′), and $M_f(x,y)$ remains unchanged.
• $M_f(x,y)$ is rotationally symmetric or isotropic.
• Isotropy is desirable for edge detection - edges are detected equally regardless of orientation.
750 Gradient Edge Detector Diagram
• Described by the following flow diagram: image I → discrete differentiation → point operation (edge enhancement) → threshold (detection) → edge map E
- digital differentiation: digitally approximate $\nabla I$
- point operation: estimate $M_I = |\nabla I|$
- threshold: decide which large values of $M_I$ are likely edge locations
751 Digital Differentiation
• Digital differentiation is differencing. For a 1-D function f(x) which has been sampled to produce f(i), either:
$\dfrac{d}{dx}f(x)\Big|_{x=i} \approx f(i) - f(i-1)$
or
$\dfrac{d}{dx}f(x)\Big|_{x=i} \approx \dfrac{f(i+1) - f(i-1)}{2}$
• The advantage of the first is that it takes the current value f(i) into the computation; the advantage of the second is that it is centered.
752 2-D Differencing
• In two dimensions, the extensions are easy:
$\dfrac{\partial f(x,y)}{\partial x}\Big|_{(i,j)} \approx f(i,j) - f(i-1,j)$ and $\dfrac{\partial f(x,y)}{\partial y}\Big|_{(i,j)} \approx f(i,j) - f(i,j-1)$
$\dfrac{\partial f(x,y)}{\partial x}\Big|_{(i,j)} \approx \dfrac{f(i+1,j) - f(i-1,j)}{2}$ and $\dfrac{\partial f(x,y)}{\partial y}\Big|_{(i,j)} \approx \dfrac{f(i,j+1) - f(i,j-1)}{2}$
753 Example: 1-D Differencing
• The signal I(i) could be a scan line of an image. Let DI(i) = |I(i) − I(i−1)|. [Plots: a step-like scan line I(i) and its difference magnitude DI(i).]
• A clear, easily thresholded peak!
754 Example: 1-D Differencing With Noise
• The signal J(i) = I(i) + N(i) may be a noisy image scan line. Let DJ(i) = |J(i) − J(i−1)|. [Plots: the noisy scan line J(i) and its difference magnitude DJ(i), full of spurious peaks.]
• Noise is a huge problem for this type of edge detector. Differentiation always emphasizes high frequencies (such as noise).
755 Types of Gradient Edge Detectors
• Define convolution edge templates $\nabla_x$ and $\nabla_y$ which produce directional derivative estimates:
- adjacent: $\nabla_x = [-1 \;\; 1]$, $\nabla_y = [-1 \;\; 1]^T$
- centered: $\nabla_x = \frac{1}{2}[-1 \;\; 0 \;\; 1]$, $\nabla_y = \frac{1}{2}[-1 \;\; 0 \;\; 1]^T$
- Roberts': $\nabla_x = \begin{bmatrix} 0 & 1\\ -1 & 0\end{bmatrix}$, $\nabla_y = \begin{bmatrix} 1 & 0\\ 0 & -1\end{bmatrix}$
• The performance of these three is very similar.
756 Noise-Reducing Variations
• Designed to reduce noise effects by averaging along columns and rows:
- Prewitt: $\nabla_x = \frac{1}{3}\begin{bmatrix} -1 & 0 & 1\\ -1 & 0 & 1\\ -1 & 0 & 1\end{bmatrix}$, $\nabla_y = \frac{1}{3}\begin{bmatrix} -1 & -1 & -1\\ 0 & 0 & 0\\ 1 & 1 & 1\end{bmatrix}$
- Sobel: $\nabla_x = \frac{1}{4}\begin{bmatrix} -1 & 0 & 1\\ -2 & 0 & 2\\ -1 & 0 & 1\end{bmatrix}$, $\nabla_y = \frac{1}{4}\begin{bmatrix} -1 & -2 & -1\\ 0 & 0 & 0\\ 1 & 2 & 1\end{bmatrix}$
• These also perform similarly.
757 Gradient Magnitude
• The point operation combines the directional derivative estimates $\nabla_x$ and $\nabla_y$ into a single estimate of the gradient magnitude.
• The usual estimates:
(A) $M(i,j) = \sqrt{\nabla_x^2(i,j) + \nabla_y^2(i,j)}$
(B) $M(i,j) = |\nabla_x(i,j)| + |\nabla_y(i,j)|$
(C) $M(i,j) = \max\{|\nabla_x(i,j)|,\, |\nabla_y(i,j)|\}$
• The following always hold (Exercise): C ≤ A ≤ B.
• (A) is the correct interpretation, but (B) and (C) are cheaper - no square or square root operations.
• (B) often overestimates the magnitude of an edge, while (C) often underestimates it.
758 The Texas Instruments Gradient Magnitude
• Better than (B) or (C) is
(D) $M(i,j) = \max\{|\nabla_x(i,j)|,\, |\nabla_y(i,j)|\} + \frac{1}{4}\min\{|\nabla_x(i,j)|,\, |\nabla_y(i,j)|\}$
• Still, the differences between (A)-(D) are slight.
759 Thresholding the Gradient Magnitude
• Once the estimate M(i, j) is obtained, it is thresholded to find plausible edge locations.
• This produces the binary edge map E:
$E(i,j) = \begin{cases} 1 & M(i,j) > \tau\\ 0 & M(i,j) \le \tau\end{cases}$
• Thus:
- a value '1' indicates the presence of an edge at (i, j)
- a value '0' indicates the absence of an edge at (i, j)
• The threshold τ constrains the sharpness and magnitude of the edges that are detected.
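The whole gradient edge detector fits in a few lines. A Python/NumPy sketch (not from the original notes) using the Sobel templates and magnitude estimate (A); the threshold is found by trial and error, as noted above.

    import numpy as np
    from scipy.ndimage import convolve

    def gradient_edge_map(I, tau):
        I = I.astype(np.float64)
        gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]) / 4.0   # Sobel
        gy = gx.T
        dx, dy = convolve(I, gx), convolve(I, gy)  # directional derivatives
        M = np.sqrt(dx**2 + dy**2)                 # magnitude estimate (A)
        return (M > tau).astype(np.uint8)          # binary edge map E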
760 DEMO
Gradient Edge Detector Advantages
• Simple, computationally efficient.
• Natural definition.
• Works well on "clean" images.
761 Gradient Edge Detector Disadvantages
• Extremely noise-sensitive.
• Requires a threshold - difficult to select - usually requiring interactive selection for best results.
• The gradient magnitude estimate will often fall above threshold over a few pixels' distance from the true edge. So, the detected edges are often a few pixels wide.
• This usually requires some kind of "edge thinning" operation - usually heuristic.
• The edge contours are often broken - gaps appear. This requires some kind of "edge linking" operation - usually heuristic.
762 LAPLACIAN EDGE DETECTORS
• These edge detectors are based on second derivatives.
• For a continuous 2-D function f(x, y), the Laplacian is defined:
$\nabla^2 f(x,y) = \dfrac{\partial^2 f(x,y)}{\partial x^2} + \dfrac{\partial^2 f(x,y)}{\partial y^2}$
• It is a scalar, not a vector.
• Fact: if the directional derivatives are taken in any other two perpendicular directions (say x′, y′), the value of the Laplacian $\nabla^2 f(x,y)$ remains unchanged.
763 Laplacian Edge Detector Diagram
• The Laplacian edge detector is described by the following flow diagram: image I → discrete differentiation (edge enhancement) → zero-crossing detection → edge map E
• Digital differentiation: digitally approximate $\nabla^2 I$.
• Zero-crossing detection: discover where the Laplacian crosses the zero level.
764 Reasoning Behind the Laplacian
[Plots: a 1-D edge profile; differentiated once (a peak); differentiated again (a zero crossing).]
• A zero crossing or ZC occurs near the center of the edge, where the slope of the slope changes sign.
765 Digital Twice-Difference
• For a 1-D function f(x) → f(i), we use:
$\dfrac{d}{dx}f(x)\Big|_{x=i} \approx f(i) - f(i-1) = y(i)$
and then
$\dfrac{d^2}{dx^2}f(x)\Big|_{x=i} \approx y(i+1) - y(i) = f(i+1) - 2f(i) + f(i-1)$
with convolution template: [1  −2  1]
766 Digital Laplacian
• In two dimensions:
$\nabla^2 I(x,y) \approx [I(i+1,j) - 2I(i,j) + I(i-1,j)] + [I(i,j+1) - 2I(i,j) + I(i,j-1)]$
with convolution template:
0  1  0
1 −4  1
0  1  0
767 Example: Twice-Differentiation
• Let I(i) be an image scan line, and $\nabla^2 I(i) = I(i+1) - 2I(i) + I(i-1)$. [Plots: the scan line I(i) and its twice-difference.]
• Clearly reveals a sharp edge location: a single large-slope ZC and several "smaller" zero-crossings.
768 Example: Twice-Differentiation in Noise
• Scan line J(i) = I(i) + N(i) of a noisy image: [Plots: the noisy scan line and its twice-difference.]
• Numerous spurious ZCs.
• Noise is an even bigger problem for this type of edge detector. Differentiating twice creates highly amplified noise. DEMO
769 Smoothed and Multi-Scale Laplacian Edge Detectors
• The Laplacian is too noise-sensitive to be practical - there is always noise.
• But by modifying it in a simple way, it can be made very powerful.
• The basic idea is encapsulated: image I → low-pass filter G (smooth noise, determine edge scale) → Laplacian $\nabla^2$ → detect zero crossings → edge map E
• The difference: a linear blur (low-pass filter) is applied prior to application of the Laplacian.
770 Low-Pass Pre-filter
• The main purpose of a low-pass pre-filter to the Laplacian is to attenuate (smooth) high-frequency noise while retaining the significant image structure.
• The secondary purpose of the smoothing filter is to constrain the scale over which edges are detected.
• Note: a high-pass operation (such as the Laplacian) followed (or preceded) by a low-pass filter will yield a band-pass filter, if their passbands overlap.
771 Gaussian Pre-filter
• A lot of research has been done on how the filter G should be selected.
• It has been found that the optimal smoothing filter in the following two simultaneous senses:
(i) best edge location accuracy
(ii) maximum signal-to-noise ratio (SNR)
is a Gaussian filter (K is an irrelevant constant):
$G(i,j) = K\exp\left[-\dfrac{i^2 + j^2}{2\sigma^2}\right]$
772 Laplacian-of-Gaussian Edge Detector
• Define the Laplacian-of-a-Gaussian or LoG on I:
$J(i,j) = \nabla^2[G(i,j)*I(i,j)] = G(i,j)*\nabla^2 I(i,j) = [\nabla^2 G(i,j)]*I(i,j)$
• The three forms are equivalent, since linear operations (differentiation and convolution) commute.
• Best approach: pre-compute the LoG
$\nabla^2 G(i,j) = \left[\dfrac{i^2 + j^2}{2\sigma^2} - 1\right]\exp\left[-\dfrac{i^2 + j^2}{2\sigma^2}\right]$
and convolve image I with it. (The constant multiplier is omitted since only the ZCs are of interest.)
773 Polar Form of LoG
• The LoG is isotropic and can be written in polar form:
$\nabla^2 G(r) = \left[\dfrac{r^2}{2\sigma^2} - 1\right]\exp\left[-\dfrac{r^2}{2\sigma^2}\right]$
[Plots: $\nabla^2 G(r)$ and its DFT, rotated through 360°; the LoG in space and frequency.]
774 ZC Detection
• The last stage of edge detection is zero-crossing detection.
• Let J = [J(i, j)] be the result of LoG filtering.
• A ZC is a crossing of the zero level: the algorithm must search for adjacent pixel occurrences of the forms (+, −), (−, +), (+, 0, −), and (−, 0, +).
• By convention, one sign or the other is marked as the edge location, unless higher (sub-pixel) precision is needed.
775 Scale of LoG
• The larger the value of σ used, the greater the degree of smoothing by the low-pass pre-filter G.
• If σ is large, then noise will be greatly smoothed - but so will less significant edges.
• Noise sensitivity increases as σ decreases, but the LoG edge detector then detects more detail. DEMO
776 Digital Implementation of LoG
• Use the sampled LoG:
$\nabla^2 G(i,j) = \left[\dfrac{i^2 + j^2}{2\sigma^2} - 1\right]\exp\left[-\dfrac{i^2 + j^2}{2\sigma^2}\right]$
• There are specific rules of thumb that should be followed:
• Enough of $\nabla^2 G(i,j)$ must be sampled. The LoG will not work unless the template contains both the main and minor lobes. In practice, the radius R of the LoG (in space) should satisfy R ≥ 4σ (in pixels).
• Once a LoG template is computed, its coefficients must be slightly adjusted to sum to zero (why?).
• This is done by subtracting the (averaged) total coefficient sum from each coefficient.
• The LoG will not work well unless σ ≥ 1 (pixel)! DEMO
777 Thresholding the ZCs
• Thresholding is not usually necessary if a sufficiently large operator (σ) is used.
• However, sometimes it is desired to both detect detail and not detect noise.
• This can be accomplished effectively by thresholding.
• Let J(i, j) be the LoG-filtered image and E(i, j) be the edge map.
• Then find the gradient magnitude $|\nabla J(i,j)|$ (Roberts' form will suffice, since J is smooth).
• If $|\nabla J(i,j)| > t$ = threshold and E(i, j) = 1 (a ZC exists at (i, j)), then keep the ZC. Otherwise delete it. DEMO
778 Contour Thresholding
• Problem: simple thresholding may create broken ZC contours.
• Approach: compare all ZC pixels on a ZC contour to a threshold t. If enough are above threshold, accept the contour, else reject it.
• Suppose (i1, j1), (i2, j2), (i3, j3), ..., (iL, jL) comprise an 8-connected ZC contour.
• Compute $|\nabla J(i_n, j_n)|$ for n = 1, ..., L.
• Let Q = # points such that $|\nabla J(i_n, j_n)| > t$.
• If Q/L > PERCENT, accept the entire ZC contour; if Q/L ≤ PERCENT, reject the entire ZC contour.
• Typically, PERCENT > 0.75. DEMO
779 Advantages of the LoG
• Usually doesn't require thresholding.
• Always yields single-pixel-width edges (no thinning!).
• Always yields connected edges (no edge-linking!).
• Can be shown to be optimal (under some criteria).
• Appears to be very similar to what goes on in biological vision.
780 Disadvantages of the LoG
• More computation than gradient edge detectors.
• The ZC continuity property often leads to ZCs that meander across the image.
• ZC contours tend to be over-smooth near corners.
781 Some Highly Relevant Neuroscience
782 The Retina Close Up
[Figure: cross-section of the retina, from Gray's Anatomy, with the direction of incoming light marked.] Key neurons: rods, cones, horizontal cells, bipolar cells, amacrine cells, ganglion cells.
783 Retinal Neurons
[Figure: the retinal layers traversed by light - ganglion cells, amacrine cells, bipolar cells, horizontal cells, granules, rods and cones.] The retina is about 0.025 cm thick, with about 100,000,000 photoreceptors.
• Cones are photoreceptors operative in well-lit conditions (photopic vision); they respond to colors.
• Rods are photoreceptors operative in low-light conditions (scotopic vision); they are monochromatic.
• Granules: connectivity to the next layer.
• Horizontal cells: interconnect and spatially sum either rod or cone outputs (not both).
• Bipolar cells: connect horizontal cell outputs to ganglion cells with positive ('ON') or negative ('OFF') polarity.
• Amacrine cells: believed to control the light adaptivity of photoreceptors and adjacent cells.
• Ganglion cells: spatially integrate the responses of the photoreceptors - essentially via the other cells.
784 Ganglion Cells
• Their spatial responses to visual stimuli are known as receptive fields.
• Each sums the responses of photoreceptors, receiving input from about 100 of them (with great variability).
• The receptive field response is center-surround excitatory-inhibitory, either on-center or off-center.
• This is actually a form of spatial digital filtering.
• Output goes to the visual cortex via the optic nerve, chiasm, and lateral geniculate nucleus (LGN).
785 Center-Surround Response
• Also called lateral inhibition. [Diagram: on-center (excitatory '+' center, inhibitory '−' surround) and off-center (inhibitory '−' center, excitatory '+' surround) receptive fields.]
786 Response to the Hermann Grid
• Non-foveal: little photoreceptor response; perception is dominated by the (larger) ganglion field responses. At the "intersections," excitation and lateral inhibition cancel, creating a small response - hence they appear dark. (Find the black dot!)
• Fovea: ganglion receptive fields are very small. Perception is dominated by the photoreceptors, i.e., direct luminance is perceived.
787 Similar Illusion with Color
788 Response to Mach Bands
789 Difference-of-Gaussian Approximation
• Nobel laureates R. Granit and H.K. Hartline measured ganglion receptive field responses in cats.
• They are well-approximated by a difference-of-gaussians (DOG):
$\mathrm{DOG}(i,j) = \dfrac{1}{2\pi\sigma^2}e^{-(i^2+j^2)/2\sigma^2} - \dfrac{1}{2\pi(k\sigma)^2}e^{-(i^2+j^2)/2(k\sigma)^2}$
790 Plot of 1-D DOG: note the excitatory and inhibitory lobes.
791 Plots of 2-D Spatial DOG: note the excitatory-inhibitory regions.
792 Computed Responses of DOG Filters on the Hermann Grid
793 DOG Ganglion Cells
There is a diverse distribution of DOG receptive field sizes across the retina….
794 What are DOG "Filters" for?
• Two main theories:
- Decorrelation of the input signal
- Edge enhancement/detection
795 Decorrelation
• The response of each DOG is weakly correlated (or uncorrelated) with the other DOG responses, so each captures unique information. This leads to efficient representations - similar to how image compression algorithms work. Look at these "low entropy" responses. [Figure: a small filterbank, an image, and some DOG responses.]
796 Compressibility
• Each DOG response is highly compressible.
797 LoG Closely Approximates Ganglion Receptive Fields!
[Plots: DOG, LoG, and the two overlaid.]
• If k ≈ 1.6, the DOG (previous formula) and the LoG are almost indistinguishable.
798 Improving on the LoG / DOG
799 Canny's Edge Detector
• Attempts to improve upon the LoG.
• Consider an ideal step-edge image: [Diagram: a step edge, with the derivative directions parallel (∥) and perpendicular (⊥) to the edge marked.]
• Since $\nabla^2$ is isotropic, it is equivalent to taking:
- a twice-derivative perpendicular to the edge. This conveys the edge information.
- a twice-derivative parallel to the edge. This conveys no edge information!
• In fact, if there is noise in the image, the parallel twice-derivative will give only bad information!
800 Canny's Algorithm
• Follows these basic steps:
(1) Form the Gaussian-smoothed image: K(i, j) = G(i, j)*I(i, j).
(2) Compute the gradient magnitude and orientation, $|\nabla K(i,j)|$ and $\angle\nabla K(i,j)$, using discrete (differencing) approximations.
801 Canny's Algorithm
(3) Let n be the unit vector in the direction $\nabla K(i,j)$. Compute the twice-derivative of K(i, j) in the direction n:
$\dfrac{\partial^2 K(i,j)}{\partial n^2} = \dfrac{K_{xx}(i,j)K_x^2(i,j) + 2K_{xy}(i,j)K_x(i,j)K_y(i,j) + K_{yy}(i,j)K_y^2(i,j)}{|\nabla K(i,j)|^2}$
(4) Find the ZCs of this image.
(5) These are identically the zero-crossings of $\nabla K \cdot \nabla(\nabla K \cdot \nabla K)$.
• Disadvantage: nonlinear, so edge continuity is not guaranteed. However, contour thresholding can improve performance. DEMO
802 Comments
• We will continue studying image analysis … onward to Module 8.
803 Module 8 Image Analysis II
• Superpixels
• Hough Transform
• Finding Objects and Faces
QUICK INDEX
804 SUPERPIXELS
805 Simple Linear Iterative Clustering ("SLIC")
• A simple approach to image segmentation.
• Not into objects, which is very hard, but instead into "superpixels," which are meaningful atomic regions.
• More than pixels, less than objects or whole regions.
• SLIC is particularly effective. It is really a clustering method.
• It is most effective when color is used.
806 SLIC Initialization Steps
807 SLIC Distances
808 SLIC Iteration
• Initialize: at each pixel (i, j), set l(i, j) = −1 and d(i, j) = ∞.
• Iterate: for every cluster center Ck, do the following:
- for each pixel (i, j) associated with Ck, compute the distance D between Ck and (i, j);
- if D < d(i, j), then set d(i, j) = D and set l(i, j) = k.
• Compute a new set of cluster centers (e.g., the center of mass of each cluster).
• Form an error between the old cluster centers and the new ones. If it is below a threshold, STOP.
• Or, just iterate T times. Typically T = 10 (as in the examples).
809 SLIC Examples: k = 64, 256, 1024
810 SLIC Examples: Video
811 SLIC Comments
• Really a simple variation of k-means clustering in a local application (see the sketch below).
• A very popular method!
• It is used to create intermediate features in a wide variety of computer vision problem solutions.
• As compared to the low-level features we will shortly encounter (edges, SIFT keypoints, local binary patterns, etc.).
• It is an ad hoc method (not based on a theory or perceptual principles), yet still effective.
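A sketch of the SLIC assignment/update loop just described (Python/NumPy, not from the original notes). The Lab color space, the 2S x 2S search region per center, and the color/space trade-off weight m are standard SLIC choices, but are assumptions here.

    import numpy as np

    def slic_labels(lab, centers, S, m=10.0, T=10):
        # lab: HxWx3 (e.g., Lab color) image; centers: list of
        # [l, a, b, i, j] seeds on a grid of spacing S.
        H, W, _ = lab.shape
        label = -np.ones((H, W), int)            # l(i, j) = -1
        for _ in range(T):                       # or stop when centers settle
            dist = np.full((H, W), np.inf)       # d(i, j) reset each pass
            for k, (cl, ca, cb, ci, cj) in enumerate(centers):
                i0, i1 = max(0, int(ci) - S), min(H, int(ci) + S)
                j0, j1 = max(0, int(cj) - S), min(W, int(cj) + S)
                patch = lab[i0:i1, j0:j1]
                dc2 = ((patch - np.array([cl, ca, cb])) ** 2).sum(-1)  # color
                ii, jj = np.mgrid[i0:i1, j0:j1]
                ds2 = (ii - ci) ** 2 + (jj - cj) ** 2                  # space
                D = np.sqrt(dc2 + (m / S) ** 2 * ds2)
                upd = D < dist[i0:i1, j0:j1]
                dist[i0:i1, j0:j1][upd] = D[upd]
                label[i0:i1, j0:j1][upd] = k
            for k in range(len(centers)):        # recompute centers of mass
                ys, xs = np.where(label == k)
                if len(ys):
                    centers[k] = [*lab[ys, xs].mean(0), ys.mean(), xs.mean()]
        return label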
812 HOUGH TRANSFORM
813 Hough Transform: Line and Curve Detection
• The Hough Transform is a simple, generalizable tool for finding instances of curves of a specific shape in a binary edge map.
• What it does: [Figure: an edge map and the corresponding "circle" Hough transform result.]
814 Advantages of the Hough Transform
• It is highly noise-insensitive - it can "pick" the shapes from among many spurious edges. Any edge detector gives rise to "pseudo-edges" in practice.
• It is able to reconstruct "partial" curves containing gaps and breaks to the "ideal" form.
• It can be generalized to almost any desired shape.
815 Disadvantages of the Hough Transform
• It is computation- and memory-intensive.
816 Basic Hough Transform
• Assume that it is desired to find the locations of curves that
- can be expressed as functions of (i, j), and
- have a set (vector) of parameters a = [a1, ..., an]T that specify the exact size, shape, and location of the curves.
• Thus, curves of the form f(i, j; a) = 0.
817 Curves With Parameters
• 2-D lines have a slope-intercept form f(i, j; a) = j − mi − b = 0, where a = (m, b) = (slope, j-intercept).
• 2-D circles have the form f(i, j; a) = (i − i0)² + (j − j0)² − r² = 0, where a = (i0, j0, r) = (center coordinates, radius).
818 Basic Hough Transform
• Input to the Hough Transform: an edge map (or other representation of image contours).
• From this edge map, a record (accumulator) is made of the possible instances of the shape in the image, and their likelihoods, by counting the number of edge points that fall on the shape contour.
819 Hough Example
• Straight line / line segment detection: ["Blow-up" of an edge map, with the "most likely" instance of a straight line segment in this region marked.]
• The line indicated is the "most likely" because there are seven pixels that contribute evidence for its existence.
• There are other "less likely" segments (containing 2 or 3 edge pixels).
820 Hough Accumulator
• The Hough accumulator A is a matrix that accumulates evidence of instances of the curve of interest via counting.
• The Hough accumulator A is n-dimensional, where n is the number of parameters in a = [a1, ..., an]T.
• Each parameter ai, i = 1, ..., n can take only a finite number of values Ni (the representation is digital).
• Thus the accumulator A is an N1 x N2 x ··· x Nn−1 x Nn matrix containing N1·N2·N3 ··· Nn−1·Nn slots.
821 Size of Hough Accumulator
• The accumulator A becomes very large if:
- many parameters are used, or
- parameters are allowed to take many values.
• The accumulator can be much larger than the image!
• It is practical to implement the accumulator A as a single vector (concatenated rows): otherwise many matrix entries may always be empty (if they're "impossible"), thus taking valuable space.
823 Accumulator Design
• The design of the Hough accumulator A is critical to keep its dimensions and size manageable.
• Creating a manageable Hough accumulator is an art.
• The following are general steps to follow.
824 Accumulator Design
STEP ONE - Use appropriate curve equations. For example, in line detection the slope-intercept version is poor (nearly vertical lines have large slopes).
• A better line representation: the polar form
f(i, j; a) = i cos(θ) + j sin(θ) − r = 0
where a = (r, θ) = (distance, angle). [Diagram: a line at perpendicular distance r and angle θ from the origin.]
825 Accumulator Design
STEP TWO - Bound the parameter space. Do a little math. Only allow parameters for curves that sufficiently intersect the image.
• What is "allowable" depends on the application - perhaps (for example) circles
- must lie completely inside the image
- must be of some min and max radius.
[Figures: interesting lines/circles drawn solid, irrelevant lines/circles dashed.]
826 Speaking of Circles
Which of these arcs is a piece of the largest circle?
827 Example of Accumulator Design
• In an N x N image indexed 0 ≤ i, j ≤ N−1, detect circles of the form
f(i, j; a) = (i − i0)² + (j − j0)² − r² = 0
having min radius 3 (pixels) and max radius 10 (pixels), where the circles are contained by the image.
• This could be implemented, for example, in a bus's coin counter.
• Clearly: 3 ≤ r ≤ 10, and for each r:
r ≤ i0 ≤ (N−1) − r and r ≤ j0 ≤ (N−1) − r
or else part of some of the circles will extend outside of the image.
• Note: this rectangular accumulator array leads to unused array entries.
828 Example of Accumulator Design
• In an N x N image indexed 0 ≤ i, j ≤ N−1, detect lines of the form f(i, j; a) = i cos(θ) + j sin(θ) − r = 0 that intersect the image.
• Thus each line must intersect the sides of the image in two places.
• Either the i-intercept or the j-intercept must fall in the range 0, ..., N−1:
i-intercept = r / cos(θ), j-intercept = r / sin(θ)
• So we can bound as follows: either 0 ≤ r / cos(θ) ≤ N−1 or 0 ≤ r / sin(θ) ≤ N−1.
829 Accumulator Design
STEP THREE - Quantize the accumulator space. Curves are digital, so the accumulator is a finite array.
• Decide how "finely-tuned" the detector is to be.
• For circles, the choice may be easy - let circle centers (i0, j0) and circle radii be integer values only, for example.
• For lines, the selection can be more complex - since digital lines aren't always "straight":* [Figure: digital lines at 0°, 27°, 45°, 63°, 90°, 117°, 135°, 153°, etc.]
*Even drawing straight lines digitally requires an approximation algorithm, such as the one by Bresenham or the one by Wu. Likewise, no drawn digital circle is really a circle, but rather an approximation. (These are links to Wikipedia.)
830 Example of Accumulator Design
• In an N x N image, detect circles of the form f(i, j; a) = (i − i0)² + (j − j0)² − r² = 0.
• Radius and center constraints: assume the radii and circle centers are integers, r ∈ {3, ..., 10}, and for each r the circles are completely contained within the image: r ≤ i0 ≤ (N−1) − r and r ≤ j0 ≤ (N−1) − r.
• Then the accumulator contains the following number of slots:
# circle types = $\sum_{r=3}^{10}(N - 2r)^2 = 8N^2 - 208N + 1520 \approx 8N^2$
• This is about 8 times the image size!
• Circle detection algorithms often assume just a few radii - e.g., in the coin counter example, just a few coin sizes.
831 Accumulator Design
STEP FOUR - Application:
(1) Initialize the accumulator A to 0.
(2) At each edge coordinate (i, j) [where E(i, j) = 1], increment A: for all accumulator elements a such that |f(i, j; a)| ≤ ε (ε small), set A(a) = A(a) + 1, until all edge points have been examined.
(3) Threshold the accumulator. Those parameters a such that A(a) ≥ τ represent instances of the curve being detected.
• Since local regions of the accumulator may fall above threshold, detection of local maxima is usually done.
832 Comments on the Hough Transform
• Substantial memory is required, even for simple problems such as line detection.
• Computationally intensive: all accumulator points are examined (potentially incremented) as each image pixel is examined.
• Refinements of the Hough Transform are able to deal with these problems to some degree. DEMO
833 Refining the Hough Transform
• The best way to reduce both computation and memory requirements is to reduce the size of the Hough array. Here is a (very) general modified approach that proceeds in stages:
(0) Coarsely quantize the accumulator. Define an error threshold e0 and variable thresholds tm, em.
(1) Increment the Hough array wherever (while em > e0) |f(i, j; a)| ≤ em.
(2) Threshold the Hough array using the threshold tm.
(3) Redefine the Hough array by
- only allowing values similar to those ≥ tm, and
- re-quantizing the Hough array more finely over these values.
(4) Unless em < e0, set em+1 < em and tm+1 < tm, and go to (1).
• The exact approach taken in such a strategy will vary with the application and available resources, but generally the Hough array can be made much smaller (hence faster).
834 "My students are my products!" - Paul Hough
835 VISUAL SEARCH
836 Where's The Panda?
837 Where He Got the Idea: Again … where's the Panda?
838 Luggage Nightmare
839 Find The Faces (there are 11 of them!)
840 Find The Faces and the Animals (8 hidden faces and 8 hidden animals) - a Currier and Ives advertisement from the year 1872.
841 Find the face
842 Find the face - painted by Giuseppe Arcimboldo in the year 1590.
843 Find the …
844 Find the Snow Leopard
845 Maybe this Fellow is Easier
846 TEMPLATE MATCHING
847 TEMPLATE MATCHING
• Find instances of a sub-image, object, written character, etc.
• Template matching uses special windows.
• A template is a sub-image: [Figure: template image of 'P'.]
848 Template
• Associate with a template T a window $B_T$: $T = \{T(m,n);\ (m,n)\in B_T\}$.
• As before, the windowed set at (i, j) is: $B_TI(i,j) = \{I(i+m,\,j+n);\ (m,n)\in B_T\}$
• Goal: measure the goodness-of-match of template T with the image patches $B_TI(i,j)$ for all (i, j).
849 Template Matching
• Start with the mis-match (MSE) between $B_TI(i,j)$ and T:
$\mathrm{MSE}\{B_TI(i,j),\,T\} = \sum_{(m,n)\in B_T} [I(i+m,\,j+n) - T(m,n)]^2$
$= \sum_{(m,n)\in B_T} I^2(i+m,\,j+n) - 2\sum_{(m,n)\in B_T} I(i+m,\,j+n)\,T(m,n) + \sum_{(m,n)\in B_T} T^2(m,n)$
$= E_{B_TI}(i,j) - 2\,C[B_TI(i,j),\,T] + E_T$
i.e., the local image energy, minus twice the cross-correlation of I and T, plus the template energy (a constant).
• The MSE is small when the match $C[B_TI(i,j),\,T]$ is large.
850 Normalized Cross-Correlation
• Upper bound:
$C[B_TI(i,j),\,T] = \sum_{(m,n)\in B_T} I(i+m,\,j+n)\,T(m,n) \le \sqrt{\sum_{(m,n)\in B_T} I^2(i+m,\,j+n)}\,\sqrt{\sum_{(m,n)\in B_T} T^2(m,n)} = \sqrt{E_{B_TI}(i,j)}\,\sqrt{E_T}$
with equality if and only if $I(i+m,\,j+n) = K\cdot T(m,n)$ for all $(m,n)\in B_T$.
• This follows from the Cauchy-Schwarz inequality:
$\left[\sum_{(m,n)} A(m,n)\,B(m,n)\right]^2 \le \sum_{(m,n)} A^2(m,n)\,\sum_{(m,n)} B^2(m,n)$
851 Normalized Cross-Correlation
• Let
$\hat{C}[B_TI(i,j),\,T] = \dfrac{C[B_TI(i,j),\,T]}{\sqrt{E_{B_TI}(i,j)\,E_T}}$
hence $0 \le \hat{C}[B_TI(i,j),\,T] \le 1$ for every (i, j).
• Normalized cross-correlation image: J = CORR[I, T, $B_T$], where $J(i,j) = \hat{C}[B_TI(i,j),\,T]$ for 0 ≤ i ≤ N−1, 0 ≤ j ≤ M−1.
852 [Figure: an image, an image patch used as the template, and the normalized cross-correlation map, whose peak approaches the value 1 at the match.]
853 Thresholding
• Find the largest value (best match) of J = CORR[I, T, $B_T$].
• Alternately, threshold: K(i, j) = 1 if J(i, j) ≥ t.
• If t = 1, only perfect matches will be found.
• Usually t is close to, but less than, one.
• Finding t is usually a trial-and-error exercise. DEMO
854 Limitations of Template Matching
• Template matching is quite sensitive to object variations.
• Noise, rotation, scaling, stretching, occlusion … and many other changes confound the matching.
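A direct (slow, but definition-mirroring) sketch of the normalized cross-correlation map in Python/NumPy, assuming a rectangular window $B_T$ the size of the template:

    import numpy as np

    def ncc_map(I, T, eps=1e-12):
        # Returns J(i, j) in [0, 1] by sliding sums, as in the formulas above
        I = I.astype(np.float64); T = T.astype(np.float64)
        h, w = T.shape
        H, W = I.shape
        ET = np.sqrt((T**2).sum())                 # template energy (constant)
        J = np.zeros((H - h + 1, W - w + 1))
        for i in range(J.shape[0]):
            for j in range(J.shape[1]):
                patch = I[i:i+h, j:j+w]
                C = (patch * T).sum()              # cross-correlation
                EI = np.sqrt((patch**2).sum())     # local image energy
                J[i, j] = C / (EI * ET + eps)
        return J

    # best match location: np.unravel_index(ncc_map(I, T).argmax(), ...)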
855 SIFT (Scale Invariant Feature Transform)
856 A SIFT Primer
• SIFT (Scale Invariant Feature Transform) is a powerful and popular method of visual search / object matching.*
• It can find objects that have been rotated and scaled.
• It has limited ability to match objects that have been deformed.
• We will outline the steps involved and the features that are used in SIFT.
*The brainchild of David Lowe, U. British Columbia.
857 Basic Idea
[Diagram: a new unknown object → compute SIFT features (keypoints) → feature matching against a pre-computed database of known object images, each stored with its set of SIFT features (keypoints) → identified object (if any).]
• The number of "keypoint" SIFT features may be quite large.* They are related to edges.
*Perhaps 2000 for a 512x512 image.
858 SIFT Features
• Keypoint candidates are the extrema of DoG-filtered responses. Given a gaussian at scale σ:
$G_\sigma(i,j) = \dfrac{1}{2\pi\sigma^2}\exp\left[-\dfrac{i^2 + j^2}{2\sigma^2}\right]$
• The DoG response to image I is
$D_\sigma(i,j) = [G_{k\sigma}(i,j) - G_\sigma(i,j)]*I(i,j)$
where $k = 2^{1/s}$ (often s = 2). Note: $G_{k\sigma}(i,j) - G_\sigma(i,j) \approx (k-1)\,\sigma^2\,\nabla^2 G_\sigma(i,j)$.
859 DoGs Computed Over Scales
[Diagram: image I filtered by the Gaussians $G_\sigma, G_{k\sigma}, \ldots, G_{k^{P-1}\sigma}, G_{k^P\sigma}$; adjacent Gaussian-filtered images are differenced to form the stack of DoG-filtered versions of I.]
860 Extrema Detection
• A DoG-filtered sample is an extremum if it is the largest or smallest of its 26 surrounding pixels in scale-space (8 neighbors at its own scale and 9 at each adjacent scale). [Diagram: a sample compared with its neighbors at adjacent DoG scales (close-up).]
• All extrema are found over all DoG scales.
861 Extrema Processing
• Each "extremum" keypoint is then carefully examined to see whether:
- it is actually located near/on an edge, and
- if so, whether the edge has strong enough contrast (based on a local interpolation); otherwise it is rejected.
• The remaining extrema are assigned an orientation, which is the gradient orientation of the DoG response.
• These are the final keypoints.
862 SIFT Features at Keypoints
• All keypoints for an image stored in the database are stored and associated with a location, scale, and orientation. [Depiction of histograms of gradients from local patches around each keypoint; the length of each arrow is the (distance-weighted) sum of the gradient magnitudes near that direction.]
• At the location and scale of each keypoint, the gradient magnitudes and orientations (relative to the keypoint orientation) are found at all points in a patch around the keypoint.
• These are formed into descriptors that vectorize the histograms of the gradient magnitudes/orientations from sub-patches of the larger patch.
• Lastly, these vectors are made unit-length to reduce the effects of illumination.
863 Matching
• When a new image is to be matched, the SIFT descriptors are computed from it.
• The database consists of a set of images, possibly large, each with associated SIFT descriptors.
• Matching the image descriptors to the database requires a process of search, which can be done in many possible ways. Lowe uses nearest-neighbor search (Euclidean distance).
• Lowe also uses a Hough Transform-like method to cluster and count keypoint descriptors. This matching process also uses keypoint location, orientation, and scale relative to those found in the database.
• For details, see Lowe's paper: link.
864 Lowe's Examples
[Figures: an image; the objects being searched for; the recognition results. Outer boundaries show matches; keypoints are indicated by the centers of the squares (sized by scale).]
865 Example
[Figures: the object in the database, with keypoints marked by +'s, and the object found in the image.] The object found has undergone rotation, shift, illumination, and scale change. It has also undergone an affine change in perspective.
866 Video Examples
[Videos: a magazine being moved through space; the view from a moving vehicle.] What is shown: SIFT feature tracking on two video scenes, and a modified "Affine SIFT."
867 Comments on SIFT
• SIFT utilizes both fundamental and ad hoc ideas. Yet it works very well within its domain (finding rigid objects).
• It has strong invariance to: rotations, translations, scale changes, and illumination changes.
• It has reasonable invariance to: occlusions and affine transformations.
• It has weak invariance to: objects that deform or change (like faces).
• You can get a SIFT demo program at David Lowe's website: http://www.cs.ubc.ca/~lowe/keypoints/
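A sketch of the scale-space extrema search at the heart of SIFT (Python/NumPy, not Lowe's implementation): the parameter values are assumptions, and none of the contrast/edge rejection or descriptor steps are included.

    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

    def dog_extrema(I, sigma=1.6, k=2**0.5, P=4):
        # Build DoG responses at successive scales and keep samples that are
        # the max or min of their 26 scale-space neighbors.
        I = I.astype(np.float64)
        G = [gaussian_filter(I, sigma * k**p) for p in range(P + 1)]
        D = np.stack([G[p + 1] - G[p] for p in range(P)])   # DoG stack
        mx = maximum_filter(D, size=(3, 3, 3))
        mn = minimum_filter(D, size=(3, 3, 3))
        return np.argwhere((D == mx) | (D == mn))   # (scale, i, j) candidates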
868 Speaking of Orientation….
869 Bouba and Kiki: Class Poll
870 General-Purpose Feature Extractors
• An extraordinary number of feature types have been proposed for image analysis. They are often used to find "keypoints," like SIFT.
• They are used in a great variety of visual tasks: image/object detection, recognition, matching, tracking, classification, and many more.
• Generally they are configured to extract local properties such as edges, corners, texture, change, interestingness…
• Usually the features are illumination-invariant (difference-based).
871 General-Purpose Feature Extractors
• Most are rather general-purpose, and are often a bit ad hoc with some underlying sensibility. SIFT falls in this category.
• Following are some other features, explained in their simplest forms. All can be, and have been, adapted to supply scale-invariance, rotation-invariance, and other generalizations.
• This is the classical image analysis paradigm:
- Find highly descriptive features in the image, invariant to some of: scale, orientation, illumination, illumination gradients, affine transformations (object changing pose in 3-D space), etc.
- Feed the features to a classifier or regressor to train it to conduct a task using these features.
- Application: extract the same features on new images, using the trained regressor/classifier to conduct the same task.
• This paradigm is still dominant in practice, but deep learning is changing this.
872 Harris Features
873 Harris Features
• Let $I_x(i,j) = I(i,j)*\nabla_x(i,j)$ and $I_y(i,j) = I(i,j)*\nabla_y(i,j)$, where $\nabla_x$ and $\nabla_y$ are directional difference operators, like Sobel.
• Let G(i, j) be a gaussian. Define the smoothed derivative products:
$A(i,j) = I_x^2(i,j)*G(i,j)$, $B(i,j) = I_y^2(i,j)*G(i,j)$, $C(i,j) = [I_x(i,j)\,I_y(i,j)]*G(i,j)$, and $M = \begin{bmatrix} A & C\\ C & B\end{bmatrix}$
• The Harris operator is
$\det M - k\,[\mathrm{Trace}\ M]^2 = AB - C^2 - k(A + B)^2$
874 Harris Features
[Figure: local maxima of the Harris operator on an image.]
875 SURF Features (Speeded-Up Robust Features)
876 SURF Features
• Hessian matrix:
$H = \begin{bmatrix} I_{xx}(i,j) & I_{xy}(i,j)\\ I_{xy}(i,j) & I_{yy}(i,j)\end{bmatrix}$
where $I_{ab}(i,j) = I(i,j)*G_{ab}(i,j)$ and $G_{ab}(i,j) = \dfrac{\partial^2 G}{\partial a\,\partial b}(i,j)$.
• Then the SURF feature (the authors use a "box" approximation to the gaussian for speed) is
$\det H = I_{xx}(i,j)\,I_{yy}(i,j) - [I_{xy}(i,j)]^2$
• Many variations exist ('BRIEF,' 'BRISK,' 'FREAK,' etc.).
877 SURF Features
[Figure: a SURF matching result.]
878 LBP (Local Binary Patterns)
879 LBP Features
• An efficient approach. Break an image into 16x16 blocks.
• For every pixel in a block, within its 3x3 neighborhood:
- Going clockwise from the top left, compare the center pixel with its 8 neighbors. If greater, code '0'; if not greater, code '1.'
- This creates an 8-bit code at every pixel.
- The transform of the block into these codes is called a Census Transform.
- It can be modified into a 9-bit code by comparing each neighborhood pixel with the neighborhood mean; it is then called the Modified Census Transform (MCT).
- Going the further step of forming the length-256 histogram vector of the 8-bit codes for each 16x16 block gives the LBP.
880 LBP Features
[Figure: original image, the decimal codes, and the LBP.]
881 Case Study: Face Detection
882 Detect These Faces?
883 Face Detection
884 Why is this Photograph Important?
Taken by Robert Cornelius in 1839 - it is the first "selfie"!
885 Did You Detect the Face?
886 Face Detection Using the Viola-Jones Method
• A very fast and accurate way to detect faces.
• Now a classical approach used in many digital cameras, smartphones, and so on.
• Likely the "squares around the faces" that casual users see in their viewfinders are some version of the V-J model.
• Basic Idea: use simple, super-cheap features iteratively evaluated on small image windows using a "boosted" classifier.
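Before developing the Viola-Jones machinery, here is a sketch of the census transform / LBP coding described a few slides back (Python/NumPy; the bit ordering is one reading of the clockwise convention above).

    import numpy as np

    def census_codes(block):
        # 8-bit code per pixel: compare the center with its 8 neighbors,
        # clockwise from the top-left; '1' when the neighbor is not greater.
        H, W = block.shape
        offs = [(-1,-1), (-1,0), (-1,1), (0,1), (1,1), (1,0), (1,-1), (0,-1)]
        codes = np.zeros((H - 2, W - 2), np.uint8)
        c = block[1:-1, 1:-1]
        for bit, (di, dj) in enumerate(offs):
            nb = block[1+di:H-1+di, 1+dj:W-1+dj]
            codes |= (nb <= c).astype(np.uint8) << bit
        return codes

    # LBP descriptor of a 16x16 block b:
    # np.bincount(census_codes(b).ravel(), minlength=256)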
887 Integral Image
• Given image I(i, j), define the integral image
$II(i,j) = \sum_{m=0}^{i}\sum_{n=0}^{j} I(m,n)$
• It is just the sum of all pixels within the rectangle with upper left corner (0, 0) and lower right corner (i, j). [Diagram.]
• Exercise: show the simple one-pass recursion
S(i, j) = S(i, j−1) + I(i, j);  II(i, j) = II(i−1, j) + S(i, j);  with S(i, −1) = 0 and II(−1, j) = 0.
888 Rectangular Sum Property
• Given a pre-computed integral image II(i, j) of image I(i, j), the pixel sum of I(i, j) over any rectangle D can then be found with 4 table look-ups. [Diagram: rectangles A, B, C tile the region between the origin and rectangle D.]
• SUM(D) = SUM(ABCD) − SUM(AB) − SUM(AC) + SUM(A)
889 Viola-Jones Features
• The original Viola-Jones concept computes four kinds of spatial differences of adjacent rectangle sums as features: [Diagram: two-rectangle (horizontal and vertical), three-rectangle, and four-rectangle features, with alternating '+' and '−' regions.]
• All sizes (scales), aspect ratios, and positions of each of these features are computed in 24x24 sub-windows of an image.
• There are about 180,000 of these (highly over-complete: only 24² = 576 basis functions are needed). A lot of training, but the end algorithm is very fast!
890 Basic Idea: Boosting
• A "cascade" of T very simple, weak, binary classifiers: [Diagram: all possible sub-windows enter stage 1; at each stage 1, 2, 3, ···, T, 'False' sends the sub-window to the rejected pool and 'True' passes it on; the surviving outputs are combined.]
• The first stage considers all possible sub-windows. Many are eliminated at this stage.
• Each stage considers all remaining sub-windows from the most recent stage, rejecting most (but fewer at later stages).
• Many fewer sub-windows are considered later, but with greater computation.
• A sub-window must survive all stages.
891 Weak Classifier
• For each of the many features j at each stage t, train a simple single-feature classifier $h_{t,j}$, t = 1, ..., T.
• Each weak classifier $h_{t,j}$ gives a binary decision by finding the optimal threshold $\theta_{t,j}$, so that the minimum number of 24x24 training images are misclassified (face / no face).
• Each classifier has the simple form
$h_{t,j}(I_i) = \begin{cases} 1 & \text{if } f_j(I_i) > \theta_{t,j}\\ 0 & \text{otherwise}\end{cases}$
where $f_j$ is the computed rectangle-based feature indexed by j.
892 Error Function
• Goal: minimize the error function
$\varepsilon_{t,j} = \sum_i w_{t,i}\,\left|h_{t,j}(I_i) - F(i)\right|$
over all N (24x24) training images $I_i$ (many of these), where F(i) = 1 if $I_i$ contains a face, and 0 otherwise.
• The weights $w_{t,i}$ sum to 1 at each stage t, but change/adapt over time/stage. This is called adaptive boosting (AdaBoost).
• At each stage t, choose the one classifier among $\{h_{t,j}\}$ having minimum $\varepsilon_{t,j} = \varepsilon_t$. Call it $h_t$; it is the only one used!
893 Weighting Functions
• The weights (penalties) are varied over t, else nothing different happens. In the Viola-Jones model, a typical adaptation is used. Assume N = m + n training images, where m = # negatives and n = # positives.
• First stage:
$w_{1,i} = \begin{cases} 1/2m & F_i = 0 \text{ (no face)}\\ 1/2n & F_i = 1 \text{ (face)}\end{cases}$
• Normalize at each stage:
$w_{t,i} \leftarrow \dfrac{w_{t,i}}{\sum_{j=1}^{N} w_{t,j}}$
• Update equation:
$w_{t+1,i} = w_{t,i}\left(\dfrac{\varepsilon_t}{1 - \varepsilon_t}\right)^{1 - e_i}$ where $e_i = \begin{cases} 0 & I_i \text{ classified correctly}\\ 1 & \text{otherwise}\end{cases}$
• Ad hoc, but the reasoning is sound.
894 Final "Strong" Classifier
• Use at most T features, one from each stage:
$h(I_i) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T}\alpha_t h_t(I_i) > \frac{1}{2}\sum_{t=1}^{T}\alpha_t\\ 0 & \text{otherwise}\end{cases}$ where $\alpha_t = \log\dfrac{1 - \varepsilon_t}{\varepsilon_t}$
• Many realizations and variations exist. V-J first trained their model on about 5,000 hand-labeled 24x24 face images (positives) and about 9,500 non-face images (negatives).
• At each stage, a new set of 24x24 non-face sub-windows was selected automatically from the roughly 350,000,000 contained in the approximately 9,500 non-face images.
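The integral-image bookkeeping that makes these rectangle features so cheap is tiny in code. A Python/NumPy sketch (the cumulative-sum form is equivalent to the one-pass recursion above):

    import numpy as np

    def integral_image(I):
        # II(i, j) = sum of I over the rectangle from (0, 0) to (i, j)
        return I.astype(np.float64).cumsum(0).cumsum(1)

    def rect_sum(II, i0, j0, i1, j1):
        # Pixel sum of I over rows i0..i1, cols j0..j1 via 4 table look-ups
        total = II[i1, j1]
        if i0 > 0: total -= II[i0 - 1, j1]
        if j0 > 0: total -= II[i1, j0 - 1]
        if i0 > 0 and j0 > 0: total += II[i0 - 1, j0 - 1]
        return total

Each Viola-Jones feature is then just a signed combination of a few rect_sum calls, regardless of the rectangle sizes.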
895 Examples of V-J Labeled Faces
896 Viola-Jones Face Detector
• The exemplar V-J face detector used 38 stages (T = 38).
• When demonstrated on an old 700 MHz Pentium on images of size 384x288, each image was classified in about 0.067 seconds.
• That is about half of video frame rate (15 frames/second).
• It is easy to do in real-time today - look at your smartphone / DSLR!
• The Viola-Jones face detector was the first really large-scale success of machine learning in the image analysis field.
897 Examples
• The first two features selected by their cascade: [Figure from their paper.] DEMO (someone else's version)
898 Flashed Face Illusion
Nobody understands this remarkable and disturbing illusion …
899 Comments
• The V-J face detector was perhaps the first really large-scale "success" of machine learning in the image analysis field.
• We will continue studying image analysis … onward to Module 9.
900 Module 9 Image Analysis III
• Models of Visual Cortex
• Oriented Pattern Analysis
• Iris Recognition
• Range Finding by Stereo Imaging
• Deep Stereopsis
QUICK INDEX
901 Let's revisit the visual brain…
902 Primary Visual Cortex
• Also called Area V1 or striate cortex. Much goes on here.
• Visual signals from the ganglion cells of the eye are further decomposed into orientation- and scale-tuned spatial and temporal channels.
• These are passed on to other areas which do motion analysis, stereopsis, and object recognition.
903 [Figure: the primary visual cortex (from below), from D. Hubel, Eye and Brain.]
904 Types of Cortical Neurons
• In the early 1960's, Nobel laureates D. Hubel and T. Wiesel performed visual experiments on cats.
• They inserted electrodes into the visual cortex, presented patterns to the animals' eyes, and measured the neuronal responses.
• They found two general types of neurons, termed simple cells and complex cells.
• These are defined by their spatial responses to images.
905 Simple and Complex Cells
• Simple cells are well-modeled as linear. We can model their outputs by linear convolution with the image signal.
• Complex cells receive signals from the simple cells. Their responses are nonlinear.
906 [Figure: stages on the visual pathway from the retina to the cortical cells.]
907 Simple Cell Responses
• Simple cells respond to (at least) three signal aspects:
- spatial pattern orientation, frequency, and location
- spatial binocular disparity
- spatio-temporal pattern motion
908 Simple Cell Spatial Response
• Simple cell responses have excitatory and inhibitory regions - multiple lobes.
• They appear to match spatial edges (shown) and bars. [Figure: red = excitatory, blue = inhibitory.]
909 Simple Cell Distribution
• The simple cells' spatial responses correspond to a wide range of orientations, lobe separations, and sizes. [Figures: "bar-sensitive" and "edge-sensitive" simple cells.]
• They are seemingly "spatially tuned" to detect edges and bars, thus representing images that way. They are well modeled as ...
910 2-D Gabor Function Model
• Gabor functions are Gaussian functions that modulate (multiply) sinusoids; λ controls elongation:
$g_c(x,y) = \dfrac{1}{2\pi\lambda\sigma^2}\exp\left[-\dfrac{x^2 + (y/\lambda)^2}{2\sigma^2}\right]\cos\left[2\pi\left(\dfrac{u_0x}{N} + \dfrac{v_0y}{M}\right)\right]$
$g_s(x,y) = \dfrac{1}{2\pi\lambda\sigma^2}\exp\left[-\dfrac{x^2 + (y/\lambda)^2}{2\sigma^2}\right]\sin\left[2\pi\left(\dfrac{u_0x}{N} + \dfrac{v_0y}{M}\right)\right]$
• The "cosine" and "sine" Gabors are in 90° quadrature and model the bar and edge simple cells, respectively.
911 Phase Quadrature
[Figures: "edge-sensitive" and "bar-sensitive" profiles.]
• The simple cells appear to occur in phase-quadrature pairs - 90° out of phase.
912 Quadrature Gabor Pair
[Figures: the "bar-sensitive" cosine Gabor and the "edge-sensitive" sine Gabor.]
• The simple cells appear with widely variable orientations, elongations, and sizes.
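A sketch of a quadrature Gabor pair following the model above (Python/NumPy). It uses the same reconstruction of the elongation term as the formulas above, so treat the exact parameterization as an assumption; for filtering, only the relative shape matters.

    import numpy as np

    def gabor_pair(sz, sigma, lam, u0, v0, N, M):
        # Cosine ("bar") and sine ("edge") Gabor templates of half-size sz
        y, x = np.mgrid[-sz:sz+1, -sz:sz+1].astype(np.float64)
        env = np.exp(-(x**2 + (y/lam)**2) / (2*sigma**2)) / (2*np.pi*lam*sigma**2)
        phase = 2*np.pi*(u0*x/N + v0*y/M)     # modulating sinusoid
        return env*np.cos(phase), env*np.sin(phase)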
913 Phasor Form of Gabor
• Easy to manipulate:
$g(x,y) = g_c(x,y) + \sqrt{-1}\,g_s(x,y) = \dfrac{1}{2\pi\lambda\sigma^2}\exp\left[-\dfrac{x^2 + (y/\lambda)^2}{2\sigma^2}\right]\exp\left[2\pi\sqrt{-1}\left(\dfrac{u_0x}{N} + \dfrac{v_0y}{M}\right)\right]$
• A natural representation of simple cell pairs.
• The Fourier transform is simply a shifted Gaussian:
$\tilde{G}(u,v) = \exp\left[-2\pi^2\sigma^2\left[(u - u_0)^2 + \lambda^2(v - v_0)^2\right]\right]$
914 Digital Form
$g(i,j) = g_c(i,j) + \sqrt{-1}\,g_s(i,j) = \dfrac{1}{2\pi\lambda\sigma^2}\exp\left[-\dfrac{i^2 + (j/\lambda)^2}{2\sigma^2}\right]\exp\left[2\pi\sqrt{-1}\left(\dfrac{u_0i}{N} + \dfrac{v_0j}{M}\right)\right]$
• Similar design considerations as for the LoG.
915 Minimum Uncertainty
• Amongst all complex functions, and in any dimension, Gabor functions uniquely minimize the uncertainty principle:
$\dfrac{\int |x\,f(x)|^2\,dx}{\int |f(x)|^2\,dx}\cdot\dfrac{\int |u\,F(u)|^2\,du}{\int |F(u)|^2\,du} \ge \dfrac{1}{(4\pi)^2}$
(and similarly for y, v).
• They have minimal simultaneous space-frequency duration.
• Given a bandwidth, they can perform the most localized spatial analysis.
916 Fourier Transform Magnitudes of Gabors
[Figure: sine and cosine Gabors have the same FT magnitudes; the phasor form occupies only a half plane.]
917 Gabor Function History
• Gabor functions were first studied in the context of information theory by Nobel laureate D. Gabor.
• In 1980, S. Marcelja noted that the simple cell receptive fields measured along 1-D are well-modeled by 1-D Gabor functions, which have optimal space-frequency localization.
• In 1981, D.A. Pollen and S.F. Ronner showed that the simple cells often appear in phase-quadrature pairs, which are natural for Gabor functions.
• In 1985, J. Daugman observed that the receptive field profiles fit 2-D Gabor functions and have optimal 2-D space-frequency localization.
918 Gabor Filterbanks
• The exact arrangement of the simple cells is not known, but all orientations are possible (horizontal/vertical more common).
• The cortex is layered. Simple cells that are in different layers but similar positions come from the same eye and are responsive to similar orientations. These are called ocular dominance columns.
• Bandwidths in the range 0.5 octave - 2.5 octaves are common.
• In engineering, constant-octave filterbanks covering the plane are common.
919 Gabor Filterbanks
[Figures: a constant-octave Gabor filter bank (cosine form) with nine orientations and four filters/orientation; a constant-octave Gabor filter bank (phasor form) with eight orientations and five filters/orientation.]
920 The Image Modulation Model
(where we model an image as a sine wave!)
$I(x,y) \approx A\cos[\phi(x,y)]$
921 FM Approximation
• We'll do this in 1-D for simplicity. A 1-D continuous Gabor:
$g(x) = g_c(x) + \sqrt{-1}\,g_s(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\,e^{-x^2/2\sigma^2}\,e^{2\pi\sqrt{-1}\,u_0x}, \qquad G(u) = e^{-2\pi^2\sigma^2(u - u_0)^2}$
• 1-D FM image: $I(x) = A\cos[\phi(x)]$. Assuming the instantaneous frequency $\dot\phi(x)/2\pi$ changes slowly, then
$g(x)*I(x) \approx \dfrac{A}{2}\,G\left[\dfrac{\dot\phi(x)}{2\pi}\right]\exp[\sqrt{-1}\,\phi(x)]$
• This contains the responses of both the cos Gabor and sin Gabor filters.
922 2-D FM Approximation
• 2-D continuous Gabor: $g(x,y) = g_c(x,y) + \sqrt{-1}\,g_s(x,y) \leftrightarrow \tilde{G}(u,v)$
• 2-D FM image: $I(x,y) = A\cos[\phi(x,y)]$. Assuming the instantaneous frequencies $\phi_x(x,y)/2\pi$, $\phi_y(x,y)/2\pi$ change slowly, then
$g(x,y)*I(x,y) \approx \dfrac{A}{2}\,\tilde{G}\left[\dfrac{\phi_x(x,y)}{2\pi},\,\dfrac{\phi_y(x,y)}{2\pi}\right]\exp[\sqrt{-1}\,\phi(x,y)]$
• This contains the responses of both the cos Gabor and sin Gabor filters.
923 FM Demodulation
• 1-D Gabor-filtered FM image demodulation:
$|g(x)*I(x)| \approx \dfrac{A}{2}\,G\left[\dfrac{\dot\phi(x)}{2\pi}\right]$
• 2-D Gabor-filtered FM image demodulation:
$|g(x,y)*I(x,y)| \approx \dfrac{A}{2}\,\tilde{G}\left[\dfrac{\phi_x(x,y)}{2\pi},\,\dfrac{\phi_y(x,y)}{2\pi}\right]$
924 Energy Model for Quadrature Simple Cell Demodulation
[Diagram: the image drives a cosine Gabor and a sine Gabor in parallel; each output is squared, and the two are summed (Σ).]
925 Complex Cells
• Less well-understood; their responses are nonlinear.
• Viewed as accepting simple cell responses as inputs, and processing them in ways that are not yet understood.
• Their simplest task might be the demodulation process on the previous slide.
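The "energy model" of slide 924 is a one-liner given a quadrature pair such as the gabor_pair sketch above (Python/NumPy, not from the original notes):

    import numpy as np
    from scipy.ndimage import convolve

    def gabor_energy(I, gc, gs):
        # Square and sum the cosine- and sine-Gabor responses; by the FM
        # approximation the result is ~ (A/2)^2 G^2 evaluated at the local
        # instantaneous frequency, i.e., a phase-insensitive demodulation.
        rc = convolve(I.astype(np.float64), gc)   # cosine (bar) response
        rs = convolve(I.astype(np.float64), gs)   # sine (edge) response
        return rc**2 + rs**2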
While we do not know how the vision system processes the spatial simple cell (Gabor-like) responses, we can certainly use the model creatively for image processing. 927
Using V1 Cortical Cell Models • Massive decomposition of space-time visual data provides optimally localized atoms of pattern, pattern change, and pattern disparity information. • This information is passed en masse to other brain centers – to accomplish other processing. 928
Spatial Orientation Estimation • Now assume an image has a strong oriented component modeled as an FM function I(x) = A \cos[\phi(\mathbf{x})], \mathbf{x} = (x, y) • Like the sine wave gratings discussed earlier, but the local frequencies and orientations change. • Goal: Estimate \nabla\phi(\mathbf{x}) = (\phi_x(\mathbf{x}), \phi_y(\mathbf{x})) 929
Image with strong FM component(s) 930
Frequency Modulation • The instantaneous frequencies define the direction and rate of propagation of the pattern:
|\nabla\phi(\mathbf{x})| = \sqrt{\phi_x^2(\mathbf{x}) + \phi_y^2(\mathbf{x})}, \qquad \angle\nabla\phi(\mathbf{x}) = \tan^{-1}[\phi_y(\mathbf{x}) / \phi_x(\mathbf{x})] 931
Demodulation • Simple approach: (1) Pass the image through a bank of complex Gabor filters: ti(x) = I(x)*gi(x); i = 1, …, K (2) At each x find the largest magnitude response (demodulate): i# = arg maxi |ti(x)|, t#(x) = ti#(x) (3) Then
\nabla\phi(\mathbf{x}) \approx \mathrm{Re}\left[\frac{\nabla t^{\#}(\mathbf{x})}{\sqrt{-1}\; t^{\#}(\mathbf{x})}\right] 932
Example Image Orientation and frequency needle map Reconstructed FM component 933
Example Image Orientation and frequency needle map Reconstructed FM component 934
Segmentation by Clustering Multi-partite image Reconstructed FM component Orientation and frequency needle map Segmentation by k-means clustering 935
Comments • The preceding algorithm utilizes the linear filter responses which closely model V1 simple cell responses. • The algorithm also uses nonlinear processing of many simple cell responses, which could be similar to V1 complex cell processing. • The image modulation model can be made much more general:
I(x, y) = \sum_{i=1}^{K} A_i \cos[\phi_i(x, y)] 936
K=43 Components 937
K=43 Components 938
K=43 Components 939
K=43 Components 940
K=6 (only!) Components 941
The 6 Components (the six component images, summed to reconstruct the FM content) 942
943
Iris Recognition 944
Flow of Visual Data Area V5 or MT Dorsal stream Area V1 Ventral stream (object recognition, long-term memory) LGN 945
Ventral Stream • The ventral stream (also called the “what pathway”) begins with Area V1, and lands in the inferior temporal gyrus. • Much shared processing occurs between V1, V2 (on the way to the ITG), and the ITG. • These areas are devoted to object recognition and memory, and are closer to consciousness. 946
Inferior Temporal Gyrus V1 Also called Inferior Temporal (IT) Cortex 947
Inferior Temporal Gyrus (Fusiform Gyrus) • Has available the entire retinotopic map from Area V1. • Considerable feedback back and forth with V1. • Devoted to visual memory and visual object recognition. • Not much is known about “how” it’s done. • Computational algorithms for object recognition using V1 primitive models have proved effective. 948
Biometrics: Recognition of Iris 949
Daugman’s Iris Recognition System • One of the earliest approaches to recognition using V1 primitives was J. Daugman’s iris recognition system. • It uses spatial V1 (Gabor) model responses to perform an important biometric application: recognition of the iris of the eye. • This far-seeing invention is still the industry standard of performance. 950
Iris Recognition (1) Detect and extract the iris and “unroll” it 951
Iris Recognition (2) Filter each row with 1-D cosine (top) and sine Gabor filters, and binarize the result. 952
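A minimal sketch of the filtering/binarization step just shown, with the comparison step of the next slide included for completeness. This is a simplified stand-in for Daugman's actual encoder (which uses 2-D Gabor wavelets at multiple scales); the filter tuning and code layout are assumptions for illustration only.

import numpy as np

def iris_code(unrolled, sigma=4.0, u0=0.12):
    # unrolled: 2-D array, rows = radial samples, cols = angular samples.
    # Filter each row with a complex 1-D Gabor (cosine + sqrt(-1)*sine).
    x = np.arange(-16, 17, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2)) * np.exp(2j * np.pi * u0 * x)
    resp = np.stack([np.convolve(row, g, mode="same") for row in unrolled])
    # Binarize: keep only the signs of the real and imaginary parts,
    # giving two bits per sample.
    return np.stack([resp.real >= 0, resp.imag >= 0], axis=-1)

def hamming_distance(code_a, code_b):
    # Normalized Hamming distance, as defined on the next slide.
    return np.count_nonzero(code_a ^ code_b) / code_a.size

# Identical codes give H = 0; two independent random codes give H near 0.5.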
Iris Recognition (3) Compare binary rows with the binary reference of the iris stored in a database, using the Hamming distance:
H(A, B) = \frac{\mathrm{cardinality}(A \oplus B)}{\mathrm{cardinality}(A)}, \quad \text{declaring a match if } H(A, B) < t
• If A = B, then H(A, B) = 0. • By random chance, H(A, B) ≈ 0.5. • For t = 0.3, the typical false positive rate is < 10^{-7}. 953
Seeing in 3-D • Three-dimensional vision involves many modes of perception: – Relative size – Stereo vision – Motion parallax – Occlusion – Foreshortening, convergence of lines, etc. etc. etc. • A famous illusion on this theme … the Ames Room … the eye-brain can be fooled by 3D cues! • We will study a main mode: stereo vision 954
RANGE FINDING BY STEREO IMAGING • Next we will study a significant computer vision application: stereo vision. • The goal is to compute a 3-D map using multiple cameras. • We will introduce the stereo camera geometry and the concept of triangulation. • At the core of stereo vision is a very hard problem that is called The Correspondence Problem. • Of course, humans see in stereo quite well …. (click here) 955
Stereo Camera Geometry and Triangulation • Begin with the geometry for perspective projection: Z Y y f = focal length x image plane X lens center (X, Y, Z) = (0, 0, 0) • (X, Y, Z) are points in 3-D space • The origin (X, Y, Z) = (0, 0, 0) is the lens center • (x, y) denote 2-D image points • The x - y plane is parallel to the X - Y plane • The optical axis passes through both origins • The image plane is one focal length f from the lens 956
Relationship Between 3-D and 2-D Coordinates • We found that a point (X, Y, Z) in 3-D space projects to a point (x, y) = \frac{f}{Z}(X, Y) in the 2-D image, where f = focal length and f/Z = magnification factor. • We will now modify the camera geometry. First, we'll shift the camera to the left and right along the X-axis (in real space). • We will then consider the case where we have two cameras, equidistant from the origin (X, Y, Z) = (0, 0, 0). • By relating the projections (images) from the two cameras, we will find that it is possible to deduce 3-D information about the objects in the scene. 957
Shifting the Camera to the Left • Suppose that we shift the camera (lens center) along the X-axis by an amount -D: Y y´ image plane f = focal length x´ X lens center (X, Y, Z) = (-D, 0, 0) • Note: the coordinates of the image are now denoted (x′, y′). • Now a point (X, Y, Z) in 3-D space projects to a point (x', y') = \frac{f}{Z}(X + D, Y) in the left-shifted image. 958
Shifting the Camera to the Right • Suppose we shift the camera (lens center) along the X-axis by an amount +D: Z Y y´´ image plane f = focal length x´´ X lens center (X, Y, Z) = (+D, 0, 0) • The coordinates of the image are now denoted (x″, y″). • Now a point (X, Y, Z) in 3-D space projects to a point (x'', y'') = \frac{f}{Z}(X - D, Y) in the right-shifted image. • Note that the optical axis is still parallel to the Z-axis. 959
Binocular Camera Geometry • Now suppose that we place two cameras (with the same focal lengths f) at a distance 2D apart (the baseline distance), with parallel optical axes: Z Y left (x´, y´) image plane f = focal length right (x´´, y´´) image plane X left lens center (X, Y, Z) = (-D, 0, 0) right lens center (X, Y, Z) = (+D, 0, 0) • This is a parallel or nonconvergent binocular camera geometry. • Suppose we are able to identify the images of an object feature (a point) whose coordinates are (X0, Y0, Z0) in 3-D space. • We could do this, e.g., interactively, by pointing a mouse at the image of the object feature in both of the images and determining the coordinates. 960
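Before deriving triangulation, the two projection formulas above can be sanity-checked numerically. A tiny sketch, with assumed values for f, D, and a test point (the printed disparity already hints at the 2Df/Z relation derived next):

# Parallel binocular projection (per the two camera-shift slides above).
f, D = 0.05, 0.10           # focal length and half-baseline, meters (assumed)
X, Y, Z = 0.4, 0.2, 2.0     # an assumed test point in 3-D world coordinates

x_left,  y_left  = (f / Z) * (X + D), (f / Z) * Y   # left camera at (-D, 0, 0)
x_right, y_right = (f / Z) * (X - D), (f / Z) * Y   # right camera at (+D, 0, 0)

dx = x_left - x_right       # horizontal disparity
dy = y_left - y_right       # vertical disparity: exactly 0 in this geometry

print(dx, 2 * D * f / Z)    # both print 0.005: disparity = 2Df/Z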
Triangulation • The projections of the point (X0, Y0, Z0) in the two camera images are
(x_0', y_0') = \frac{f}{Z_0}(X_0 + D, Y_0) \qquad \text{and} \qquad (x_0'', y_0'') = \frac{f}{Z_0}(X_0 - D, Y_0)
• The horizontal and vertical disparities between the images of (X0, Y0, Z0) are:
\Delta x_0 = x_0' - x_0'' = \frac{f}{Z_0}[(X_0 + D) - (X_0 - D)] = \frac{2Df}{Z_0} \qquad \Delta y_0 = y_0' - y_0'' = \frac{f}{Z_0}(Y_0 - Y_0) = 0
• In a non-convergent system, the vertical disparity is always zero. • The horizontal disparity is extremely useful. We can solve for Z0 in terms of it: Z_0 = \frac{2Df}{\Delta x_0} 961
Ramifications • The triangulation equation Z_0 = 2Df/\Delta x_0 implies that we can compute the distance from the baseline (hence to anywhere) to any point in the scene that is visible to both cameras .... • ... provided we can find its horizontal image coordinates x0′ and x0″. • This follows since the camera focal length f is known, the baseline separation 2D is known, and the (X, Y, Z) coordinate system is known (it's defined by the camera placement). • This approach is used in aerial stereo-photogrammetry using two cameras a known distance apart (e.g., on the wingtips). • Historically, a human operator would painstakingly identify matching points between the images, measure their coordinates, and compute depths via triangulation. • There is an old device called a stereoplotter that was used for this. 962
Simple Cell Disparity Responses • Back to visual cortex…. • A small percentage of simple cells appear to be sensitive to disparities between the signals coming from the two retinae. • These may accept input from simple cell pairs tuned to corresponding locations on the two retinae. • Likely this is used in 3D depth perception. 963
Stereopsis is Computational random-dot stereogram • Cross your eyes …. • Or “relax” them: random-line stereogram • As proved by B. Julesz’ random dot and line stereograms. • A similar visual aid … click here. 964
An Autostereogram Remember those “Magic Eye” books? 965
The Depth Map 966
In Motion! 967
Stereo (Binocular) Geometry • The vergence angle depends on object distance. • Shown are angular and positional disparities. • The simple cells appear to respond to the positional disparities. • But the vergence signals are also sent to the depth processing areas of the brain. They are part of the oculomotor control system. 968
Disparity-Sensitive Simple Cells Spatial-only simple cells Disparity-sensitive simple cells 969
Phase-Based Stereo • Recall the 2-D (phasor) Gabor filter model:
g(\mathbf{x}) = K\, e^{-[x^2 + (y/\lambda)^2]/2\sigma^2}\, e^{2\pi\sqrt{-1}(u_0 x + v_0 y)}
• Find the filterbank responses ti(x) = I(x)*gi(x); i = 1, …, K • Take the largest magnitude response: i* = arg maxi |ti(x)|, t*(x) = ti*(x) 970
Left and Right Responses • Find the largest response for both the left and right camera images:
t_L^*(\mathbf{x}_L) \approx \frac{A}{2}\, G[\nabla\phi_L(\mathbf{x}_L)]\, \exp[\sqrt{-1}\,\phi_L(\mathbf{x}_L)] \qquad t_R^*(\mathbf{x}_R) \approx \frac{A}{2}\, G[\nabla\phi_R(\mathbf{x}_R)]\, \exp[\sqrt{-1}\,\phi_R(\mathbf{x}_R)]
• Demodulate to obtain \phi_L(\mathbf{x}_L) and \phi_R(\mathbf{x}_R):
\phi_L(\mathbf{x}_L) = \cos^{-1}\mathrm{Re}\left[\frac{t_L^*(\mathbf{x}_L)}{|t_L^*(\mathbf{x}_L)|}\right] \qquad \phi_R(\mathbf{x}_R) = \cos^{-1}\mathrm{Re}\left[\frac{t_R^*(\mathbf{x}_R)}{|t_R^*(\mathbf{x}_R)|}\right]
• Stereo assumption: \phi_L(\mathbf{x}_L) - \phi_R[\mathbf{x}_L - \Delta\mathbf{x}(\mathbf{x}_L)] = 0 971
Left and Right Responses • Here is the (truncated) 2-D Taylor formula:
f(x_0 + \Delta x,\, y_0 + \Delta y) \approx f(x_0, y_0) + f_x(x_0, y_0)\Delta x + f_y(x_0, y_0)\Delta y \quad \text{or} \quad f(\mathbf{x}_0 + \Delta\mathbf{x}) \approx f(\mathbf{x}_0) + \Delta\mathbf{x}^T \nabla f(\mathbf{x}_0)
• Taylor approximation [Note: \Delta\mathbf{x}(\mathbf{x}_L) = (\Delta x(\mathbf{x}_L), 0)]:
\phi_L(\mathbf{x}_L) \approx \phi_R(\mathbf{x}_L) - \Delta\mathbf{x}(\mathbf{x}_L)^T \nabla\phi_R(\mathbf{x}_L)
Algorithm! Solving for the horizontal disparity:
\Delta x(\mathbf{x}_L) = \frac{\phi_R(\mathbf{x}_L) - \phi_L(\mathbf{x}_L)}{\partial\phi_R(\mathbf{x}_L)/\partial x} 972
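A 1-D sketch of the phase-based estimate just derived, on synthetic scanlines. For simplicity it uses a single complex Gabor rather than a filterbank, and approximates the local frequency ∂φ_R/∂x by the filter's tuned frequency 2πu0; the shift, filter tuning, and smoothing are all assumptions.

import numpy as np

# Synthetic left/right "scanlines": left(x) = right(x + 3), a 3-sample shift.
rng = np.random.default_rng(0)
base = np.convolve(rng.standard_normal(1200), np.ones(9) / 9, mode="same")
left, right = base[3:1103], base[:1100]

# One shared complex Gabor filter (assumed tuning).
sigma, u0 = 12.0, 0.08
x = np.arange(-64, 65, dtype=float)
g = np.exp(-x**2 / (2 * sigma**2)) * np.exp(2j * np.pi * u0 * x)
tL = np.convolve(left, g, mode="same")
tR = np.convolve(right, g, mode="same")

# Phase difference divided by local frequency ~ horizontal disparity.
dphi = np.angle(tL * np.conj(tR))         # wrapped phase difference
disparity = dphi / (2 * np.pi * u0)       # crude stand-in for d(phi_R)/dx
print(np.median(disparity[100:-100]))     # close to the true 3-sample shift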
973
Phase-Based Stereo Example “Turf” Brighter = nearer 974
Phase-Based Stereo Example “Tree trunk” Brighter = nearer 975
Phase-Based Stereo Example “Baseball” Brighter = nearer 976
Phase-Based Stereo Example “Doorway” Brighter = nearer 977
Phase-Based Stereo Example “Stone” Brighter = nearer 978
Phase-Based Stereo Example “Pentagon” Brighter = nearer 979
Case Study: Deep Stereopsis 980
Deep Stereopsis • Uses the idea of a “siamese” network: dual, parallel networks with shared weights (that are thus trained together and equal). General model, from the paper here:* *Zbontar et al., “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, 2016. 981
Deep Stereopsis • Applied to patches by maximizing a similarity score between potential patch matches. • Trained on “Kitti,” a large dataset with ground truth depths measured by a terrestrial lidar range scanner. • Also on the small Middlebury Stereo dataset. • Basic aspects: – Cross-entropy loss on the binary decision match / no match – The matching cost is combined over “cross-based” neighboring windows to handle depth discontinuities – A disparity smoothness criterion modifies the loss – Subpixel enhancement and post-filtering with median and bilateral filters *Zbontar et al., “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, 2016. 982
Cross-Based Cost • Define a support region of four “arms” at each pixel p: left, right, top, bottom. • Example: The left arm of p consists of those pixels pL satisfying
|I(\mathbf{p}) - I(\mathbf{p}_L)| < D_1 \quad \text{and} \quad \|\mathbf{p} - \mathbf{p}_L\| < D_2
• Similar for the other three arms. 983
Cross-Based Cost • Consider a pixel p in the left image, and a pixel p − d in the right image, shifted by disparity d. • Let UL and UR indicate support regions in the left and right images: UL(p) is the support region at p in the left image and UR(p − d) is the support region at p − d in the right image. • The “combined” left-right support region is
U_d(\mathbf{p}) = \{\mathbf{q} : \mathbf{q} \in U_L(\mathbf{p}),\; \mathbf{q} - \mathbf{d} \in U_R(\mathbf{p} - \mathbf{d})\}
• When training, run the siamese networks once, and the FC layers d times, where d = maximum allowed disparity. 984
Matching Cost • The matching cost is (iteratively) averaged over the combined region:
C^0(\mathbf{p}, d) = \mathrm{CNN}(\mathbf{p}, d), \qquad C^i(\mathbf{p}, d) = \frac{1}{|U_d(\mathbf{p})|} \sum_{\mathbf{q} \in U_d(\mathbf{p})} C^{i-1}(\mathbf{q}, d)
where, for patches PL(p), PR(p − d): CNN(p, d) = cross-entropy[PL(p), PR(p − d)] • Iterate to obtain C^4(\mathbf{p}, d) 985
Disparity Smoothness • A very old concept in computational stereo. • An additional (smoothness) cost on the disparity function D(p):
E(D) = \sum_{\mathbf{p}} \Big( C^4(\mathbf{p}, D(\mathbf{p})) + \sum_{\mathbf{q} \in N_\mathbf{p}} \lambda_1 \mathbf{1}\{|D(\mathbf{p}) - D(\mathbf{q})| = 1\} + \sum_{\mathbf{q} \in N_\mathbf{p}} \lambda_2 \mathbf{1}\{|D(\mathbf{p}) - D(\mathbf{q})| > 1\} \Big)
where 1{·} = binary set indicator function, Np = neighborhood of p, and λ1, λ2 are weights 986
Final Depth Prediction • Optimize: D(\mathbf{p}) = \arg\min_d C(\mathbf{p}, d) • To correct remaining errors and improve precision the authors – Interpolate to correct mismatch errors – Compute sub-pixel depth estimates by interpolation – Fix final errors by median and bilateral filtering • Feel free to read about these ad hoc refinements in the paper. 987
Examples on Kitti • For a while this was the top model on Kitti (naturally, refinements and new models have surpassed it). • But it remains a relatively simple and fundamental high-performing example of deep learning depth from stereo. 988
Examples on Kitti 989
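To make the siamese idea of this case study concrete, here is a skeletal PyTorch sketch: a single small convolutional tower applied to both patches (so the weights are shared by construction), a similarity score, and a match/no-match cross-entropy loss. The layer sizes, 9x9 patches, dot-product similarity head, and loss wiring are illustrative assumptions, not the exact configuration from the Zbontar et al. paper.

import torch
import torch.nn as nn

class SiameseTower(nn.Module):
    # One conv tower; calling it on both patches shares the weights.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3), nn.ReLU(),
            nn.Conv2d(32, 32, 3), nn.ReLU(),
            nn.Conv2d(32, 64, 3), nn.ReLU(),
        )
    def forward(self, patch):                  # patch: (B, 1, 9, 9)
        feat = self.net(patch).flatten(1)      # 9x9 -> 3x3 after three convs
        return nn.functional.normalize(feat, dim=1)

tower = SiameseTower()
head = nn.Linear(1, 1)                         # similarity -> match logit
loss_fn = nn.BCEWithLogitsLoss()               # binary match / no-match loss

left = torch.randn(8, 1, 9, 9)                 # toy left patches
right = torch.randn(8, 1, 9, 9)                # candidate right patches
labels = torch.randint(0, 2, (8, 1)).float()   # 1 = correct disparity

sim = (tower(left) * tower(right)).sum(1, keepdim=True)  # cosine similarity
loss = loss_fn(head(sim), labels)
loss.backward()                                # gradients flow into the shared tower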
Deep Learning In Context • Deep learning models sometimes approach human performance on a wide variety of visual tasks. • Yet they can fall apart entirely. • A very reasonable outlook on their limitations is given here. A top DNN learner was >99.6% confident of these classifications. 990
Comments • That’s all, folks! Thanks for coming! 991