Mata kuliah: T0283 - Computer Vision
Tahun: 2010

Lecture 10: Pattern Recognition and Classification II

Learning Objectives
After carefully listening to this lecture, students will be able to:
- Demonstrate the use of the PCA technique in face recognition.
- Explain the real-time robust object detection procedure developed by Viola and Jones.

January 20, 2010, T0283 - Computer Vision

Feature Selection
What features should we use, and how do we extract them from the image? Using the images themselves as feature vectors is easy, but suffers from high dimensionality: a 128 x 128 image yields a 16,384-dimensional feature space! What do we know about the structure of the categories in feature space? Intuitively, we want features that result in well-separated classes.

Dimensionality Reduction
Functions yi = yi(x) can reduce the dimensionality of the feature space, giving more efficient classification. If chosen intelligently, we lose little information and classification becomes easier. Common methods:
- Principal components analysis (PCA): projection maximizing the total variance of the data.
- Fisher's Linear Discriminant (FLD): maximize the ratio of between-class variance to within-class variance.

Geometric Interpretation of Covariance
The covariance C = X X^T can be thought of as a linear transform that redistributes the variance of a unit normal distribution, where the zero-mean data matrix X is n (number of dimensions) x d (number of points). (Adapted from Z. Dodds.)

Geometric Factorization of Covariance
The SVD of the covariance matrix, C = R^T D R, describes the geometric components of this transform by extracting a diagonal scaling matrix D (the major and minor axis lengths) and a rotation matrix R (the "best" and second-best axes). E.g., given the points

  X = | 2  -2   1  -1 |
      | 5  -5  -1   1 |

the covariance factors as

  C = | 2.5   5 |  =  | .37  -.93 | | 15   0 | |  .37  .93 |
      |  5   13 |     | .93   .37 | |  0  .5 | | -.93  .37 |

where the entries of R are the cosine and sine of roughly 70 degrees, and the diagonal entries 15 and 0.5 are the major and minor axis lengths.

PCA for Dimensionality Reduction
Any point in the n-dimensional feature space can be expressed as a linear combination of the n eigenvectors (the rows of R) via a set of weights [α1, α2, ..., αn]; this is just a change of coordinate system. By projecting points onto only the first k << n principal components (the eigenvectors with the largest eigenvalues), we are essentially throwing away the least important feature information.

Projection onto Principal Components
[Figure: full n-dimensional space (here n = 2) projected onto a k-dimensional subspace (here k = 1). Adapted from Z. Dodds.]

Face Recognition
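The covariance factorization above is easy to check numerically. A minimal NumPy sketch, using the 2 x 2 covariance matrix from the slide's example:

```python
import numpy as np

# Covariance matrix from the slide's example
C = np.array([[2.5, 5.0],
              [5.0, 13.0]])

# For a symmetric matrix, eigh returns eigenvalues in ascending order
# and the corresponding orthonormal eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)  # approx [0.5, 15.0]: the minor and major axis lengths

# Reconstruct C = R^T D R, where the rows of R are the eigenvectors
R = eigenvectors.T
D = np.diag(eigenvalues)
print(np.allclose(R.T @ D @ R, C))  # True
```

The eigenvalues 15 and 0.5 match the diagonal of D in the factorization above, and the eigenvector rows of R match the rotation matrix (up to sign).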
[Figure: 20 faces (i.e., classes), 9 examples (i.e., training data) of each.]

Simple Face Recognition
Idea: search over the training set for the most similar image (e.g., in the SSD sense) and choose its class. This is the same as a 1-nearest-neighbor classifier when feature space = image space. Issues: large storage requirements (nd, where n = image space dimensionality and d = number of faces in the training set), and correlation is computationally expensive.

Eigenfaces
Idea: compress image space to "face space" by projecting onto the principal components ("eigenfaces" = eigenvectors of image space). Represent each face as a low-dimensional vector (its weights on the eigenfaces) and measure similarity in face space for classification. Advantage: storage requirements are (n + d) k instead of nd.

Eigenfaces: Initialization
Calculate the eigenfaces:
- Compute the n-dimensional mean face Ψ.
- Compute the difference of every face from the mean face: Φj = Γj - Ψ.
- Form the covariance matrix of these differences: C = A A^T, where A = [Φ1, Φ2, ..., Φd].
- Extract eigenvectors ui from C such that C ui = λi ui.
- The eigenfaces are the k eigenvectors with the largest eigenvalues.
[Example eigenfaces shown on slide.]

Eigenfaces: Initialization (cont.)
Project the faces into face space: get the eigenface weights for every face in the training set. The weights [ωj1, ωj2, ..., ωjk] for face j are computed via dot products: ωji = ui^T Φj.

Calculating Eigenfaces
The obvious way is to perform an SVD of the covariance matrix, but this is often prohibitively expensive: e.g., for 128 x 128 images, C is 16,384 x 16,384. Instead, consider the eigenvector decomposition of the much smaller d x d matrix A^T A: A^T A vi = λi vi.
Multiplying both sides on the left by A, we have A A^T (A vi) = λi (A vi). So ui = A vi are eigenvectors of C = A A^T.

Eigenfaces: Recognition
- Project the new face into face space.
- Classify: assign the class of the nearest face from the training set. Or, precalculate class means over the training set and find the nearest mean class face.
[Figure: original face, 8 eigenfaces, and the weights [ω1, ω2, ..., ω8]. Adapted from Z. Dodds.]

Robust Real-time Object Detection
by Paul Viola and Michael Jones
Presentation by Chen Goldberg, Computer Science, Tel Aviv University, June 13, 2007

About the paper
- Presented in 2001 by Paul Viola and Michael Jones (published 2002, IJCV).
- Specifically demonstrated on (and motivated by) the face detection task.
- Placed a strong emphasis upon speed optimization; allegedly, it was the first real-time face detection system.
- Widely adopted and re-implemented: Intel distributes this algorithm in a computer vision toolkit (OpenCV).

Framework scheme
The framework consists of a trainer and a detector. The trainer is supplied with positive and negative samples: positive samples are images containing the object; negative samples are images not containing it. The trainer then creates a final classifier, a lengthy process to be calculated offline. The detector applies the final classifier across a given input image.

Abstract detector
1. Iteratively sample image windows.
2. Run the final classifier on each window, and mark it accordingly.
3. Repeat with a larger window.

Features
We describe an object using simple functions, also called Haar-like features. Given a sub-window, the feature function calculates a brightness differential. For example, the value of a two-rectangle feature is the difference between the sums of the pixels within the two rectangular regions.
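The initialization and recognition steps above can be put together in a short sketch. This is a toy illustration, with random arrays standing in for real face images, and names such as train_eigenfaces are my own; it does, however, use the small-matrix A^T A trick exactly as derived above:

```python
import numpy as np

def train_eigenfaces(faces, k):
    """faces: d x n array, one flattened n-pixel face per row.
    Returns the mean face, k eigenfaces (columns of U), and training weights."""
    mean = faces.mean(axis=0)            # mean face (Psi)
    A = (faces - mean).T                 # n x d matrix of differences Phi_j
    # Small-matrix trick: eigendecompose the d x d matrix A^T A ...
    vals, vecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(vals)[::-1][:k]   # keep the k largest eigenvalues
    # ... which yields eigenvectors of C = A A^T via u_i = A v_i (normalized)
    U = A @ vecs[:, order]
    U /= np.linalg.norm(U, axis=0)
    weights = A.T @ U                    # d x k: omega_ji = u_i^T Phi_j
    return mean, U, weights

def recognize(face, mean, U, weights):
    """Project a new face into face space; return the nearest training face index."""
    w = U.T @ (face - mean)
    return int(np.argmin(np.linalg.norm(weights - w, axis=1)))

# Toy usage: 10 random 16x16 "faces", flattened to 256-vectors
rng = np.random.default_rng(0)
faces = rng.random((10, 256))
mean, U, weights = train_eigenfaces(faces, k=4)
print(recognize(faces[3], mean, U, weights))  # a training face maps to itself: 3
```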
Features example
Faces share many similar properties which can be represented with Haar-like features. For example, it is easy to notice that the eye region is darker than the upper cheeks, and that the nose bridge region is brighter than the eyes.

Three challenges ahead
1. How can we evaluate features quickly? Feature calculation is critically frequent, and an image scale pyramid is too expensive to compute.
2. How do we obtain the best representative features possible?
3. How can we refrain from wasting time on image background (i.e., non-object windows)?

Introducing the Integral Image
Definition: the integral image at location (x, y) is the sum of the pixel values above and to the left of (x, y), inclusive. We can calculate the integral image representation of the image in a single pass.

Rapid evaluation of rectangular features
Using the integral image representation, one can compute the value of any rectangular sum in constant time. For example, the pixel sum inside rectangle D can be computed as ii(4) + ii(1) - ii(2) - ii(3). As a result, two-, three-, and four-rectangle features can be computed with 6, 8, and 9 array references respectively. Now that's fast!

Scaling
The integral image lets us evaluate rectangles of any size in constant time, so no image scaling is necessary: scale the rectangular features instead!

Feature selection
Given a feature set and a labeled training set of images, we create a strong object classifier. However, there are 45,396 features associated with each image sub-window, so computing all of them is prohibitive. Hypothesis: a combination of only a small number of discriminant features can yield an effective classifier.
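The single-pass construction and constant-time rectangle sum described above can be sketched in a few lines of NumPy (the function names are my own; the four-corner arithmetic is the ii(4) + ii(1) - ii(2) - ii(3) rule from the slide):

```python
import numpy as np

def integral_image(img):
    """ii(x, y) = sum of pixels above and to the left of (x, y), inclusive.
    Two cumulative sums give the whole table in a single pass over the image."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from at most 4 array references:
    the slide's ii(4) + ii(1) - ii(2) - ii(3) corner rule."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(25).reshape(5, 5)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 3, 3))  # 108, same as img[1:4, 1:4].sum()
```

A two-rectangle Haar-like feature is then just the difference of two such rect_sum calls, which is why it costs only 6 array references.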
Variety is the key here: if we want a small number of features, we must make sure they compensate for each other's flaws.

Boosting
Boosting is a machine learning meta-algorithm for performing supervised learning. It creates a "strong" classifier from a set of "weak" classifiers. Definitions:
- "Weak" classifier: has an error rate < 0.5 (i.e., better than chance).
- "Strong" classifier: has an error rate of ε (i.e., our final classifier).

AdaBoost
Stands for "adaptive boosting". AdaBoost is a boosting algorithm for searching out a small number of good classifiers which have significant variety. It accomplishes this by endowing misclassified training examples with more weight (thus enhancing their chances of being classified correctly in the next round). The weights tell the learning algorithm the importance of each example.

AdaBoost example
- AdaBoost starts with a uniform distribution of weights over the training examples.
- Select the classifier with the lowest weighted error (i.e., a "weak" classifier).
- Increase the weights on the training examples that were misclassified. (Repeat.)
- At the end, form a carefully weighted linear combination of the weak classifiers obtained at all iterations:

  h_strong(x) = 1 if α1 h1(x) + ... + αn hn(x) >= (1/2)(α1 + ... + αn), and 0 otherwise.

(Slide taken from a presentation by Qing Chen, Discover Lab, University of Ottawa.)

Back to feature selection
We use a variation of AdaBoost for aggressive feature selection, basically similar to the previous example. Our training set consists of positive and negative images, and each simple classifier consists of a single feature.

Simple classifier
A simple classifier depends on a single feature; hence, there are 45,396 classifiers to choose from.
For each classifier we set an optimal threshold such that the minimum number of examples is misclassified:

  h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise

where h_j is the simple classifier, f_j is the feature, θ_j is the threshold, and p_j is the parity, indicating the direction of the inequality sign.

Feature selection pseudo-code
Given example images (x1, y1), ..., (xn, yn), where yi = 0, 1 for negative and positive examples respectively:
- Initialize weights w_{1,i} = 1/(2m) or 1/(2l) for training example i, where m and l are the numbers of negatives and positives respectively.
- For t = 1 ... T:
  1) Normalize the weights so that w_t is a distribution.
  2) For each feature j, train a classifier h_j and evaluate its error ε_j with respect to w_t.
  3) Choose the classifier h_t with the lowest error ε_t.
  4) Update the weights according to w_{t+1,i} = w_{t,i} β_t^(1 - e_i), where e_i = 0 if x_i is classified correctly and 1 otherwise, and β_t = ε_t / (1 - ε_t).
- The final strong classifier is

  h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) >= (1/2) Σ_{t=1}^{T} α_t, and 0 otherwise, where α_t = log(1/β_t).

The Attentional Cascade
The overwhelming majority of windows are in fact negative. Simpler boosted classifiers can reject many of the negative sub-windows while detecting all positive instances. A cascade of gradually more complex classifiers achieves good detection rates; consequently, on average, far fewer features are calculated per window.

Training a Cascaded Classifier
Subsequent classifiers are trained only on the examples which pass through all the previous classifiers, so the task faced by classifiers further down the cascade is more difficult.

Training a Cascaded Classifier (cont.)
Given a false positive rate F and detection rate D, we would like to minimize the expected number of features evaluated per window. Since this optimization is extremely difficult, the usual framework is to choose a minimal acceptable false positive rate and detection rate per layer.
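The pseudo-code above translates almost line for line into NumPy. This is a toy sketch on a synthetic 1-D training set, not the paper's trainer: the weak classifiers are single-feature threshold stumps exactly as defined above, and the small clip on the error (to avoid division by zero when a stump is perfect on the toy data) is my own addition:

```python
import numpy as np

def train_adaboost(X, y, T):
    """X: n x F feature matrix, y in {0, 1}. Returns (feature, theta, parity, alpha) per round."""
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))     # initial weights
    rounds = []
    for _ in range(T):
        w = w / w.sum()                                    # 1) normalize to a distribution
        best = None
        for j in range(X.shape[1]):                        # 2) best stump per feature
            for theta in X[:, j]:
                for p in (1, -1):
                    h = (p * X[:, j] < p * theta).astype(int)
                    err = np.sum(w * (h != y))
                    if best is None or err < best[0]:
                        best = (err, j, theta, p, h)
        err, j, theta, p, h = best                         # 3) lowest-error classifier
        beta = np.clip(err, 1e-10, None) / (1 - err)       # clip: my addition for err == 0
        w = w * beta ** (1 - (h != y))                     # 4) downweight correct examples
        rounds.append((j, theta, p, np.log(1 / beta)))
    return rounds

def strong_classify(x, rounds):
    total = sum(a for (_, _, _, a) in rounds)
    score = sum(a * (p * x[j] < p * theta) for (j, theta, p, a) in rounds)
    return int(score >= total / 2)

# Toy data: one feature; positives cluster above 0.5
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
rounds = train_adaboost(X, y, T=3)
print([strong_classify(x, rounds) for x in X])  # [0, 0, 0, 1, 1, 1]
```

The real trainer does the same thing, except that each "feature" is one of the 45,396 Haar-like responses evaluated via the integral image.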
  N = n_0 + Σ_{i=1}^{K} ( n_i Π_{j<i} p_j )

where N is the expected number of features evaluated per window, n_i is the number of features in the i-th classifier, K is the number of classifiers (layers), and p_i is the positive rate of the i-th classifier.

Pseudo-Code for the Cascade Trainer
- The user selects values for f, the maximum acceptable false positive rate per layer, and d, the minimum acceptable detection rate per layer.
- The user selects a target overall false positive rate F_target.
- P = set of positive examples; N = set of negative examples.
- F_0 = 1.0; D_0 = 1.0; i = 0
- While F_i > F_target:
  - i++
  - n_i = 0; F_i = F_{i-1}
  - While F_i > f x F_{i-1}:
    o n_i++
    o Use P and N to train a classifier with n_i features using AdaBoost.
    o Evaluate the current cascaded classifier on a validation set to determine F_i and D_i.
    o Decrease the threshold for the i-th classifier until the current cascaded classifier has a detection rate of at least d x D_{i-1} (this also affects F_i).
  - N = ∅. If F_i > F_target, then evaluate the current cascaded detector on the set of non-face images and put any false detections into the set N.
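As a quick numeric illustration of the expected-cost formula above (the layer sizes and positive rates here are made up for the example, not taken from the paper):

```python
def expected_features(n, p):
    """N = n_0 + sum_i n_i * prod_{j<i} p_j:
    expected number of features evaluated per window by a cascade."""
    total = n[0]
    reach = 1.0                      # probability a window reaches layer i
    for n_i, p_prev in zip(n[1:], p):
        reach *= p_prev              # must have passed every earlier layer
        total += n_i * reach
    return total

# Hypothetical cascade: later layers are larger, but few windows reach them
n = [2, 10, 25, 50]                  # features per layer (n_0 .. n_3)
p = [0.5, 0.4, 0.3]                  # positive rate of each earlier layer
print(expected_features(n, p))       # 2 + 10*0.5 + 25*0.2 + 50*0.06 = 15.0
```

Even though this hypothetical cascade contains 87 features in total, an average window costs only 15 feature evaluations, which is exactly why a cascade of gradually more complex classifiers is fast on the mostly negative windows.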