Classification
ECE 847: Digital Image Processing
Stan Birchfield, Clemson University

Acknowledgment: Many slides are courtesy of Frank Dellaert and Jim Rehg at Georgia Tech, from http://www-static.cc.gatech.edu/classes/AY2007/cs4495_fall/html/materials.html

Classification problems
• Detection – search a set, find all instances of a class
• Recognition – given an instance, label its identity
• Verification – given an instance and a hypothesized identity, verify whether it is correct
• Tracking – like detection, but with local search and a fixed identity

Classification issues
• Feature extraction – needed for practical reasons; the distinction is somewhat arbitrary:
  – perfect feature extraction → classification is trivial
  – perfect classifier → no need for feature extraction
• Occlusion (missing features)
• Mereology – the study of part/whole relationships (e.g., reading POLOPONY as POLO PONY, or BEATS as BEATS rather than BE EATS)
• Segmentation – how can we classify before segmenting? how can we segment before classifying?
• Context
• Computational complexity: a 20x20 binary input already has 2^400 ≈ 10^120 possible patterns!

Mereology example
[Figure: an image of joined letters; "What does this say?"]

Decision theory
• Goal is to make a decision (i.e., set a decision boundary) so as to minimize cost
• Pattern classification is perhaps the most important subfield of decision theory
• Supervised learning: features, data sets, algorithm → decision boundary

Overfitting
• The training data could be separated perfectly using nearest neighbors
• But this gives poor generalization (overfitting) – it will not work well on new data
• Occam's razor – the simplest explanation is the best (a philosophical principle based upon the orderliness of creation)
[Figure: a simple decision boundary vs. an overfit one]

Bayes decision theory
Problem: given a feature x, determine the most likely class, w1 or w2.
[Figure: class-conditional pdfs p(x|w1) and p(x|w2); these are easy to measure with enough examples]

Bayes' rule
P(wi|x) = p(x|wi) P(wi) / p(x)
posterior = likelihood (class-conditional pdf) × prior / evidence (normalization factor)

What is this P(w1|x)?
• The probability of class 1 given the data x
• P(w1|x) + P(w2|x) = 1
• Note: the area under each posterior curve is not 1
[Figure: posteriors P(w1|x) and P(w2|x) as functions of x]

Bayes classifier
• Classifier: select w1 if P(w1|x) > P(w2|x), otherwise select w2
• Decision boundaries occur where P(w1|x) = P(w2|x)
[Figure: posteriors with regions labeled "select w2", "select w1", "select w2"]

Bayes risk
The total risk is the expected loss when using the classifier (we are assuming the loss is constant here). With constant 0-1 loss this is the probability of error, R = ∫ min[P(w1|x), P(w2|x)] p(x) dx; the corresponding shaded area under the posteriors is called the Bayes risk.

Discriminative vs. generative
• Finding a decision boundary is not the same as modeling a conditional density.
• Note: bug in the Forsyth-Ponce book: P(1|x) + P(2|x) != 1

Histograms
• One way to compute class-conditional pdfs is to collect a bunch of examples and store a histogram
• Then normalize

Application: skin histograms
• Skin has a very small range of (intensity-independent) colours, and little texture
  – Compute a colour measure, check whether the colour is in this range, check whether there is little texture (median filter)
  – See this as a classifier – we can set up the tests by hand, or learn them
  – Get class-conditional densities (histograms) and priors from the data (by counting)
• Classifier: report skin if P(skin|x) exceeds a threshold

Finding skin color
• 3D histogram in RGB space
• M. J. Jones and J. M. Rehg, "Statistical Color Models with Application to Skin Detection," Int. J. of Computer Vision, 46(1):81-96, Jan. 2002.
[Figures: skin and non-skin color histograms; detection results]
Note: we have assumed that all pixels are independent! Context is ignored.
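To make the histogram-plus-Bayes-rule classifier above concrete, here is a minimal sketch of a skin detector in the spirit of Jones and Rehg. The bin count, the random placeholder training pixels, and the posterior threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

BINS = 32  # bins per RGB channel (assumed; the paper uses its own quantization)

def rgb_to_bin(pixels):
    """Map an Nx3 array of RGB pixels (0-255) to flat histogram bin indices."""
    q = (pixels // (256 // BINS)).astype(int)
    return q[:, 0] * BINS * BINS + q[:, 1] * BINS + q[:, 2]

def build_hist(pixels):
    """Normalized histogram = estimate of the class-conditional pdf P(color | class)."""
    h = np.bincount(rgb_to_bin(pixels), minlength=BINS**3).astype(float)
    return h / h.sum()

# Training data: labeled skin and non-skin pixels (random placeholders here)
skin_pixels = np.random.randint(0, 256, (10000, 3))     # replace with real skin samples
nonskin_pixels = np.random.randint(0, 256, (50000, 3))  # replace with real non-skin samples
p_color_given_skin = build_hist(skin_pixels)
p_color_given_nonskin = build_hist(nonskin_pixels)
prior_skin = len(skin_pixels) / (len(skin_pixels) + len(nonskin_pixels))  # prior by counting

def classify_skin(pixels, threshold=0.5):
    """Bayes' rule: P(skin|color) = p(color|skin)P(skin) / p(color);
    label a pixel as skin if the posterior exceeds the threshold."""
    idx = rgb_to_bin(pixels)
    like_skin = p_color_given_skin[idx] * prior_skin
    like_nonskin = p_color_given_nonskin[idx] * (1 - prior_skin)
    posterior = like_skin / (like_skin + like_nonskin + 1e-12)
    return posterior > threshold
```

Because each pixel is classified on its own, this sketch shares the limitation noted above: pixels are treated as independent and context is ignored.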
Confusion matrix
• True positive = hit
• False negative = miss = false dismissal = Type II error
• False positive = false alarm = false detection = Type I error
• Sensitivity = true positive rate = hit rate = recall: TPR = TP / (TP + FN)
• False negative rate: FNR = FN / (TP + FN); note TPR + FNR = 1
• False positive rate = false alarm rate = fallout: FPR = FP / (FP + TN)
• Specificity: SPC = TN / (FP + TN); note FPR + SPC = 1

Receiver operating characteristic (ROC) curve
[Figure: ROC curve plotting TPR against FPR, with the equal error rate (EER) point marked at 88%]
[Figure: confusion matrix for an image classifier]

Cross-validation

Naïve Bayes
• Quantize image patches, then compute a histogram of patch types within a face
• But histograms suffer from the curse of dimensionality: a histogram in N dimensions is intractable for N > 5
• To solve this, assume independence among the pixels; the features are the patch types
• P(image|face) = P(label 1 at (x1,y1)|face) ... P(label k at (xk,yk)|face)

Histograms applied to faces and cars
H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detection Applied to Faces and Cars," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2000.

Alternative: kernel density estimation (Parzen windows)
• K/N is the fraction of samples that fall into volume V, so the density estimate is p(x) ≈ (K/N) / V
• Non-parametric technique
• Center a kernel at each data point, then sum the results (and normalize) to get the pdf
[Figures: Parzen window density estimation with box and Gaussian kernels]

Comparison
• Histograms: non-parametric; smoothing parameter = number of bins; the data can be discarded afterwards; discontinuous; bin boundaries are arbitrary; d dimensions → M^d bins (curse of dimensionality)
• Parzen windows: non-parametric; smoothing parameter = size of the kernel; the data must always be kept; discontinuous (box kernel) or continuous (Gaussian kernel); boundaries are data-driven (box) or absent (Gaussian); dimensionality is not as much of a curse

Another alternative: locally weighted averaging (LWA)
• Keep an instance database
• At each query point, form a locally weighted average, with f(i) = 1 for positive examples and 0 for negative examples
• Equivalent to Parzen windows
• Memory-based, lazy learning, applicable to any kernel; can be slow
[Figures: LWA classifier with a circular kernel – data (2 classes), kernel weights, all data, LWA posterior]

K-nearest neighbors
Classification = majority vote of the K nearest neighbors (a short code sketch of Parzen estimation and the K-NN vote appears after the linear-discriminant slide below)

Recognition by finding patterns
• We have seen very simple template matching (under filters)
• Some objects behave like quite simple templates – e.g., frontal faces
• Strategy: find image windows, correct the lighting, and pass them to a statistical test (a classifier) that accepts faces and rejects non-faces

Finding faces
• Faces "look like" templates (at least when they're frontal)
• General strategy: search image windows at a range of scales, correct for illumination, and present each corrected window to a classifier
• Issues: how to correct? what features? what classifier?
[Figure: pipeline – training database → feature extraction → learner; test image → feature extraction → classifier → decision]

Face detection
http://ocw.mit.edu/NR/rdonlyres/Brain-and-Cognitive-Sciences/9-913Fall-2004/B89E6E21-3DDA-4E70-9107-C66F7B8C7DED/0/class1_2_2004.pdf

Face recognition
http://ocw.mit.edu/NR/rdonlyres/Brain-and-Cognitive-Sciences/9-913Fall-2004/B89E6E21-3DDA-4E70-9107-C66F7B8C7DED/0/class1_2_2004.pdf

Linear discriminant functions
• g(x) = w^T x + w0
• The decision surface is a hyperplane
• w is perpendicular to the hyperplane
• A neural network is a combination of linear discriminant functions
• The sigmoid function is differentiable, which enables backpropagation
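As promised above, here is a minimal sketch of Parzen-window density estimation and the K-nearest-neighbor vote on toy data; the Gaussian kernel width, the two synthetic clusters, and the equal class priors are illustrative assumptions.

```python
import numpy as np

def parzen_pdf(x, samples, h=0.5):
    """Parzen-window estimate: center a Gaussian kernel of width h at every
    sample and average the kernel values at x."""
    d = samples.shape[1]
    norm = (2 * np.pi * h**2) ** (d / 2)
    k = np.exp(-np.sum((samples - x) ** 2, axis=1) / (2 * h**2)) / norm
    return k.mean()

def knn_classify(x, samples, labels, k=5):
    """K-nearest-neighbor rule: majority vote of the k closest training samples."""
    dist = np.linalg.norm(samples - x, axis=1)
    nearest = labels[np.argsort(dist)[:k]]
    return np.bincount(nearest).argmax()

# toy two-class data in 2D (illustrative)
rng = np.random.default_rng(0)
class0 = rng.normal([0, 0], 1.0, (100, 2))
class1 = rng.normal([3, 3], 1.0, (100, 2))
samples = np.vstack([class0, class1])
labels = np.r_[np.zeros(100, int), np.ones(100, int)]

x = np.array([2.5, 2.5])
# class-conditional pdfs via Parzen windows, combined with equal priors by Bayes' rule
p0, p1 = parzen_pdf(x, class0), parzen_pdf(x, class1)
print("P(w1|x) =", p1 / (p0 + p1), "  kNN label:", knn_classify(x, samples, labels))
```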
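And a small sketch of the linear discriminant g(x) = w^T x + w0 with a sigmoid output, the building block of the neural-network face detectors discussed next; the weights here are arbitrary placeholders rather than trained values.

```python
import numpy as np

def sigmoid(a):
    """Differentiable squashing function; its derivative is what enables backpropagation."""
    return 1.0 / (1.0 + np.exp(-a))

def linear_discriminant(x, w, w0):
    """g(x) = w^T x + w0; the decision surface g(x) = 0 is a hyperplane with normal vector w."""
    return np.dot(w, x) + w0

# placeholder weights (a real network would learn these by backpropagation)
w, w0 = np.array([1.5, -0.7]), 0.2
x = np.array([0.4, 0.9])
g = linear_discriminant(x, w, w0)
print("g(x) =", g, " class:", 1 if g > 0 else 2, " sigmoid output:", sigmoid(g))
```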
Neural networks for detecting faces
• Henry A. Rowley, Shumeet Baluja, and Takeo Kanade, "Neural Network-Based Face Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, January 1998.
• Positive training images: scaled, rotated, translated, and mirrored
• Negative training images
• Arbitration among multiple networks
[Figures: network architecture and detection results]

Bootstrapping
• The hardest examples to classify are those near the decision boundary
• These are also the most useful for training
• Approach: run the detector, find examples of misclassification, and feed them back into the training process

Real-time face detection (Viola and Jones, CVPR 2001)
• Components: a cascade architecture and box-sum features (integral image)
[Figure: cascade diagram – stages H1, H2, ..., Hn, each with a Non-face exit; only windows passing every stage are labeled Face]

Haar-like features
• The integral image makes their computation fast
• A feature's value is the difference between the sums of the pixels within the white and black rectangle regions: f_i = Sum(r_i, white) − Sum(r_i, black)
• Weak classifier: h_i(x) = +1 if f_i > threshold, −1 otherwise

Boosting (Adaboost)
• F = sign(w1 h1 + w2 h2 + ... + wn hn), where h_i(x) = +1 if f_i > θ_i and −1 if f_i ≤ θ_i
• The more distinctive the feature, the larger its weight
(A code sketch of integral-image features and thresholded weak classifiers appears after the OpenCV notes below.)
[Figures: training images, detection results, training the Viola-Jones detector]

Direct feature selection (two orders of magnitude faster training)
Jianxin Wu, James M. Rehg, and Matthew D. Mullin, "Learning a Rare Event Detection Cascade by Direct Feature Selection," NIPS 2003.

Using the OpenCV detector
1. Collect a database of positive samples and a database of negative samples.
2. Mark the objects with objectmarker.exe.
3. Build a vec file out of the positive samples using createsamples.exe.
4. Run haartraining.exe to build the classifier.
5. Run performance.exe to evaluate the classifier.
6. Run haarconv.exe to convert the classifier to an .xml file.

Using the OpenCV detector: example (from Zhichao Chen)
1. Mark the positive samples, producing info.txt.
2. Use createsamples.exe to pack the positive samples into the file hw.vec (the minimum size of the marked objects was 15 by 12):
   createsamples -info info.txt -vec hw.vec -w 15 -h 12
3. Use haartraining.exe to train the classifier:
   haartraining -data hw -vec hw.vec -bg background.txt -mem 100 -w 15 -h 12 -nstages 18
4. Convert the classifier to xml:
   haarconv hw hw.xml 15 12
5. Use performance.exe to check the performance:
   performance -data hw.xml -info info.txt -w 15 -h 12 -ni
6. Use the PatternDetector class in Blepo to display the results:
   m_Detector = new PatternDetector(xml_file_name);
7. In the results, you may see an object detected twice or more, with overlap.

Using the OpenCV detector: results (from Zhichao Chen)
Result from checking performance: the classifier detected 469 positive objects and missed 36. The number of false positives is larger (1991) because:
• A positive object may be detected many times at slightly different positions, so some "good" detections are counted as "false"
• Only 18 stages were used; more stages would reduce the false positives, at the expense of more training time
• No background image was included for training
Conclusions:
• Use the proper sample size for training; basically, the sample size should be similar to the minimum size of the marked objects
• If the FPR is too high, increase the number of stages
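Here is the code sketch promised in the Haar-feature section above: integral-image rectangle sums, a two-rectangle feature f_i, thresholded weak classifiers h_i, and the weighted vote F = sign(Σ w_i h_i). The window contents, feature geometry, weights, and thresholds are illustrative assumptions, not a trained Viola-Jones detector.

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum needs only four lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r, c, h, w):
    """Sum of pixels in the h-by-w rectangle whose top-left corner is (r, c)."""
    total = ii[r + h - 1, c + w - 1]
    if r > 0:
        total -= ii[r - 1, c + w - 1]
    if c > 0:
        total -= ii[r + h - 1, c - 1]
    if r > 0 and c > 0:
        total += ii[r - 1, c - 1]
    return total

def two_rect_feature(ii, r, c, h, w):
    """f_i = Sum(white rectangle) - Sum(black rectangle): left half minus right half."""
    return rect_sum(ii, r, c, h, w // 2) - rect_sum(ii, r, c + w // 2, h, w // 2)

def weak_classifier(f, theta):
    """h_i(x) = +1 if f_i > theta_i, -1 otherwise."""
    return 1 if f > theta else -1

# toy strong classifier: F = sign(w1*h1 + w2*h2 + ...); weights/thresholds are placeholders
window = np.random.rand(24, 24)
ii = integral_image(window)
features = [two_rect_feature(ii, 2, 2, 8, 12), two_rect_feature(ii, 10, 4, 6, 16)]
alphas, thetas = [0.7, 0.3], [0.0, 0.5]
F = np.sign(sum(a * weak_classifier(f, t) for a, f, t in zip(alphas, features, thetas)))
print("face" if F > 0 else "non-face")
```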
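To apply a trained cascade such as the hw.xml produced by the workflow above, one alternative to Blepo's PatternDetector is OpenCV's own Python interface; this is a sketch assuming the file name hw.xml from above and a hypothetical test image test.jpg. The minNeighbors parameter merges the overlapping multiple detections mentioned in the notes.

```python
import cv2

# load the trained cascade (hw.xml from the haartraining workflow above)
detector = cv2.CascadeClassifier("hw.xml")

img = cv2.imread("test.jpg")          # hypothetical test image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scan windows at a range of scales; overlapping raw detections of the same
# object are merged when at least minNeighbors of them agree
boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3, minSize=(15, 12))

for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("result.jpg", img)
```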
OpenCV detector links (from Zhichao Chen)
• Original Viola-Jones paper: http://research.microsoft.com/~viola/Pubs/Detect/violaJones_CVPR2001.pdf
• OpenCV library: http://sourceforge.net/projects/opencvlibrary
• How to build a cascade of boosted classifiers based on Haar-like features: http://lab.cntl.kyutech.ac.jp/~kobalab/nishida/opencv/OpenCV_ObjectDetection_HowTo.pdf
• Objectmarker.exe, haarconv.exe, and *.dll files: http://www.iem.pw.edu.pl/~domanskj/haarkit.rar

Fisher linear discriminant
http://ocw.mit.edu/NR/rdonlyres/Brain-and-Cognitive-Sciences/9-913Fall-2004/B89E6E21-3DDA-4E70-9107-C66F7B8C7DED/0/class1_2_2004.pdf

Linear SVMs
http://ocw.mit.edu/NR/rdonlyres/Brain-and-Cognitive-Sciences/9-913Fall-2004/B89E6E21-3DDA-4E70-9107-C66F7B8C7DED/0/class1_2_2004.pdf

Non-linear SVMs
http://ocw.mit.edu/NR/rdonlyres/Brain-and-Cognitive-Sciences/9-913Fall-2004/B89E6E21-3DDA-4E70-9107-C66F7B8C7DED/0/class1_2_2004.pdf

Eigenfaces
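The eigenfaces slides themselves are not reproduced above; as a minimal sketch of the standard computation (PCA on vectorized, aligned face images), with random placeholder data standing in for a real face database:

```python
import numpy as np

def eigenfaces(faces, k=10):
    """faces: N x (h*w) matrix, one vectorized aligned face per row.
    Returns the mean face and the top-k eigenfaces (principal components)."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data: rows of vt are eigenvectors of the covariance matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(face, mean, basis):
    """Coordinates of a face in the eigenface subspace; recognition then compares
    these low-dimensional coefficient vectors instead of raw pixels."""
    return basis @ (face - mean)

# illustrative data: 100 random 32x32 "faces" (replace with a real, aligned face set)
faces = np.random.rand(100, 32 * 32)
mean, basis = eigenfaces(faces, k=10)
coeffs = project(faces[0], mean, basis)
print(coeffs.shape)  # (10,)
```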