Does one size really fit all? Evaluating classifiers in Bag-of-Visual-Words classification
Christian Hentschel, Harald Sack
Hasso Plattner Institute

Agenda
1. Content-based Image Classification – Motivation
2. Bag-of-Visual-Words
3. Bag-of-Visual-Words Classification
  ■ Classifier Evaluation
  ■ Model Visualization
4. Conclusion

Content-based Image Classification
■ Find all photos that show ... [enter your favorite concept here]

Content-based Image Classification (2)
■ Training:
  □ Positive images (that depict a concept)
  □ Negative images (that don't)
■ Classification:
  □ Test image: does it depict the concept (or not)?

Bag-of-Visual-Words
■ Origin: text classification
  □ e.g. task: classify forum posts into "insult" (positive) and "not insult" (negative)
  □ D1: "haha... at least get your insults straight you idiot!!. ..."
  □ D2: "You're one of my favorite commenters."
  □ Vocabulary: "idiot": 1, "favorite": 2, "to": 3, "you": 4, "at": 5, "least": 6, "commenter": 7, …
  □ D1 → [1, 2, 1, 1, 2, 0, 0, …]
  □ D2 → [1, 1, 1, 1, 0, 1, 1, …]

Bag-of-Visual-Words (2)
■ Learn a decision rule (e.g. linear SVM)
  □ i.e. learn feature weights
[Adapted from A. Mueller, https://github.com/amueller/ml-berlin-tutorial]

Bag-of-Visual-Words (3)
■ Examples of visual words: Airplanes, Motorbikes, Faces, Wild Cats, Leaves, People, Bikes [Schmid, 2013]

Bag-of-Visual-Words (4)
■ (BoVW pipeline illustration)

Bag-of-Visual-Words Classification
■ De-facto standard: kernel-based Support Vector Machines
  □ Decision rule: f(x) = sign(Σ_i α_i y_i K(x_i, x) + b)
  □ Kernel function: K(x, x') = exp(−γ · d(x, x'))
  □ Distance metric: chi-square distance, d(x, x') = Σ_k (x_k − x'_k)² / (x_k + x'_k)

Bag-of-Visual-Words Classification (2)
■ Testing different classification models
  □ Metric: Average Precision (AP, area under the precision–recall curve)
■ Test dataset: Caltech-101
  □ 100 + 1 object classes
  □ 31–800 images per class
■ Tested classifiers:
  □ Naïve Bayes, k-NN, Logistic Regression
  □ SVM: linear SVM, RBF-kernel SVM, Chi2-kernel SVM
  □ Ensemble methods: Random Forest, AdaBoost
  □ Hyperparameters optimized by grid search using cross-validation

Bag-of-Visual-Words Classification – Results
■ Mean AP scores over all classes:
  □ Chi2-kernel SVM: 0.67
  □ AdaBoost: 0.63
  □ Random Forest: 0.61
  □ RBF-kernel SVM: 0.59
  □ Linear SVM: 0.55
  □ Logistic Regression: 0.55
  □ k-NN: 0.52
  □ Naïve Bayes: 0.48

Bag-of-Visual-Words Classification – Results (2)
■ Gap in mAP between best (Chi2-SVM) and worst (Naïve Bayes): 0.19
  □ Poor performance of Naïve Bayes and k-NN, but fast training
■ Superior performance of kernel-based SVMs, but:
  □ The kernel function (Chi2 vs. Gaussian RBF) is crucial:
    – ensemble methods outperform the Gaussian RBF
    – the Gaussian RBF is only slightly better than the linear SVM
  □ Increased evaluation time:
    – the complex kernel function is evaluated between every support vector and each test example
    – ensemble methods reduce classification time
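
Sketch – BoVW Feature Extraction (illustrative)
■ To make the pipeline above concrete: local descriptors are clustered into a visual vocabulary and every image is encoded as a histogram of visual-word counts. The descriptor extractor (e.g. dense SIFT), vocabulary size, and clustering method below are assumptions for illustration, not the exact setup behind the reported results.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def build_vocabulary(descriptors_per_image, n_words=1000, seed=0):
        """Cluster all local descriptors (one array per image) into n_words visual words."""
        stacked = np.vstack(descriptors_per_image)
        return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(stacked)

    def bovw_histogram(descriptors, vocabulary):
        """Assign each local region to its nearest visual word and count occurrences."""
        words = vocabulary.predict(descriptors)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)  # L1-normalised visual-word histogram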
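
Sketch – Classifier Comparison for One Concept (illustrative)
■ A minimal one-vs-rest comparison of the Chi2-kernel SVM (via a precomputed kernel) and AdaBoost, each tuned by grid search with cross-validation and scored by average precision. X and y (BoVW histograms and binary labels), the parameter grids, and the gamma heuristic are assumptions for illustration, not the configuration behind the reported AP scores.

    from sklearn.svm import SVC
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import average_precision_score
    from sklearn.metrics.pairwise import chi2_kernel

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Chi2-kernel SVM: K(x, x') = exp(-gamma * sum((x_k - x'_k)^2 / (x_k + x'_k)))
    gamma = 1.0 / X_train.shape[1]                      # illustrative heuristic
    K_train = chi2_kernel(X_train, gamma=gamma)         # (n_train, n_train)
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)  # (n_test, n_train)
    svm = GridSearchCV(SVC(kernel="precomputed"), {"C": [0.1, 1, 10, 100]},
                       scoring="average_precision", cv=5)
    svm.fit(K_train, y_train)
    ap_svm = average_precision_score(y_test, svm.decision_function(K_test))

    # AdaBoost: no kernel matrix needed, much cheaper at test time
    ada = GridSearchCV(AdaBoostClassifier(random_state=0),
                       {"n_estimators": [100, 200, 400]},
                       scoring="average_precision", cv=5)
    ada.fit(X_train, y_train)
    ap_ada = average_precision_score(y_test, ada.decision_function(X_test))

    print(f"AP Chi2-SVM: {ap_svm:.2f}   AP AdaBoost: {ap_ada:.2f}")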

Bag-of-Visual-Words Classification – Results (3)
■ Correlation between training set size and average precision

Bag-of-Visual-Words Classification – Results (4)
■ Outliers:
  □ "minaret"
  □ "leopards"

Bag-of-Visual-Words Classification – Model Visualization
■ Visualize the impact of individual image regions on the classification result
  □ Pipeline: local region descriptor → BoVW vector → feature weights
  □ Use ensemble methods
    – no kernel function
    – AdaBoost: direct indicator of feature importance (mean decrease in impurity)

Example visualizations
■ "leopards"
■ "minaret"
■ "car_side"
■ "watch"

Conclusion
■ Kernel-based SVMs are the best choice when aiming for accuracy
  □ The kernel function is crucial
  □ Evaluation time cost is high
■ Ensemble methods are the runner-up
  □ Fast evaluation
  □ Offer an intuitive visualization of model parameters
■ Visual analytics reveal deficiencies in datasets
  □ Improperly chosen training data affects classification results

Thank you for your attention!
Christian Hentschel, Harald Sack
Hasso Plattner Institute
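
Backup – Model Visualization Sketch (illustrative)
■ A minimal sketch of the visualization idea above: each local image region is weighted by the feature importance (mean decrease in impurity) of the visual word it was assigned to. The fitted AdaBoost model, the k-means vocabulary, and the per-region descriptors with their (x, y) positions for one test image are assumed to exist; the helper name and the center-point accumulation are illustrative choices, not the authors' implementation.

    import numpy as np

    def region_importance_map(descriptors, positions, vocabulary, model, image_shape):
        """Heat map scoring how strongly each local region supports the concept."""
        words = vocabulary.predict(descriptors)         # visual word per region
        weights = model.feature_importances_[words]     # mean decrease in impurity
        heatmap = np.zeros(image_shape[:2], dtype=float)
        for (x, y), w in zip(positions, weights):
            heatmap[int(y), int(x)] += w                # accumulate at region centers
        return heatmap

■ Overlaying such a map on a test image is what makes dataset deficiencies visible, as in the "minaret" and "leopards" outlier classes discussed above.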