The Role of Features, Algorithms and Data in Visual Recognition Reporter: WuBin Data: 2011.04.29 Outline Authors Abstract Problem Experiment & Result Discussion Conclusion Outline Authors Abstract Problem Experiment & Result Discussion Conclusion Authors (Devi Parikh) Research Assistant Professor at Toyota Technological Institute (TTI) at Chicago Research interests Education computer vision machine learning pattern recognition Ph.D. : Carnegie Mellon University (2009) MS.: Carnegie Mellon University (2007) BS.: Rowan University (2005) Publications CVPR’11(2), CVPR’10(1), CVPR’09(1), CVPR’08(1), ECCV,08(1), ICCV’07(1), CVPR’07(Best Paper )… Cornell University (2009) Microsoft Research (twice) Intel Research (once) Authors (C. Lawrence Zitnick) Microsoft Research, Redmond Research Interests Education Color Image Single Image Stereo Matching PhD student, Robotics Institute Carnegie Mellon University Publications CVPR’10(2), ACM Trans. Graph(2), ECCV’10(2), CVPR’09(2), CVPR’08(2), ECCV,08(2), PAMI’08(1), IJCV’07(1), CVPR’06(1), ECCV,06(1), ICCV’05(1), PAMI’00(1)… Outline Authors Abstract Problem Experiment & Result Discussion Conclusion Abstract Which factors is critical to humans’ superior performance on visual (scene and object) recognition Learning algorithm Amount of training data Features representations In this work we take a small step towards this goal by performing a series of human studies and machine experiments We find no evidence that human pattern matching algorithms are better than standard machine learning algorithms We find that humans don’t leverage increased amounts of training data Through statistical analysis on the machine experiments and supporting human studies, we find that the main factor impacting accuracies is the choice of features 摘要 在视觉识别(场景以及物体)领域存在很多基于计算机视觉的相 关算法。为了取得更好的识别效果,有的系统着眼于复杂的学习 算法,一些则利用大量的训练数据,还有一些考虑对更有效的特 征进行建模。然而遗憾的是,所有这些系统都远远无法达到人类 的识别能力。如果我们了解了人类在视觉识别上的响应方式,那 么就能对上述三种方式的有效性产生更深刻的认识,从而发现究 竟是什么造就了人类优越的识别能力。 本文通过对人类学习和机器学习的一系列实验,朝着这个方向前 进了一小步。我们发现,没有任何证据证明人类的学习算法要优 于标准的机器学习算法。另外,人类也不依赖于增加大量的训练 样本来提高识别能力。在本文实验的基础上,通过统计分析发现, 影响识别精度的最重要的因素在于特征的选择。 Outline Authors Abstract Problem Experiment & Result Discussion Conclusion Problem What makes humans so much better at these tasks than today’s machines? Outline Authors Abstract Problem Experiment & Result Discussion Conclusion Experiment Machine experiments Classifiers Datasets Feature types Dimensionality Proportion of noisy features Number of training instances Human studies Machine experiments (1) Classifiers NN: nearest neighbor NCM: nearest class-mean LSVM: linear SVM QSVM: SVM with a quadratic polynomial kernel CSVM: SVM with a cubic polynomial kernel RBFSVM: SVM with an Radial Basis Function (RBF) kernel DT: decision tree NET: a multi-layer perceptron neural network with 1 hidden layer and 20 hidden layer nodes BOOST: boosting with linear SVM on individual features as the simple learners LDASVM: Principal Component Analysis (PCA) then Linear Discriminant Analysis (LDA) followed by a linear SVM classifier Machine experiments (2) Datasets OSR: eight categories (coast, forest, highway, inside-city, mountain, open-country, street, tall-building) of outdoor scene recognition dataset ISR: eight categories (bathroom, bedroom, dining room, gym, kitchen, living room, movie theater and stairs) from the indoor scene recognition dataset PA1: eight categories (bird, bottle, cat, dog, horse, person, pottedplant, sheep) from the PASCAL object recognition dataset PA2: eight other categories (aeroplane, bicycle, boat, chair, car, dining table, motorbike, sofa) from the same PASCAL object recognition dataset CAL: six categories (aeroplane, car-rear, face, ketch, motorbike, watch) from the Caltech-101 object categories dataset Machine experiments (3) Feature types CH: color histogram computed by assigning all pixels in an image to a pre-computed universal color dictionary computed using k-means TH: texture histogram computed over a discretization of multiscale edge orientations in the image GIST: gist descriptor BOW: bag-of-words feature descriptor for the CAL dataset ATT: binary attributes, indicate whether the objects have certain higher-level attributes such as being round, or furry, or having a head, etc. Machine experiments (4) Dimensionality CH: {4; 8; 16; 32; 64; 128; 256} GIST: vary the number of edge-orientations (osi ) at each of the three scales ([s1; s2; s3]), as well as the number of spatial blocks (n x n) the image is divided into. TH: vary the number of orientation bins across the three scales to obtain descriptors of different dimensionality BOW: the dimensionality was kept fixed at 200 by using a dictionary with 200 SIFT codewords ATT: the dimensionality was also kept fixed at 64 for PA2, while PA1 used a 32 bit version in addition to the 64 bit one by dropping the attributes that were almost always set to zero across the dataset Machine experiments (5) Proportion of noisy features {0%; 25%; 50%; 100%; 200%} 200% indicates that twice the number of original features are added as noisy features Gaussian distribution with the same mean and standard deviation Machine experiments (6) Number of training instances Vary the number of training instances used per category in the range {2; 4; 8; 16; 32; 64; 100(88 for CAL)} Human Studies (1) To prevent the use of prior knowledge about images by the subjects, we do not display to them any direct image information such as texture patches or color. Instead, we use abstracted visual patterns as stimuli. subjects performed at 34% for Figure 3(a), 47% for (b), 50% for (c) and 47% for (d) Human Studies (2) Result The Role of Algorithms The Role of Data The Role of Features The Role of Algorithms Fix the set of input features and training data for each set of experiments The learning algorithm used by humans is not superior to state-of-theart techniques on these types of problems The Role of Data (1) The machine experiments show consistent improvement as the number of training instances increase It is unlikely that machine accuracies will match those of humans when given the original images Humans may not be as capable at leveraging large amounts of training data for pattern matching Humans are very capable of generalizing from a small number of training examples The Role of Data (2) For machine experiments significantly more training examples are needed to achieve similar levels of accuracy Linear SVMs and NCM are less sensitive to noise Humans are also susceptible to data noise The Role of Features (1) Edge and gradient based features typically out perform color based features across the various datasets Humans are also known to be very sensitive to edge or contour information We perform human studies on recognition using the same features and training sets as the machine experiments, and they show similar treads The feature set is critical for recognition The Role of Features (2) Humans are shown natural images from the outdoor scene recognition dataset under different transformations and asked to select a category name from a list Don’t provide subjects with any training data Two transformations Block test The image is divided into non-overlapping blocks, and the pixels in each block are randomly shuffled Maintains the global layout of the scene, but the local statistics are lost Puzzle-test The image is divided into non-overlapping blocks, but the blocks are randomly shuffled in the image while maintaining the pixels’ relative locations in the block Local regions of the image are preserved while the global layout is not. The Role of Features (3) In both high and low resolution images, human recognition is robust to a significant loss of local statistics. This indicates that humans rely on the global layout of the scene for scene recognition In high resolution images, human recognition rates are also very robust even when the global layout of the scene is drastically altered, which indicates that humans can also rely on local regions of images The Role of Features (4) Humans do not rely on a fixed set of features. Depending on the information available to them, humans can adaptively rely on different sets of features during testing. This is true even if similar instances have never been seen before. This ability to adapt during testing is not seen in standard machine learning algorithms Outline Authors Abstract Problem Experiment & Result Discussion Conclusion Discussion Discussion The notion of features goes beyond the choice between colors, texture, the need for spatial information, etc. It includes the concepts of incorporating semantic attributes that are shared across categories Perhaps what makes the human feature representation so powerful is that these feature representations are tuned for high performance at a variety of tasks It is important to note that in addition to visual features, humans leverage prior knowledge from several non-visual higher-level and semantic features about how the world we live in functions Outline Authors Abstract Problem Experiment & Result Discussion Conclusion Conclusion In this paper we study human responses on visual recognition problems as posed to machines, to gain insight into which of the three factors is critical to humans’ superior performance. learning algorithm amount of training data features We find no evidence that human pattern matching algorithms are better than standard machine learning algorithms. Moreover, we find that humans don’t leverage increased amounts of training data. We thus hypothesize with the aid of ANOVA analysis that features are the main factor contributing to superior human performance. Future work involves extensive studies to identify which visual features humans rely on to aid in the development of novel machine recognition algorithms.