Robust Classification of Objects, Faces, and Flowers Using Natural Image Statistics
Presenter: Wang Chongxiu (王崇秀)

Outline
- Authors
- Abstract
- Background
- Framework and Implementation
- Experiments and Results
- Conclusions

Authors

Christopher Kanan: Ph.D. student at the University of California, San Diego (UCSD), expecting to graduate in 2012. His research fuses findings and methods from computer vision, machine learning, psychology, and computational neuroscience.
Homepage: http://cseweb.ucsd.edu/~ckanan/index.html
Email: ckanan@cs.ucsd.edu

Garrison Cottrell: Professor in the Computer Science & Engineering Department at UCSD. His research is strongly interdisciplinary: it concerns using neural networks as a computational model applied to problems in cognitive science, artificial intelligence, engineering, and biology. He has had success using them for tasks as disparate as modeling how children acquire words, studying how lobsters chew, and nonlinear data compression.

Abstract

Classification of images in many-category datasets has rapidly improved in recent years. However, systems that perform well on particular datasets typically have one or more limitations, such as a failure to generalize across visual tasks (e.g., requiring a face detector or extensive retuning of parameters), insufficient translation invariance, an inability to cope with partial views and occlusion, or significant performance degradation as the number of classes is increased. Here we attempt to overcome these challenges using a model that combines sequential visual attention using fixations with sparse coding. The model's biologically inspired filters are acquired using unsupervised learning applied to natural image patches. Using only a single feature type, our approach achieves 78.5% accuracy on Caltech-101 and 75.2% on the 102 Flowers dataset when trained on 30 instances per class, and it achieves 92.7% accuracy on the AR Face database with 1 training instance per person. The same features and parameters are used across these datasets to illustrate the model's robust performance.

Background: Using Natural Image Statistics

- Hand-designed features: Haar, DoG, Gabor, HOG, SIFT, and so on.
- Self-taught learning: unsupervised learning is applied to unlabeled natural images to learn basis vectors/filters that are good for representing natural images. The training data is generally distinct from the datasets the system will be evaluated on. Self-taught learning works well because it represents natural scenes efficiently while not overfitting to a particular dataset. Sparse coding is a typical example.

Background: Visual Attention

A saliency map is a topographically organized map that indicates interesting regions in an image, based on the spatial organization of the features and an agent's current goal. There are many computational models of visual attention; they typically produce maps that assign high saliency to regions with rare features.

Background: Sequential Object Recognition

Although many saliency-map algorithms have been used to predict the locations of human eye movements, little work has been done on how they can be used to recognize individual objects. There are a few notable exceptions [1, 15, 23, 27], and these approaches share several similarities. The common framework: extract features -> compute saliency maps based on the features -> extract a small window representing a fixation, classify these fixations, and use the results to guide subsequent fixations -> combine information across fixations with some mechanism. This paper builds on the NIMBLE framework.

Framework and Implementation

High-level description of the model:
1. Pre-process the image to cope with luminance variation.
2. Extract sparse ICA features from the image.
3. Use the sparse ICA features to compute a saliency map, treat it as a probability distribution, and randomly sample fixation locations from it.
4. Extract fixations from the feature maps at the sampled locations, followed by probabilistic classification.

Image pre-processing (a sketch of this step appears after the feature-learning list below):
- Resize the image so that its smallest dimension is 128 pixels, with the other dimension resized accordingly to maintain the aspect ratio.
- Convert grayscale images to color.
- Convert RGB to LMS, a color space representing the responses of the three types of cones in the human eye, named after their responsivity (sensitivity) at long, medium, and short wavelengths.
- Normalize to [0,1] with the nonlinearity r_nonlinear(z) = r_linear(z) / (r_linear(z) + 0.05), where r_linear(z) in [0,1] is a pixel of the image in LMS color space at location z. Note that r_nonlinear(z) lies in [0,1] as well.

Feature learning:
- To learn the ICA filters, 584 images from the McGill color image dataset are pre-processed as above.
- From each image, 100 patches of size b x b x 3 are extracted at random locations.
- The channel means (L, M, and S), computed across images, are subtracted from each patch, and each patch is then treated as a 3b^2-dimensional vector.
- PCA is applied to the patch collection to reduce its dimensionality: the first principal component is discarded, and the next d principal components are retained.
- FastICA is applied, yielding d ICA filters.
- Filtering an m x n x 3 image produces an m x n x d stack of filter responses, a sparse representation.

(Figure: the learned ICA filters.)
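The slides do not give implementation details for the pre-processing stage, so here is a minimal Python/NumPy sketch of it. The RGB-to-LMS matrix is the one published by Reinhard et al. for color transfer; the paper does not say which conversion it uses, and the resizing step is omitted, so treat both as assumptions.

```python
import numpy as np

# RGB -> LMS matrix from Reinhard et al., "Color Transfer between Images".
# An assumption: the paper does not specify which RGB -> LMS conversion it uses.
RGB2LMS = np.array([[0.3811, 0.5783, 0.0402],
                    [0.1967, 0.7244, 0.0782],
                    [0.0241, 0.1288, 0.8444]])

def preprocess(rgb):
    """Convert an RGB image with values in [0, 1] to normalized LMS.

    rgb: (h, w, 3) float array, or (h, w) for grayscale.
    Returns an (h, w, 3) array with values in [0, 1].
    """
    if rgb.ndim == 2:                       # grayscale -> "color"
        rgb = np.stack([rgb] * 3, axis=-1)
    lms = rgb @ RGB2LMS.T                   # per-pixel linear map to LMS
    return lms / (lms + 0.05)               # Naka-Rushton-style normalization
```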
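The filter learning itself can be sketched with scikit-learn's PCA and FastICA. The patch size b, the filter count d, and the choice of scikit-learn are illustrative assumptions; the slides specify only the pipeline (random patches, channel-mean subtraction, PCA with the first component discarded, then ICA).

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def learn_ica_filters(images, b=8, patches_per_image=100, d=100, seed=0):
    """Learn d ICA filters from b x b x 3 patches of pre-processed images.

    images: iterable of (h, w, 3) float arrays (already pre-processed).
    Returns a (d, 3 * b * b) array; each row is one filter.
    """
    rng = np.random.default_rng(seed)
    patches = []
    for img in images:
        h, w, _ = img.shape
        for _ in range(patches_per_image):   # random patch locations
            y = rng.integers(0, h - b + 1)
            x = rng.integers(0, w - b + 1)
            patches.append(img[y:y + b, x:x + b, :])
    P = np.asarray(patches, dtype=float)     # (N, b, b, 3)
    P -= P.reshape(-1, 3).mean(axis=0)       # subtract L, M, S channel means
    X = P.reshape(len(P), -1)                # (N, 3 * b^2) patch vectors

    # PCA: discard the first principal component, keep the next d.
    pca = PCA(n_components=d + 1).fit(X)
    Z = pca.transform(X)[:, 1:]

    # FastICA in the reduced space; map the filters back to pixel space.
    ica = FastICA(n_components=d, random_state=seed).fit(Z)
    return ica.components_ @ pca.components_[1:]   # (d, 3 * b^2)
```

Correlating a pre-processed m x n x 3 image with each of the d filters (e.g., via scipy.signal.correlate) then produces the m x n x d response stack used in the rest of the pipeline.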
Saliency maps:
The SUN (Saliency Using Natural statistics) model is used to generate the saliency map from the filter responses. In the bottom-up, free-viewing case, SUN assigns high saliency to locations whose filter responses are rare, consistent with the rarity-based models described in the background.

Spatial pooling (a sketch of the sampling and pooling steps follows this list):
- The saliency map is normalized to sum to one and then treated as a probability distribution over locations.
- It is randomly sampled T times; at each fixation t, a location l_t is chosen according to the saliency map.
- At l_t, a w x w x d (w = 51) stack of filter responses is extracted.
- The dimensionality of the stack is reduced by spatially subsampling it with a spatial pyramid: each w x w filter-response map is divided into 1 x 1, 2 x 2, and 4 x 4 grids, the mean filter response in each grid cell is computed, and the cell means are concatenated into a vector that is normalized to unit length. This reduces the dimensionality from w * w * d (51^2 * d) to 21d (1 + 4 + 16 = 21 cells per filter).
- l_t is normalized by the height and width of the image and stored along with the corresponding features.
- After acquiring T fixations from every training image, PCA is applied to the collected feature vectors. The first 500 principal components are retained and then whitened. Finally, the post-PCA fixation features, denoted w_{k,i}, are each made unit length.
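Here is a minimal sketch of the fixation sampling and pyramid pooling just described; the function names and the cell ordering within the pooled vector are illustrative choices.

```python
import numpy as np

def sample_fixations(saliency, T=100, rng=None):
    """Sample T fixation locations from a saliency map treated as a pmf."""
    rng = rng or np.random.default_rng()
    p = saliency.ravel() / saliency.sum()            # normalize to sum to one
    idx = rng.choice(saliency.size, size=T, p=p)
    return np.column_stack(np.unravel_index(idx, saliency.shape))  # (T, 2)

def pyramid_pool(stack, levels=(1, 2, 4)):
    """Pool a w x w x d response stack with a 1x1, 2x2, 4x4 spatial pyramid.

    Averages each of the d maps over the grid cells of every level and
    concatenates the 21 cell means per filter, giving a 21 * d vector
    that is normalized to unit length.
    """
    w, _, d = stack.shape
    feats = []
    for g in levels:
        edges = np.linspace(0, w, g + 1).astype(int)  # grid cell boundaries
        for i in range(g):
            for j in range(g):
                cell = stack[edges[i]:edges[i + 1], edges[j]:edges[j + 1], :]
                feats.append(cell.mean(axis=(0, 1)))  # d means per cell
    v = np.concatenate(feats)                         # (21 * d,)
    return v / (np.linalg.norm(v) + 1e-12)
```

Each pooled 21d vector would then pass through the 500-component PCA, whitening, and unit-length normalization described above.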
Training and classification:
Let g_t denote the vector of fixation features at fixation t. Under the Naïve Bayes assumption, fixations are treated as conditionally independent given the class:

P(g_1, ..., g_T | C = k) = ∏_{t=1}^{T} P(g_t | C = k).

Applying Bayes' rule, the class posterior is

P(C = k | g_1, ..., g_T) ∝ P(C = k) ∏_{t=1}^{T} P(g_t | C = k).

P(C = k) is uniform, and we fix T = 100, which would be about 30 s of viewing time for a person, assuming 3 fixations per second.

Experiments and Results

(Figures: results on Caltech-101, Caltech-256, the AR Face database, and the 102 Flowers dataset.)

Conclusions

One of the reasons we think our approach works well is that it employs a nonparametric, exemplar-based classifier. The Naïve Bayes assumption is obviously false, and learning a more flexible model could lead to performance improvements.
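To make the conclusion's point about the nonparametric, exemplar-based classifier concrete, here is a minimal sketch of the per-fixation likelihood and the Naïve Bayes combination rule. Estimating P(g_t | C = k) with a Gaussian kernel density estimate over the stored training fixations, and the bandwidth h, are assumptions for illustration; the paper's actual density estimator may differ.

```python
import numpy as np

def log_gauss_kde(x, exemplars, h=0.1):
    """log P(x | class) under a Gaussian KDE over stored exemplar features.

    x: (D,) query fixation feature; exemplars: (N, D) stored fixations.
    The Gaussian normalizing constant is omitted: it is identical for
    every class, so it does not affect the argmax.
    """
    d2 = ((exemplars - x) ** 2).sum(axis=1)       # squared distances
    log_k = -d2 / (2 * h ** 2)                    # unnormalized log kernels
    m = log_k.max()
    return m + np.log(np.exp(log_k - m).mean())   # stable log-mean-exp

def classify(fixations, class_exemplars):
    """Naïve Bayes over fixations: argmax_k sum_t log P(g_t | C=k).

    fixations: (T, D) features from the test image.
    class_exemplars: dict mapping class k -> (N_k, D) stored fixations.
    A uniform prior P(C=k) adds a constant, so it is dropped.
    """
    scores = {k: sum(log_gauss_kde(g, ex) for g in fixations)
              for k, ex in class_exemplars.items()}
    return max(scores, key=scores.get)
```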