New Features and Insights for Pedestrian Detection
Stefan Walk, Nikodem Majer, Konrad Schindler, Bernt Schiele

Outline
• Authors
• Abstract
• Main contributions
• Algorithms
• Experiments
• Conclusion

Authors (1/4)
• Stefan Walk
  – Experience
    • 2007-, PhD Candidate in Computer Science, Technische Universität Darmstadt
    • 2003-2007, Diploma in Physics, Technische Universität Darmstadt, Germany, 2007
  – Research interest
    • People Detection
    • Detecting from video data (utilizing motion information)
  – Papers
    • Multi-cue Onboard Pedestrian Detection (CVPR09)

Authors (2/4)
• Nikodem Majer
  – Experience
    • 2007-, PhD Candidate in Computer Science, Technische Universität Darmstadt
  – Research interest
    • …
  – Papers
    • …

Authors (3/4)
• Konrad Schindler
  – Experience
    • 2009-: assistant professor, TU Darmstadt, Germany
    • 2007-2008: post-doc, ETH Zurich
    • 2004-2006: post-doc, Monash University, Melbourne/Australia
    • 2001-2003: research assistant, Graz University of Technology, Austria
  – Research interest
    • computer vision (3D scene analysis, biologically inspired vision, tracking)
    • image processing, pattern recognition, machine learning, photogrammetry
  – Papers
    • PAMI10, CVPR10, ICCV10…

Authors (4/4)
• Bernt Schiele
  – Experience
    • 1999-2004, Assistant Professor, ETH Zurich, Switzerland
    • 1997-2000, Postdoctoral Associate and Visiting Assistant Professor, MIT, Cambridge, MA, USA
    • 1994, Visiting researcher at CMU
    • AE of PAMI, IJCV, AC of ECCV’08, CVPR’09, ICCV’09, PC of ICCV 2011
  – Research interest
    • Perceptual computing, human-computer interfaces
  – Papers
    • …

Abstract (1/2)
• Despite impressive progress in people detection the performance on challenging datasets like Caltech Pedestrians or TUD-Brussels is still unsatisfactory
• In this work we show that motion features derived from optic flow yield substantial improvements on image sequences, if implemented correctly, even in
the case of low-quality video and consequently degraded flow fields
• Furthermore, we introduce a new feature, self-similarity on color channels, which consistently improves detection performance both for static images and for video sequences, across different datasets. In combination with HOG, these two features outperform the state of the art by up to 20%.

Abstract (2/2)
• Finally, we report two insights concerning detector evaluations, which apply to classifier-based object detection in general
• First, we show that a commonly under-estimated detail of training, the number of bootstrapping rounds, has a drastic influence on the relative (and absolute) performance of different feature/classifier combinations
• Second, we discuss important intricacies of detector evaluation and show that current benchmarking protocols lack crucial details, which can distort evaluations

Main contributions
• First, we introduce a new feature based on self-similarity of low-level features, in particular color histograms from different sub-regions within the detector window
• The second main contribution is to establish a standard for what pedestrian detection with a global descriptor can achieve at present, including a number of recent advances which we believe should be part of the “best practice”, but have not yet been included in systematic evaluations
• Our third main contribution consists of two important insights that apply not only to pedestrian detection, but more generally to classifier-based object detection. (1) Bootstrapping is very important.
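(2) The existing evaluation protocol is insufficient

Outline
• The style of this paper is similar to the usual style of this lab's papers
  – Test the different features and classifiers currently used in pedestrian detection on two datasets introduced by the group itself (Caltech Pedestrians, TUD-Brussels), and evaluate the strengths and weaknesses of these algorithms (the better an algorithm performs, the more attention it receives)
  – Propose a new feature and conclude from experiments that "adding our feature to the original method further improves the performance of the pedestrian detection system"
• Related Features
  – Haar-like: successfully applied to face detection by Viola-Jones (VJ) in 2001
  – HOG (Histogram of Oriented Gradients): successfully applied to pedestrian detection by Dalal in 2005
  – HOF (Histogram of Flow): proposed by Dalal in 2006 for pedestrian detection in video
  – HOG-LBP: applied to pedestrian detection by Xiaoyu Wang in 2009, high performance
  – CSS (Color Self-Similarity): proposed in this paper
• Related Classifiers
  – SVM
  – MPLBoost (Multiple Pose Boosting): proposed by Dollar in 2008

Haar-like feature (1/2)
• Haar-like feature
  – The difference between the pixel sums of two rectangles arranged in a specific pattern inside the image
  – Haar feature responses can be computed quickly with an integral image
• Variants of the Haar feature
  – Rotated by 45, 22.5, 11.25 degrees…, but still restricted to rectangles
  – Haar features over arbitrary polygonal regions (CVPR10)
[Figures: traditional Haar features; integral-image computation of Haar features]

Haar-like feature (2/2)
• Haar features of arbitrary shape
  – The pixel sum of an arbitrary polygonal region can be expressed as the pixel sums of a series of trapezoidal regions
  – The pixel sum of a trapezoidal region equals the difference between the pixel sums of two right triangles
  – The key step is computing the integral image for right-triangle regions, parameterized by (x, y, slope)

HOG feature (1/1)
• HOG feature: histogram of oriented gradients
  – Apply gamma correction to the input image
  – Compute the gradient magnitude and orientation at every pixel of the input image
  – Weight the gradient magnitudes with a Gaussian and use trilinear interpolation to compute the gradient-orientation histogram of each cell
  – Normalize the histograms of neighboring cells to obtain the final feature vector
[Figures: HOG computation pipeline; trilinear interpolation in HOG]

HOF feature (1/1)
• HOF feature: histogram of flow
  – Compute the optical flow of the input image in the x and y directions (e.g. with the LK algorithm)
  – For specific pairs of regions, compute the flow-gradient magnitude and orientation from the differences in x and y flow between corresponding pixels
  – Build a histogram over the flow-gradient orientation, weighted by the flow-gradient magnitude
[Figure: Original 3x3 IMHwd (Internal Motion Boundary wavelet diff.)]

HOG-LBP (1/1)
• HOG-LBP feature: HOG concatenated with LBP
  – HOG: trilinear interpolation and Gaussian weighting are replaced by convolution
  – LBP (Local Binary Pattern): binary patterns of local regions
  – This feature achieved the best result to date on the INRIA Person dataset
[Figure: illustration of the LBP feature]

CSS (1/1)
• CSS feature: color self-similarity
  – For each 8x8 image region, compute a color histogram using trilinear interpolation
  – We experimented with different color spaces, including 3x3x3 histograms in RGB, HSV, HLS and CIE Luv space, and 4x4 histograms in normalized rg, HS and uv, discarding the intensity and only keeping the chrominance.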
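Among these, HSV worked best, and is used in the following
  – The similarities between these histograms form the feature vector. The authors tried the L1-norm, L2-norm, chi-square distance and histogram intersection, and found that histogram intersection performed best
  – In the implementation, the 64x128 window is divided into 8x16 = 128 regions of 8x8 pixels, giving 128 histograms and 128x127/2 = 8,128 pairwise histogram similarities
• Furthermore, second order image statistics, especially co-occurrence histograms, are gaining popularity, pushing feature spaces to extremely high dimensions

Classifiers
• SVMs
  – Linear SVM
  – Histogram Intersection Kernel SVM (HIKSVM)
• MPLBoost: Multiple Pose Boosting (In ECCV08 workshop)
  – The initial training samples are split into K subsets and K strong classifiers are trained simultaneously; the classifier output is the maximum of the K classifiers' responses
  – During training, the weight of a sample stays unchanged only if it is misclassified by all strong classifiers; otherwise its weight is reduced
  – During detection, a scan window is positive if any one strong classifier labels it positive, and negative only if all strong classifiers label it negative

Evaluation protocol (1/4)
• Shortcomings of the evaluation criteria for pedestrian detection systems
  – Whether a detection window hits a person is currently decided by the VOC criterion: intersection over union > 50%
  – There is no explicit rule for matching people and detection boxes in crowds

Evaluation protocol (2/4)
• We split the set of annotations and detections into considered and ignored sets
• Annotations can fall into the ignored set because of size, position, occlusion level, aspect ratio or non-pedestrian label in the Caltech setting
• Detections can fall into the ignored set because of size. E.g.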
if we wish to evaluate on 50-pixel-or-taller, unoccluded pedestrians, any annotation labeled as occluded and any annotation or detection <50 pixels falls in the ignored set

Evaluation protocol (3/4)
• For considered detections
  – If they match a considered annotation they count as true positive
  – If they match no annotation, or only one that has already been matched to another detection, they count as false positive
  – If they match an ignored annotation they are discarded
• For ignored detections
  – If an ignored detection matches an ignored annotation, it should be discarded
  – If an ignored detection matches no annotation, it seems reasonable to discard it, but this may introduce a bias
  – If an ignored detection matches a considered annotation, count it as a true positive

Evaluation protocol (4/4)
• To summarize, there is no single correct way to evaluate on a subset of annotations, and all choices have undesirable side effects
• It is therefore imperative that published results are accompanied by detections, and that evaluation scripts are made public
• As there are boundary effects in almost any setting (all realistic datasets have a minimum annotation size), it must be possible for others to verify that differences are not artifacts of the evaluation

Database
• INRIA Person dataset
• Caltech Pedestrians dataset
  – Introduced by Dollar in 2009
  – Video sequences
  – The training set contains 192k pedestrians, the test set 155k
  – Many difficult conditions: lighting, occlusion, small scale (down to people 3 pixels tall), crowds…
  – Very thorough annotation, convenient for testing various properties of a detector
• TUD-Brussels dataset
  – Introduced by Wojek in 2009
  – Video sequences
  – Training set only, containing 1,326 people at various scales and viewpoints
• In all experiments the training samples have a uniform size of 64x128, with the person sized 48x96 and aligned

Experiment 1 – HOG-LBP (1/1)
[Plots: INRIA, TUD]
• However, while we were able to reproduce their good results on INRIA Person, we could not gain anything with LBPs on other datasets.
They seem to be affected when imaging conditions change (in our case, we suspect demosaicing artifacts to be the issue)

Experiment 2 – Color information (1/2)
[Plots: TUD, TUD]
• More than 1 false positive per image (fppi) is usually not acceptable in any practical application
• Self-similarity of colors is more appropriate than using the underlying color histograms directly as features
• On the contrary, adding the color histogram values directly even hurts the performance of HOG

Experiment 2 – Color information (2/2)
• Why is CSS effective?
  – Self-similarity encodes relevant parts like clothing and visible skin regions
• Why does using color information directly show no improvement?
  – The training data was recorded with a different camera and in different lighting conditions than the test data, so that the weights learned for color do not generalize from one to the other. (The reason is similar for Haar features.)

Experiment 3 – Bootstrap (1/2)
• With fewer than two bootstrapping rounds, performance depends heavily on the initial training set
• At least two retraining rounds are required in the HOG + linear SVM framework
• Using more initial negative samples alleviates this problem but does not solve it

Experiment 3 – Bootstrap (2/2)
• For boosting classifiers (Fig. 3(c)), the situation is worse: although mean performance seems stable over bootstrapping rounds, the overall variance only decreases slowly. The initial selection of negative samples has a high influence on the final performance even after 3 bootstrapping rounds

Experiment 4 – Seed & self-similarity (1/1)
[Plot: TUD]
• Self-similarity on HOG blocks shows little improvement
• It is important to make sure the result does not depend on the initial selection of negative samples, e.g. by retraining enough rounds with SVMs

Experiment 5 – Caltech pedestrian (1/2)

Experiment 5 – Caltech pedestrian (2/2)
• Color self-similarity is indeed complementary to gradient information
• Motion information contributes greatly to pedestrian detection.
The reason that HOF works so well on the “near” scale is probably that during multi-scale flow estimation compression artifacts are less visible at higher pyramid levels, so that the flow field is more accurate for larger people
• The performance of all evaluated algorithms is abysmal under heavy occlusion

Experiment 6 – Haar feature (1/1)
[Plot: TUD]
• Judging from the available research our feeling is that Haar features can potentially harm more than they help

Conclusion
• Main conclusions
  – Motion information greatly helps pedestrian detection in video (HOF)
  – Color self-similarity greatly improves the performance of the pedestrian detector (CSS)
  – Bootstrapping plays a key role in training the detector
  – The current evaluation criteria for object detection are inadequate…
• Secondary conclusions
  – LBP is effective only on the INRIA dataset
  – HOG + linear SVM needs at least 2 bootstrapping rounds
  – Using Haar features to assist pedestrian detection may do more harm than good

Thanks!!
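
Supplementary sketch for the Haar-like feature slides: the claim that an integral image makes Haar responses fast to compute can be illustrated as follows. This is a minimal illustration, not the implementation used in the paper. The function names (`integral_image`, `rect_sum`, `haar_two_rect`) and the left-minus-right rectangle layout are assumptions made for the example.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Pixel sum of a rectangle from four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def haar_two_rect(ii, top, left, height, width):
    """Two-rectangle Haar response: left half minus right half (assumed layout)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

img = np.arange(16).reshape(4, 4)      # toy 4x4 image with values 0..15
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))        # 5 + 6 + 9 + 10 = 30
print(haar_two_rect(ii, 0, 0, 4, 4))   # 52 - 68 = -16
```

Each rectangle sum costs four lookups regardless of its size, which is what makes dense evaluation of many Haar features affordable.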
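
Supplementary sketch for the CSS slide: the arithmetic there (128 cells of 8x8 pixels, 128x127/2 = 8,128 pairwise similarities) can be checked with a small example. This is a rough sketch under stated assumptions, not the authors' code: hard binning replaces the trilinear interpolation mentioned on the slide, the input is assumed to be already converted to HSV with channels scaled to [0, 1], and the names `cell_histograms` and `css_feature` are hypothetical.

```python
import numpy as np

def cell_histograms(hsv, cell=8, bins=3):
    """One normalized bins^3 color histogram per cell x cell block.

    hsv: float array of shape (H, W, 3), channels assumed in [0, 1].
    Hard binning is used for brevity (assumption; the slides mention
    trilinear interpolation).
    """
    h, w, _ = hsv.shape
    rows, cols = h // cell, w // cell
    # Quantize each channel to an index in 0..bins-1.
    q = np.minimum((hsv * bins).astype(int), bins - 1)
    # Combine the three channel indices into one histogram bin index.
    flat = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hists = np.zeros((rows * cols, bins ** 3))
    for r in range(rows):
        for c in range(cols):
            block = flat[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            counts = np.bincount(block.ravel(), minlength=bins ** 3)
            hists[r * cols + c] = counts / counts.sum()
    return hists

def css_feature(hsv, cell=8, bins=3):
    """Histogram-intersection similarity for every unordered pair of cells."""
    hists = cell_histograms(hsv, cell, bins)
    i, j = np.triu_indices(len(hists), k=1)   # all pairs with i < j
    # Histogram intersection: sum of element-wise minima.
    return np.minimum(hists[i], hists[j]).sum(axis=1)

rng = np.random.default_rng(0)
window = rng.random((128, 64, 3))   # stand-in for a 64x128 HSV detector window
feature = css_feature(window)
print(feature.shape)                # (8128,)
```

Because the histograms are normalized, each similarity lies in [0, 1], and two cells with identical color distributions score 1.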
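
Supplementary sketch for the Evaluation protocol slides: the considered/ignored rules can be written down as a small matching routine. This is one possible reading, not a reference implementation. The assumptions are: boxes are `(x1, y1, x2, y2)`, detections are matched greedily in descending score order, ignored annotations may absorb any number of detections, and an ignored detection that matches nothing is discarded (the option the slides note may introduce a bias). The function names `iou` and `evaluate` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def evaluate(detections, annotations, min_height=50, thresh=0.5):
    """detections: list of (score, box); annotations: list of (box, occluded)."""
    def too_small(box):
        return box[3] - box[1] < min_height

    ann_ignored = [occ or too_small(box) for box, occ in annotations]
    matched = [False] * len(annotations)   # only set for considered annotations
    tp = fp = discarded = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best, best_iou = None, thresh
        for k, (abox, _) in enumerate(annotations):
            if matched[k]:
                continue                   # already claimed by a stronger detection
            overlap = iou(box, abox)
            if overlap > best_iou:
                best, best_iou = k, overlap
        if best is None:
            # No free annotation: false positive if considered, else discard.
            if too_small(box):
                discarded += 1
            else:
                fp += 1
        elif ann_ignored[best]:
            discarded += 1                 # matched an ignored annotation
        else:
            matched[best] = True           # true positive, even if the detection is ignored
            tp += 1
    fn = sum(1 for k in range(len(annotations))
             if not ann_ignored[k] and not matched[k])
    return {"tp": tp, "fp": fp, "discarded": discarded, "fn": fn}

annotations = [((0, 0, 40, 100), False),     # considered
               ((100, 0, 140, 100), True)]   # occluded, so ignored
detections = [(0.9, (2, 0, 42, 100)),        # matches the considered annotation
              (0.8, (0, 0, 40, 100)),        # duplicate of an already matched annotation
              (0.7, (100, 0, 140, 100)),     # matches the ignored annotation
              (0.6, (300, 0, 340, 100)),     # matches nothing
              (0.5, (400, 0, 410, 30))]      # under 50 pixels tall, matches nothing
result = evaluate(detections, annotations)
print(result)   # {'tp': 1, 'fp': 2, 'discarded': 2, 'fn': 0}
```

Changing any of the assumed choices, for example counting unmatched ignored detections as false positives, changes the reported numbers, which is the point the slides make about publishing detections and evaluation scripts.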