Supplemental Material for "A Benchmark of Computational Models of Saliency to Predict Human Fixations"
Tilke Judd, Frédo Durand, and Antonio Torralba

2 Area under ROC increases with number of observers

[Figure: four plots showing area under ROC, similarity between fixation maps, and earth mover's distance (EMD) between fixation maps as a function of the number of observers, each paired with a log-scale plot of the fitted convergence. ROC fit: log10(0.9221 - AUR) = -0.3054*log10(x) - 1.1528, i.e. AUR = 0.9221 - 0.0734*x^(-0.3054). Similarity fits: linear, log10(y) = -0.34*log10(x) - 0.11; quadratic, log10(y) = -0.075*log10(x)^2 - 0.24*log10(x) - 0.13. EMD fit: log10(y) = -0.48*log10(x) + 0.67.]

Fig. 1. Measuring human performance. These plots show the performance of n observers at predicting the fixations of n other observers (for the ROC metric), and the similarity and earth mover's distance between two fixation maps, each built from n independent observers. As the number of human observers used to create a human "ground-truth" fixation map increases, performance at predicting other humans increases; the human baseline therefore depends on the number of humans used. The limit of human performance is 0.92 for ROC, 1 for Similarity, and 0 for EMD. The plots on the right, with fitted curves, indicate the rate at which performance approaches each limit.
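The convergence fits above can be reproduced by regressing log10(limit - AUC) on log10(n). A minimal sketch, using synthetic AUC values generated from the fitted form shown in Fig. 1 (the real values come from the benchmark's eye-tracking data):

```python
import numpy as np

# Synthetic per-observer-count AUC scores, generated from the paper's fitted
# form AUR = 0.9221 - 0.0734 * x^(-0.3054); illustrative only.
n_observers = np.arange(1, 20)
auc = 0.9221 - 0.0734 * n_observers ** -0.3054

# Recover the power-law fit by linear regression in log-log space,
# using the assumed asymptote 0.9221 from Fig. 1.
limit = 0.9221
slope, intercept = np.polyfit(np.log10(n_observers), np.log10(limit - auc), 1)
# slope is the decay exponent; 10**intercept is the fitted coefficient.
```

Because the synthetic data follow the model exactly, the regression recovers the exponent -0.3054 and coefficient 0.0734; on real, noisy per-image scores the fit would only approximate them.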
3 Finding optimal blur and center parameters for models under ROC metric

[Figure: top left ("Scores increase with some center weight"): area under ROC for each model (Judd, GBVS, Center, Itti&Koch2, BruceTsotsos, ContextAware, Torralba, HaoZhang, SUNsaliency, Itti&Koch, Achanta, Chance) as the weight of the center bias increases from 0 to 1. Top right ("Scores increase slightly with blur"): area under ROC for the same models as the amount of blur increases. Bottom ("Performance of models improves when blurred or combined with center map"): bar chart of area under the ROC curve (higher is better) for the original, blurred, and centered version of each model; error bars show standard error over images.]

Fig. 2. We optimize the blur and center parameters for each model under the area-under-the-ROC metric. Top left shows the change in performance with increasing weight of the center bias, tested on 100 images from the MIT ICCV 2009 data set. Top right shows the change in performance with increasing blur on the same test images. While blurring makes only a slight improvement per model, adding a center bias greatly improves many models. The best center-bias and blurring parameters for each model were chosen and then used to measure the updated performance on the benchmark data set; updated performances are shown in the bottom chart.
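The two parameters swept in Fig. 2 are a Gaussian blur applied to the model's saliency map and a weight for mixing in a center map. A minimal sketch, assuming a simple isotropic-Gaussian center model and a linear mix (the paper's exact parameterization may differ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_map(h, w):
    """Isotropic Gaussian centered on the image: a simple center-bias model.
    The sigma here is an illustrative choice, not the paper's parameter."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = min(h, w) / 3.0
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return g / g.max()

def adjust_saliency(sal, blur_sigma, center_weight):
    """Blur a saliency map, normalize it to [0, 1], and mix in a
    center map with weight center_weight in [0, 1]."""
    sal = gaussian_filter(sal.astype(float), blur_sigma)
    sal = (sal - sal.min()) / (np.ptp(sal) + 1e-12)
    return (1 - center_weight) * sal + center_weight * center_map(*sal.shape)
```

Sweeping `blur_sigma` and `center_weight` over a grid and scoring each adjusted map against held-out fixations reproduces the shape of the curves in the top panels.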
4 Finding optimal blur and center parameters for models under Similarity metric

[Figure: top left ("Scores increase with some center weight"): Similarity score for each model (Judd, GBVS, Center, BruceTsotsos, Itti&Koch2, ContextAware, Torralba, SUNsaliency, HaoZhang, Itti&Koch, Achanta, Chance) as the weight of the center bias increases from 0 to 1. Top right ("Scores increase slightly with blur"): Similarity for the same models as the amount of blur increases. Bottom ("Performance of models improves when blurred or combined with center map"): bar chart of Similarity (higher is better) for the original, blurred, and centered version of each model; error bars show standard error over images.]

Fig. 3. We optimize the blur and the center parameters for each model under the Similarity metric.
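The Similarity metric compares two maps treated as probability distributions. A common formulation, which we believe matches the one used here, is histogram intersection: normalize each map to sum to 1, then sum the pixel-wise minima. A minimal sketch:

```python
import numpy as np

def similarity(map1, map2):
    """Histogram intersection between two maps normalized to sum to 1.
    Returns 1.0 for identical distributions and 0.0 for disjoint ones."""
    p = map1.astype(float) / map1.sum()
    q = map2.astype(float) / map2.sum()
    return float(np.minimum(p, q).sum())
```

This makes the behavior noted in Fig. 5 plausible: a nearly uniform low-saliency map overlaps somewhat with any fixation distribution, so it can score moderately on Similarity while scoring poorly on ROC.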
Further description in Figure 2.

5 Finding optimal blur and center parameters for models under EMD metric

[Figure: top left ("Scores decrease with some center weight"): earth mover's distance for each model (Achanta, Chance, SUNsaliency, Itti&Koch, ContextAware, Torralba, HaoZhang, BruceTsotsos, Itti&Koch2, GBVS, Center, Judd) as the weight of the center bias increases from 0 to 1. Top right ("Scores increase slightly with blur"): EMD for the same models as the amount of blur increases. Bottom ("Performance of models improves when blurred or combined with center map"): bar chart of EMD (lower is better) for the original, blurred, and centered version of each model; error bars show standard error over images.]

Fig. 4. We optimize the blur and the center parameters for each model under the EMD metric.
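EMD measures the minimum cost of transporting one map's mass onto the other's, so it penalizes predictions that are in the wrong place in proportion to how far away they are. The benchmark computes EMD between 2-D maps; as a minimal 1-D illustration (not the paper's 2-D solver), SciPy's `wasserstein_distance` gives the same quantity for distributions on a line:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two toy 1-D "fixation maps": all mass at position 2 vs. all mass at
# position 7. Moving one unit of mass 5 positions costs 5.
positions = np.arange(10)
p = np.zeros(10); p[2] = 1.0
q = np.zeros(10); q[7] = 1.0
d = wasserstein_distance(positions, positions, u_weights=p, v_weights=q)
```

Here `d` is exactly 5.0; the distance-weighted cost is why EMD, unlike ROC, distinguishes a near miss from a far one.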
Further description in Figure 2.

6 Model performances on images binned by fixation consistency

[Figure: top row: area under ROC (higher is better), Similarity (higher is better), and EMD (lower is better) scores for the Chance, Cont.Awe, GBVS, Itti&Koch, and Center models on images binned by fixation consistency (high, medium, low). Bottom row: the same three metrics for the same models on images binned by number of fixations per image from 39 viewers: few fixations (176-281), medium number of fixations (282-311), many fixations (312-456).]

Fig. 5. Top row: performance of 5 models under all three metrics on images binned by fixation consistency. As images become less consistent, performance scores get slightly worse under the ROC metric, but slightly better under the Similarity and EMD metrics. On images with low consistency, saliency maps often predict low saliency everywhere, which produces higher Similarity and EMD scores but lower ROC scores. Bottom row: the same trend is seen when images are binned by number of fixations per image.

7 Images where humans are very consistent, but models perform poorly

[Figure: six images (a)-(f), each shown with its human fixation map and Similarity scores from 3 top models:
(a) Judd 0.351, GBVS 0.108, BruceTsotsos 0.073
(b) Judd 0.182, GBVS 0.282, BruceTsotsos 0.276
(c) Judd 0.184, GBVS 0.235, BruceTsotsos 0.19
(d) Judd 0.211, GBVS 0.253, BruceTsotsos 0.21
(e) Judd 0.438, GBVS 0.261, BruceTsotsos 0.171
(f) Judd 0.307, GBVS 0.309, BruceTsotsos 0.197]

Fig. 6.
Current models have a very hard time predicting fixations on some images even though human fixations on those images are very consistent. Shown here are the images for which the ratio of the average performance of 3 of the top-performing models to the consistency of humans was lowest. For example, in image (d) observers read the text about rip currents rather than look at the bright orange colors; the saliency models do not locate the text at all. In image (e) observers fixate on the monkeys' faces while the saliency models pick up the bright white paper and orange mushroom in the background.
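The selection criterion behind Fig. 6 (rank images by the ratio of average model performance to human consistency and keep the lowest) can be sketched as follows, with hypothetical per-image score arrays standing in for the real measurements:

```python
import numpy as np

# Hypothetical per-image scores: average Similarity of the 3 top models,
# and the inter-observer consistency score for the same five images.
model_scores = np.array([0.51, 0.18, 0.44, 0.21, 0.39])
human_consistency = np.array([0.60, 0.72, 0.55, 0.80, 0.50])

# Low ratio = humans agree on where to look, but models miss it (Fig. 6).
ratio = model_scores / human_consistency
worst_first = np.argsort(ratio)  # image indices, worst failures first
```

With these toy numbers, images 1 and 3 (high consistency, low model scores) surface first, mirroring how text-heavy images like (d) are selected.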