Supplemental Material for
A Benchmark of Computational Models of Saliency to Predict Human Fixations
Tilke Judd, Frédo Durand, and Antonio Torralba
[Figure 1: three pairs of plots versus number of observers (1-20). Left panels: "Area under ROC increases with number of observers"; "Similarity between fixation maps increases with number of observers"; "EMD between fixation maps decreases with number of observers". Right panels (log-log fits): AUR performance appears to converge to 0.922, with fit log10(0.9221 - AUR) = -0.3054*log10(x) - 1.1528, i.e. AUR = -0.0734*x^(-0.3054) + 0.9221; Similarity performance converges to 1, with linear fit log10(1 - S) = -0.34*log10(x) - 0.11 (a quadratic fit is also shown); EMD performance converges to 0, with fit log10(EMD) = -0.48*log10(x) + 0.67.]
Fig. 1. Measuring human performance. These plots show how well a fixation map built from n observers predicts the fixations of n other observers (ROC metric), and the similarity and earth mover's distance between two fixation maps, each built from n independent observers. As the number of human observers used to create the human “ground-truth” fixation map increases, its ability to predict other humans increases. The human baseline performance therefore depends on the number of humans used. The limit of human performance is 0.907 for ROC, 1 for Similarity, and 0 for EMD. The fit curves in the plots on the right indicate the rate at which performance approaches those limits.
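For readers who want to reproduce this kind of convergence analysis, the sketch below shows one way to compute the curve in the left ROC panel and fit its rate of convergence. It is a minimal Python sketch, not the implementation used for the figure: the helper names (fixation_map, auc_of_map), the blur width of 25 pixels, the 1000 randomly sampled negative locations, the split of observers into two disjoint groups of n, and the assumed limit of 0.9221 are all illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

def fixation_map(points, shape, sigma=25):
    """Blur discrete (row, col) fixation points into a continuous map."""
    fmap = np.zeros(shape)
    for r, c in points:
        fmap[int(r), int(c)] += 1
    return gaussian_filter(fmap, sigma)

def auc_of_map(saliency, test_points, n_neg=1000, seed=0):
    """ROC AUC of a map at separating held-out fixation points from
    randomly sampled image locations (the negative sampling is an assumption)."""
    rng = np.random.default_rng(seed)
    pos = np.array([saliency[int(r), int(c)] for r, c in test_points])
    neg = saliency.ravel()[rng.integers(0, saliency.size, n_neg)]
    labels = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    return roc_auc_score(labels, np.r_[pos, neg])

def convergence_curve(fixations_per_observer, shape, max_n=20):
    """AUC of an n-observer fixation map at predicting n other observers."""
    obs = list(fixations_per_observer)  # one list of (row, col) points per observer
    aucs = []
    for n in range(1, max_n + 1):
        group_a, group_b = obs[:n], obs[n:2 * n]
        fmap = fixation_map([p for o in group_a for p in o], shape)
        test = [p for o in group_b for p in o]
        aucs.append(auc_of_map(fmap, test))
    return np.array(aucs)

def fit_convergence(aucs, limit=0.9221):
    """Fit log10(limit - AUC) = a*log10(n) + b, as in the right-hand panel."""
    n = np.arange(1, len(aucs) + 1)
    a, b = np.polyfit(np.log10(n), np.log10(limit - aucs), 1)
    return a, b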
Finding optimal blur and center parameters for models under ROC metric

[Figure 2: Top left, "Scores increase with some center weight": area under ROC versus increasing weight of center (orig, 0.1, ..., 1.0) for the Judd, GBVS, Center, BruceTsotsos, Itti&Koch2, ContextAware, Torralba, SUNsaliency, HaoZhang, Itti&Koch, Achanta, and Chance models. Top right, "Scores increase slightly with blur": area under ROC versus increasing amount of blur (0-100) for the same models. Bottom, "Performance of models improves when blurred or combined with center map": area under the ROC curve (higher is better) for the original, blurred, and centered version of each model, with error bars showing standard error over images.]
Fig. 2. We optimize the blur and center parameters for each model under the area under the ROC metric. The top left panel shows the change in performance with increasing weight of the center bias, tested on 100 images from the MIT ICCV 2009 data set. The top right panel shows the change in performance with increasing blur on the same test images. While blurring yields only a slight improvement per model, adding a center bias greatly improves many models. The best center bias and blur parameters for each model were chosen and then used to measure the updated performance on the benchmark data set. Updated performances are shown in the bottom chart.
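A minimal sketch of how such a parameter sweep could be run. The Gaussian center prior, the normalization, and the grids of candidate center weights and blur sigmas are assumptions, as are the helper names; score_fn stands in for whichever metric is being optimized (e.g., an ROC AUC routine like the one sketched under Figure 1).

import numpy as np
from scipy.ndimage import gaussian_filter

def normalize(m):
    m = m - m.min()
    return m / (m.max() + 1e-12)

def center_map(shape, sigma_frac=0.25):
    """Isotropic Gaussian centered in the image (a simple center prior)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    sigma = sigma_frac * min(h, w)
    g = np.exp(-((ys - h / 2.0) ** 2 + (xs - w / 2.0) ** 2) / (2 * sigma ** 2))
    return g / g.max()

def add_center(saliency, weight):
    """Convex combination of a saliency map with the center prior."""
    return (1 - weight) * normalize(saliency) + weight * center_map(saliency.shape)

def blur(saliency, sigma):
    """Gaussian-blur a saliency map (sigma in pixels)."""
    return gaussian_filter(normalize(saliency), sigma) if sigma > 0 else normalize(saliency)

def pick_parameters(saliency_maps, fixations, score_fn,
                    weights=np.linspace(0.0, 1.0, 11),
                    sigmas=(0, 10, 20, 30, 40, 60, 80, 100)):
    """Sweep center weight and blur sigma separately; keep the values that
    maximize the mean score over a set of tuning images.
    (For EMD, where lower is better, use min instead of max.)"""
    def mean_score(transform):
        return np.mean([score_fn(transform(s), f)
                        for s, f in zip(saliency_maps, fixations)])
    best_w = max(weights, key=lambda w: mean_score(lambda s: add_center(s, w)))
    best_sigma = max(sigmas, key=lambda sg: mean_score(lambda s: blur(s, sg)))
    return best_w, best_sigma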
Finding optimal blur and center parameters for models under Similarity metric

[Figure 3: Top left, "Scores increase with some center weight": Similarity versus increasing weight of center (orig, 0.1, ..., 1.0) for the same set of models as in Figure 2. Top right, "Scores increase slightly with blur": Similarity versus increasing amount of blur (0-100). Bottom, "Performance of models improves when blurred or combined with center map": Similarity (higher is better) for the original, blurred, and centered version of each model, with error bars showing standard error over images.]
Fig. 3. We optimize the blur and center parameters for each model under the Similarity metric. See Figure 2 for further description.
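For reference, the Similarity score between two maps can be computed as a histogram intersection after normalizing each map to sum to one; the short sketch below assumes that definition, and the benchmark's exact normalization may differ.

import numpy as np

def similarity(map1, map2):
    """Histogram-intersection similarity between two maps, each treated as a
    probability distribution: 1 for identical maps, 0 for non-overlapping ones."""
    p = map1 / (map1.sum() + 1e-12)
    q = map2 / (map2.sum() + 1e-12)
    return float(np.minimum(p, q).sum())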
Finding optimal blur and center parameters for models under EMD metric

[Figure 4: Top left, "Scores decrease with some center weight": Earth Mover's Distance versus increasing weight of center (orig, 0.1, ..., 1.0) for the same set of models. Top right: Earth Mover's Distance versus increasing amount of blur (0-100). Bottom, "Performance of models improves when blurred or combined with center map": Earth Mover's Distance (lower is better) for the original, blurred, and centered version of each model, with error bars showing standard error over images.]
Fig. 4. We optimize the blur and center parameters for each model under the EMD metric. See Figure 2 for further description.
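Computing EMD between full-resolution maps is expensive, so it is typically computed on a downsampled grid. The sketch below shows one way to do this in Python with the POT (Python Optimal Transport) library; the downsampling factor, the Euclidean ground distance, and the function name emd_between_maps are assumptions, and the figure may have been produced with a different EMD implementation.

import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def emd_between_maps(map1, map2, grid=32):
    """Earth Mover's Distance between two same-sized maps, treated as
    distributions over a coarse grid to keep the transport problem small."""
    def prep(m):
        h, w = m.shape
        small = m[::max(1, h // grid), ::max(1, w // grid)]
        return small / (small.sum() + 1e-12)
    a, b = prep(map1), prep(map2)
    # Ground distance: Euclidean distance between grid-cell coordinates.
    coords = np.argwhere(np.ones(a.shape, dtype=bool)).astype(float)
    cost = ot.dist(coords, coords, metric='euclidean')
    return ot.emd2(a.ravel(), b.ravel(), cost)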
[Figure 5: Top row, "Model performances on images binned by fixation consistency": area under ROC scores (higher is better), Similarity scores (higher is better), and EMD scores (lower is better) for the Chance, Cont.Awe, GBVS, Itti&Koch, and Center models on images of high, medium, and low fixation consistency. Bottom row, "Model performances on images binned by number of fixations per image": the same three metrics and models on images with few fixations (176-281), a medium number of fixations (282-311), and many fixations (312-456) from 39 viewers.]
Fig. 5. Top row: performance of 5 models under all three metrics on images binned by fixation consistency. As images become less consistent, scores get slightly worse under the ROC metric but slightly better under the Similarity and EMD metrics. On images with low consistency, saliency maps often predict low saliency everywhere, which produces better Similarity and EMD scores but worse ROC scores. Bottom row: the same trend is seen when images are binned by the number of fixations per image.
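The binning itself is straightforward; a sketch, assuming per-image consistency values (e.g., inter-observer similarity) and per-image model scores have already been computed, with scores_by_consistency as an illustrative helper name:

import numpy as np

def scores_by_consistency(consistency, model_scores, n_bins=3):
    """Split images into equal-sized bins by inter-observer consistency and
    report each model's mean score per bin, from low to high consistency."""
    order = np.argsort(consistency)  # image indices, least to most consistent
    bins = np.array_split(order, n_bins)
    return {name: [float(np.mean(np.asarray(s)[idx])) for idx in bins]
            for name, s in model_scores.items()}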
Images where humans are very consistent, but models perform poorly.

[Figure 6: six example images (a)-(f), each shown with its human fixation map and Similarity scores from 3 top models:
(a) Judd 0.351, GBVS 0.108, BruceTsotsos 0.073
(b) Judd 0.182, GBVS 0.282, BruceTsotsos 0.276
(c) Judd 0.184, GBVS 0.235, BruceTsotsos 0.19
(d) Judd 0.211, GBVS 0.253, BruceTsotsos 0.21
(e) Judd 0.438, GBVS 0.261, BruceTsotsos 0.171
(f) Judd 0.307, GBVS 0.309, BruceTsotsos 0.197]
Fig. 6. Current models have a very hard time predicting fixations on some images even though human fixations are very consistent. Shown here are the images for which the ratio of the average performance of 3 of the top-performing models to the consistency of humans was lowest. For example, in image (d) observers read the text about rip currents rather than looking at the bright orange colors; the saliency models do not locate the text at all. In image (e) observers fixate on the monkeys' faces while the saliency models pick up the bright white paper and orange mushroom in the background.
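The selection criterion described above can be written compactly; a sketch assuming per-image Similarity scores for the three top models and a per-image human-consistency score, with hardest_images as an illustrative helper name:

import numpy as np

def hardest_images(model_scores, human_consistency, k=6):
    """Indices of the images with the lowest ratio of mean model score to
    human consistency: images people agree on but models predict poorly."""
    mean_model = np.mean(np.stack([np.asarray(s) for s in model_scores.values()]), axis=0)
    ratio = mean_model / (np.asarray(human_consistency) + 1e-12)
    return np.argsort(ratio)[:k]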