Multi-Feature Based Visual Saliency Detection in Surveillance Video

Visual Saliency Perception Oriented Surveillance Video
Yubing Tong
Abstract:
A video sequence is more complicated than a single image because of the additional temporal correlation, and motion information strongly affects how a video is perceived: moving objects produce the differences between neighboring frames that viewers usually focus on. So far, most papers have addressed image saliency analysis, but few have addressed video saliency analysis. Surveillance video is a conventional kind of video widely used in real life. A new visual saliency perception model oriented to surveillance video is proposed in this paper. First, a new method for background generation and foreground extraction is developed based on scene understanding in surveillance video. Then a stationary saliency model is set up to obtain a stationary saliency map based on multiple features of the foreground. Every feature is analyzed with a multi-scale Gaussian pyramid, and all the feature conspicuity maps are combined with different weights. The stationary model integrates faces as a high-level feature, as a supplement to low-level features such as color, intensity and orientation. The motion saliency map is calculated from statistics of the motion vector field. Finally, the motion saliency map and the stationary saliency map are fused with distance weights from a 2D Gaussian distribution.
Compared with gaze maps of surveillance videos obtained from subjective experiments with an eye tracker, the output of the proposed visual saliency perception model is close to the gaze map.
Keywords: Visual saliency detection, motion saliency, background generation, face detection
1. Introduction
Under natural viewing conditions, humans tend to fixate on specific parts of an image or a video that naturally evoke our interest. These regions carry most of the information needed for our interpretation of the scene. A video contains more information than a single image, and the perception of video also differs from that of a single image because of the additional temporal correlation. Itti's attention model and GAFFE are two typical stationary image saliency analysis methods [1, 2]. Both of them adopt a bottom-up visual attention mechanism: low-level image features including intensity, orientation and color or contrast are used to construct feature conspicuity maps, which are then integrated into the final saliency map with the WTA (Winner-Take-All) and IOR (Inhibition of Return) principles of the visual nervous system. In [3], a simple frequency-tuned saliency detection model is proposed: the image saliency map is considered to contain a wide range of frequencies, and the outputs of several band-pass filters are combined via DoG (Difference of Gaussians) to compute the saliency map. In [4], the image saliency map is calculated from the image reconstructed by the phase spectrum and inverse Fourier transform, which reflects the contour. This may not be enough, since the contour of an image is far from containing all the information in the image.
Usually a video is viewed as an image sequence displayed frame by frame at a certain frame rate. Through video display, we get a vivid perception of the real scene in terms of factors such as who, where and what [5]. Published work has contributed much more to stationary image saliency maps than to video saliency maps. Video saliency involves more information and is more complicated than image saliency. Some papers mainly consider the motion saliency map [6, 7, 8, 9]. Motion is an important component of video, but what we perceive in a video is more than motion alone; other information such as distance and depth is also involved. Motion in a video usually helps to find objects or regions whose motion is salient or discontinuous with respect to the context. In [6], a raw motion map is described using the difference of neighboring frames, and the final saliency map is obtained by smoothing the raw motion map with a uniform 5x5 field. In [7], spatio-temporal saliency is used to predict gaze direction for short videos: successive frames are preprocessed with a retina model and cortical-like filters, then the stationary saliency map is obtained from the interaction between the filters, and the motion saliency is obtained from the modulus of the motion vectors derived from the optical flow equation. As for the motion vector, besides the modulus, the angle or motion direction is also important and necessary for motion description, but it is omitted in [7]. In [8], a motion attention model is built from motion intensity, a spatial coherence inductor and a temporal coherence inductor; the coherence inductors are calculated from spatial and temporal phase histogram distributions. In [9], the continuous rank of the eigenvalues of the coefficient matrix derived from the neighborhood optical flow equation is used as a measurement of motion saliency.
The above saliency map methods are based on the bottom-up model: the motion feature and other stationary features including color, orientation and intensity are viewed as low-level features at the bottom, every feature is individually analyzed to obtain a feature conspicuity map, and the maps are finally combined into the saliency map. In fact, human perception is more complicated, and both bottom-up and top-down frameworks should be involved. For example, after the first several frames of a video, we may unconsciously search for similar objects in the following frames, and our eyes also catch those objects more easily, especially human faces. For surveillance videos, this unconscious searching is even more distinct: after the first frames, some scene information such as the background is confirmed, and the detected foreground objects are very likely to be focused on in the following frames. So in this paper, scene understanding is added for saliency detection in surveillance video. Based on the knowledge of the scene, we obtain a further perception of the video content including stationary and motion saliency maps, and finally the different saliency maps are fused with distance weights.
The paper is organized as follows: background generation and foreground extraction based on scene understanding are proposed in Section 2, the perception model for saliency detection is proposed in Section 3, experiments comparing the proposed method with Itti's model and other models are reported in Section 4, and the conclusion is given in Section 5.
2. Scene understanding & background generation in surveillance video
For scene understanding in video, three factors are necessarily included: who, where and what. These factors are usually expressed as foreground objects, background and motion & events [10]. For surveillance video, after the background is confirmed, our attention is focused on the moving part of the foreground. So the background is generated first, followed by foreground extraction; both are the basis for further saliency analysis in surveillance video.
Here we focus on non-parametric background generation algorithms because of their simple and intuitive theoretical foundation. In [11], mean shift is used to obtain the background pixel value by iteratively searching all the pixel values occurring in a video sequence. In [12], QCH is used instead of mean shift to find the suitable histogram interval. In [13], a sliding window is used to find the candidate subsequence with the smallest difference between the current subsequence and the average value of the previous frames, and then a value is chosen in this subsequence under restrictions on the standard deviation and the number of occurrences; for every pixel in a video sequence, the average value of the previous frames must be updated, which involves many division operations. Here we propose a new and intuitive scene background generation algorithm for surveillance video based on a sliding window and binary tree searching. In [14], experimental evidence is provided that human memory for time-varying quality estimation seems to be limited to about 15 seconds, which also means that video presented more than 15 seconds earlier has little impact on the perception of the current frame. So here the background is also viewed as stable within such a limited time. In a long video sequence, the current background might need to be updated, but that belongs to another topic and is not covered in this paper; here the background is considered stable.
2.1 Background generation
For surveillance video, the background is considered stable, while foreground objects that appear and change affect the current pixel value greatly but only temporarily [12]. Several frames of a surveillance video are shown in Figure 1, and Figure 2 shows the intensity variation at the center of the dark circle marked in Figure 1.
Figure 1. Four frames of a station surveillance video
Figure 2. Pixel intensity variation at the center of the dark circle
Background generation aims to find the most probable intensity and color value of every pixel. In [11], the observed value of every pixel depends on several effect component descriptions (ECD) as follows:

V_obsv = C + N_sys + M_obj + M_bgd + S_illum + D_cam          (1)

where V_obsv is the observed value of the scene, C is the ideal background scene, N_sys is the system noise, M_obj is the moving object, M_bgd is the moving background, S_illum is the long-term illumination change and D_cam is the camera displacement. Considering the scene understanding and background generation requirements of surveillance video mentioned above, we omit camera movement and background movement here, and both the image sensor noise N_sys and the long-term illumination change are merged into N_noise. A simplified observation model for background generation and foreground extraction is then given as follows:

V_obsv = V_background + N_noise + V_foreground          (2)
N_noise can be treated as the allowed variation range when confirming the background pixel intensity value. In [11], a mean shift algorithm is used to search for the most frequently occurring value of every pixel in [0, 255]. Considering the noise in scene understanding for surveillance video, the current background pixel value should lie within a small range, and a sliding window mechanism is closer to human perception of the background. For example, we might make a preliminary decision on whether the current pixel belongs to the background after the first frames of a video; as the video plays, just like moving a sliding window, if there is no change, or only a small change that lasts a very short time, we confirm our decision. So here we give a new and intuitive definition of the sliding window, and a binary tree searching algorithm is used to operate the sliding window instead of simply moving it step by step as the frame number increases as in [13].
Figure 3. Sliding window for background generation
Figure 3 shows a red rectangular sliding window superimposed on the pixel intensity curve. Every background sliding window W_i is defined by the following attributes: window length l_i, height h_i, mean value μ_i, standard deviation σ_i and the number of pixel samples falling inside the window n_i. Background generation is then equivalent to finding the best solution of the following constrained program:

min  σ_i / n_i ,   i = 1, ..., k          (3)
s.t.  σ_i ≤ σ_0 ,  n_i ≥ n_0

where k is the current number of windows, and σ_0 and n_0 are constant values.
Since the background is viewed as stable and occurs frequently in the video sequence, the background value will definitely occur in the earlier part of the video, in the later part, or in both. A sliding window searching method is therefore proposed here, but the sliding window is moved with a binary tree searching algorithm in "jumping" mode rather than "sliding" mode as in [13]. In [15], seeding points are chosen first and then sliding windows are constructed centered on those seeding points; our proposed method is simpler and more intuitive, as shown in Figure 4.
Figure 4. Binary tree searching: the range of 2^N frames is repeatedly split into left and right halves of 2^(N-i) frames
The search procedure can be described with the following pseudocode:

Step 1. Choose 2^N frames of the surveillance video for background generation; the initial search
        range is all 2^N frames, with scale = N.
Step 2. Divide the current search range into a left half and a right half, each of length 2^(scale-1).
Step 3. If scale is small enough (scale < 2), go to Step 5; else go to Step 4.
Step 4. Compute the criterion T = σ_k / n_k for the sliding window built on each half.
        If T_left < T_right, keep the left half as the new search range;
        else keep the right half.
        Set scale = scale - 1 and go to Step 2.
Step 5. Take the average value within the current sliding window as the background pixel value.
Step 6. Stop background generation.
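To make the procedure concrete, the following Python sketch shows one possible per-pixel implementation of this binary-tree window search (a minimal illustration assuming grayscale frames and the σ/n criterion above; it is not the exact code used in our experiments):

import numpy as np

def background_pixel(values, min_len=4):
    # Estimate the background value of one pixel from its temporal samples
    # using the binary-tree ("jumping") window search described above.
    # `values` is a 1-D array of the pixel's intensity over 2**N frames.
    window = np.asarray(values, dtype=np.float64)
    while window.size > min_len:                   # Step 3: stop when the window is small enough
        half = window.size // 2
        left, right = window[:half], window[half:] # Step 2: split the current range in two
        # Step 4: criterion T = sigma / n, keep the half with the smaller value
        t_left = left.std() / left.size
        t_right = right.std() / right.size
        window = left if t_left < t_right else right
    return window.mean()                           # Step 5: average inside the final window

def generate_background(frames):
    # frames: array of shape (T, H, W) holding T = 2**N grayscale frames.
    T, H, W = frames.shape
    bg = np.empty((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            bg[y, x] = background_pixel(frames[:, y, x])
    return bg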
After the background value of every pixel has been searched, a median filter is applied to remove points that belong to the background but appear in the foreground. Some background pixel values can be estimated from only a few key frames to save computation; for example, the pixels in the top-left corner of Figure 5 are stable throughout the whole video sequence. Figure 5 shows the results of background generation: (a) and (b) are two examples of the results of the above background generation algorithm, where three frames of each video are shown and the bottom-right image is the generated background.
(a) Background generation: example 1
(b) Background generation: example 2
Figure 5. Examples of background generation
2.2 Foreground extraction
The foreground can be extracted by comparing the background with the current frame. Taking the noise in scene understanding into account, the foreground is extracted with the following equation derived from equation (2):

V_foreground = V_obsv - V_background - N_noise          (4)
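As an illustration, a minimal Python sketch of this thresholded background subtraction is given below (the noise level is an assumed placeholder, not a value prescribed by the paper):

import numpy as np

def extract_foreground(frame, background, noise_level=15):
    # Sketch of equation (4): pixels whose deviation from the background
    # exceeds the assumed noise range N_noise are kept as foreground.
    # `noise_level` is an illustrative threshold.
    diff = np.abs(frame.astype(np.float64) - background)
    mask = diff > noise_level          # foreground if the change exceeds the noise range
    foreground = np.where(mask, frame, 0)
    return foreground, mask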
Figure 6 shows some foreground extraction results. The foreground objects are extracted except for a few points inside the objects; those points were classified as background because their intensity and color are almost equal to those of the background (for example, some points are missing from the foreground objects inside the marked blue circle). With additional information, such as the motion vectors within the same object or the same macroblock, this could be improved. Since motion saliency based on motion vectors is analyzed later in this paper, the current foreground objects are acceptable for stationary saliency analysis.
Figure 6. Foreground extraction.
3. Multi-feature perception model for saliency detection
3.1 Framework and algorithms
Figure 7. Framework for saliency detection
Figure 7 shows the framework for saliency detection in surveillance video. Based on the results of background generation and foreground extraction, the stationary saliency is computed from multi-feature conspicuity, including faces and low-level features such as color, intensity and orientation; the motion saliency is calculated from motion analysis together with the effect of distance on visual perception. The algorithms are covered in the following three sections:
(1) the stationary saliency model with faces as a high-level feature and intensity, color and orientation as low-level features;
(2) the motion saliency model with motion vector field measurements and distance weights from a Gaussian model;
(3) the fusion method for the stationary saliency map and the motion saliency map.
3.2 Multi-feature stationary saliency
Low-level features such as intensity, color and orientation contribute much to our attention in Itti's bottom-up attention framework. Every feature is analyzed using Gaussian pyramids at multiple scales, and 7 feature maps are generated: one intensity map, four orientation maps (at 0, 45, 90 and 135 degrees) and two color-opponency maps (red/green and blue/yellow). After a normalization step, all these feature maps are summed into 3 conspicuity maps: the intensity conspicuity map C_i, the color conspicuity map C_c and the orientation conspicuity map C_o. Finally, the conspicuity maps are combined to obtain the saliency map according to the following equation:

S_Itti = (1/3) Σ_{k ∈ {i,c,o}} C_k          (5)
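For illustration, the sketch below shows the multi-scale center-surround idea behind this pipeline for the intensity channel only (the pyramid depth and the center/surround scale pairs are assumptions made for the example, not the exact configuration of Itti's implementation):

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels=7):
    # Build a Gaussian pyramid by repeated blur-and-downsample.
    pyr = [img.astype(np.float64)]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyr[-1], sigma=1.0)
        pyr.append(blurred[::2, ::2])        # downsample by 2
    return pyr

def center_surround(pyr, pairs=((2, 5), (3, 6))):
    # Across-scale differences |center - surround| resized to the base resolution.
    h, w = pyr[0].shape
    maps = []
    for c, s in pairs:
        if s >= len(pyr):
            continue
        center = zoom(pyr[c], (h / pyr[c].shape[0], w / pyr[c].shape[1]), order=1)
        surround = zoom(pyr[s], (h / pyr[s].shape[0], w / pyr[s].shape[1]), order=1)
        maps.append(np.abs(center - surround))
    return maps

def intensity_conspicuity(gray):
    # Normalize and sum the center-surround maps into one conspicuity map C_i.
    maps = center_surround(gaussian_pyramid(gray))
    maps = [(m - m.min()) / (m.max() - m.min() + 1e-9) for m in maps]
    return sum(maps) / len(maps)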
Faces attract more attention than many other features in images; psychological tests have shown that faces, heads or hands are perceived prior to other details [16]. So faces can be used as a high-level feature for the saliency map. One drawback of Itti's visual attention model is that its saliency map is not well adapted to images containing faces. Several studies in face recognition have shown that skin hue features can be used to extract face information. To detect heads and hands in images, we use the face recognition and location algorithm of Walther et al. [17]. This algorithm is based on a Gaussian model of the skin hue distribution in the (r', g') color space. For a given color pixel (r', g'), the model's hue response is defined as

h(r', g') = exp( -(1/2) [ (r' - μ_r)² / σ_r² + (g' - μ_g)² / σ_g² - ρ (r' - μ_r)(g' - μ_g) / (σ_r σ_g) ] )          (6)

r' = r / (r + g + b),    g' = g / (r + g + b)          (7)

where (μ_r, μ_g) is the mean of the skin hue distribution, σ_r² and σ_g² are the variances of the r' and g' components, and ρ is the correlation between the components r' and g'. These parameters were statistically estimated from 1153 photographs containing faces. The function h(r', g') can be considered as a color variability function around a given hue.
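A minimal Python sketch of this hue response is given below; the numeric skin-hue parameters are placeholders for illustration, not the values estimated from the 1153 photographs:

import numpy as np

# Illustrative skin-hue parameters (placeholders only).
MU_R, MU_G = 0.43, 0.31
SIG_R, SIG_G = 0.05, 0.03
RHO = 0.5

def skin_hue_response(rgb):
    # Sketch of equations (6)-(7): chromaticity (r', g') and Gaussian hue response.
    # `rgb` is an array of shape (H, W, 3).
    rgb = rgb.astype(np.float64)
    total = rgb.sum(axis=2) + 1e-9
    r_p = rgb[..., 0] / total                      # r' = r / (r + g + b)
    g_p = rgb[..., 1] / total                      # g' = g / (r + g + b)
    dr, dg = r_p - MU_R, g_p - MU_G
    quad = dr**2 / SIG_R**2 + dg**2 / SIG_G**2 - RHO * dr * dg / (SIG_R * SIG_G)
    return np.exp(-0.5 * quad)                     # equation (6)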
The stationary saliency based on multi-feature conspicuity can then be described as

S_S = f(S_Itti, S_Face)          (8)

Here we choose a weighted-sum model:

S_S = (1/8) (2C_i + 2C_c + C_o + 3C_F)          (9)
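A short sketch of this weighted combination (assuming all conspicuity maps are already normalized to the same shape and range) could be:

import numpy as np

def stationary_saliency(C_i, C_c, C_o, C_F):
    # Weighted fusion of equation (9); inputs are numpy arrays of identical
    # shape, each assumed normalized to [0, 1].
    return (2 * C_i + 2 * C_c + C_o + 3 * C_F) / 8.0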
For most images containing faces, heads or hands, the model with skin hue detection gives better results than Itti's model, i.e. more accurate saliency maps. The example given in this paper shows the difference between Itti's model and the stationary model on a face image: "I18" is an original reference image from TID2008 [21] containing a face, eyes and hands, shown in Figure 8 (a); Figure 8 (b) shows the saliency map from the mixed model and Figure 8 (c) shows the saliency map from Itti's model. The result of the mixed model appears more reliable: besides the regions common to both maps, Figure 8 (b) is closer to our real perception, since we usually focus on the neighborhood of the eyes because we tend to observe, find and understand facial expressions.
(a) I18 image in TID2008; (b) the saliency map from the mixed model; (c) the saliency map from Itti's model.
Figure 8. Salient regions from Itti's model and the mixed model
3.3 Motion saliency map and Gaussian weights model
The motion feature must be involved in the video saliency map, as it plays a very important role in surveillance video. Through motion perception we know what is happening in the current scene, and some regions or objects that are salient in still images are not salient in video; for example, rich texture detail of an object is ignored in video with fast motion.
In this paper, the motion information of a video is analyzed through its motion vector field, which is calculated by motion estimation with one or more reference frames. Here we use full-search block matching, as normally used in video compression, to find the best motion vectors. The following shows an example of motion estimation and the resulting motion vector field.
(a) #62 frame in video sequence
(b) motion vectors of #62 frame
Figure 9. Motion vector field
Here frame #62 is viewed as the current frame, with its previous frame #61 as the reference; frame #62 is shown in Figure 9 (a). We can see that there are motion vectors inside the blue circle caused by light flicker or other noise, although these regions should be still. Fortunately, the effect of such pseudo motion can be cut off by background extraction.
Based on the motion vector field, the intensity of the motion vector and the spatial and temporal coherence of the motion are used to describe the motion saliency map [8]. The intensity of the motion vector is computed as

I = sqrt( mvx² + mvy² )          (10)

Besides the intensity, the phase φ of the motion vector is also analyzed:

φ = arctan( mvy / mvx )          (11)

After normalization, φ is distributed in [0, 360].
Within the motion vector field, the motion vectors of the blocks neighboring the current block C, such as blocks A and B, are also analyzed. The probability density p_i is computed from the histogram of the φ values within an overlapping neighborhood of size k x k; 7x7 is used here. Figure 10 shows an example of the motion vectors in a neighborhood and the corresponding histogram of their phase values.
(a) Motion vector field in the 7th frame
(b) Histogram of the neighborhood of the 145th MB
Figure 10. Motion vector field and phase value histogram; in (a), the central blue rectangle is the current MB and the red rectangle is its neighborhood; 9 histogram bins are used here
The spatial motion coherence Cs is calculated as

Cs = - Σ_{i=1}^{N} p_i · lg p_i          (12)

where N is the number of histogram bins of the φ values in the k x k neighborhood. The temporal coherence Ct is defined in the same way over consecutive frames. The motion saliency map is then computed as

SM = I · Ct · (1 - I · Cs)          (13)

where I is the motion intensity computed from the magnitude of the motion vector, Ct is the temporal coherence inductor and Cs is the spatial coherence inductor based on the statistics of the spatial and temporal phase histograms.
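A sketch of equations (10)-(13) on a block-level motion vector field might look as follows (only the spatial coherence term is computed for a single frame; the temporal coherence Ct would be accumulated the same way over consecutive frames, and the 7x7 neighborhood and 9 bins follow the values quoted above):

import numpy as np

def motion_saliency(mvx, mvy, k=7, bins=9):
    # mvx, mvy: per-block motion vector components of shape (rows, cols).
    # Returns the spatial part I * (1 - I * Cs) of equation (13).
    intensity = np.sqrt(mvx**2 + mvy**2)                     # equation (10)
    intensity = intensity / (intensity.max() + 1e-9)         # normalize to [0, 1]
    phase = np.degrees(np.arctan2(mvy, mvx)) % 360.0         # equation (11), in [0, 360)
    rows, cols = phase.shape
    cs = np.zeros_like(intensity)
    r = k // 2
    for y in range(rows):
        for x in range(cols):
            patch = phase[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 360.0))
            p = hist / hist.sum()
            p = p[p > 0]
            cs[y, x] = -(p * np.log2(p)).sum()               # equation (12), phase entropy
    cs = cs / (cs.max() + 1e-9)
    return intensity * (1.0 - intensity * cs)                # spatial part of equation (13)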
3.4 Fusion model of the stationary and motion saliency maps
The stationary saliency map SS and the motion saliency map SM are fused to obtain the final saliency map of every frame of a video. Since we focus more easily on objects appearing near the center of the observation window than on objects far from the center, we propose a distance-weighted fusion model:

SV_G = α · SM · w_i_mb + (1 - α) · SS          (14)

w_i_mb = e^(-d/8),   d = sqrt( (x_i - x_c)² + (y_i - y_c)² ),   x_c = width_mb / 2,   y_c = height_mb / 2          (15)

where width_mb and height_mb are the width and height of the frame in macroblocks (16x16), and w_i_mb is the block distance weight of the 2D Gaussian distribution shown in Figure 11.
(a) block position
(b) 2-D Gaussian distance model
Figure 11. Block distance weights model
Besides the above fusion method, some other fusion modes are also designed for comparison, including the mean, max and pixel multiplication fusion modes:

SV_mean(x, y, k) = ( SM + SS ) / 2          (16)

SV_max(x, y, k) = Max( SM, SS )          (17)

SV_multip(x, y, k) = SM · SS          (18)
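The sketch below illustrates the four fusion modes of equations (14)-(18); the weight α and the assumption that the frame size is a multiple of the 16x16 block size are illustrative choices, not values prescribed by the paper:

import numpy as np

def fuse_saliency(SM, SS, alpha=0.5, block=16):
    # SM, SS: motion and stationary saliency maps of the same shape (H, W),
    # with H and W assumed to be multiples of `block`.
    H, W = SM.shape
    rows, cols = H // block, W // block
    yc, xc = rows / 2.0, cols / 2.0
    by, bx = np.mgrid[0:rows, 0:cols]
    d = np.sqrt((bx - xc) ** 2 + (by - yc) ** 2)
    w_mb = np.exp(-d / 8.0)                              # block distance weight, equation (15)
    w = np.kron(w_mb, np.ones((block, block)))[:H, :W]   # expand block weights to pixel level
    return {
        "gaussian": alpha * SM * w + (1 - alpha) * SS,   # equation (14)
        "mean": (SM + SS) / 2.0,                         # equation (16)
        "max": np.maximum(SM, SS),                       # equation (17)
        "multip": SM * SS,                               # equation (18)
    }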
4. Experiments
In this section, several surveillance video sequences recorded by ourselves are used in the subjective experiments and in our visual perception model. The target of video saliency detection is to make the video saliency map of our model mimic the gaze map derived from subjective experiments. 20 observers took part in the experiments with an SMI remote eye tracker [18]. First, the fixations corresponding to each image are recorded with the eye tracker and its software suite; then the fixations are normalized and Gaussian filtered to obtain a gaze map. The final gaze map is the mean of the 20 observers' gaze maps. An example of a gaze map obtained from an original image is given in Figure 12: (a) is the original image, (b) is the corresponding gaze map from the subjective experiments, and (c) is the mixed image with the gaze map superimposed over the original image.
(a) Original image
(b) Gaze map
(c) Mixed image of gaze map and image
Figure 12. Original image, gaze map and mixed image with gaze map superimposed over original image
Figure 13 shows two frames of a surveillance video with the corresponding mixed gaze-map and saliency-map images: (a) and (d) are two original frames of the video; (b) and (e) are the mixed images of the gaze map and the original image; (c) and (f) are the mixed images with the saliency map superimposed over the original image.
(a)Original image
(b) Gaze map
(c) Saliency map
(d) Original image
(e) Gaze map
(f) Saliency map
Figure 13. Frame images, gaze maps and saliency maps
Besides subjective comparison, various criteria can be used for the comparison, such as distance measures and ROC [19]. Here the NSS (Normalized Scanpath Saliency) is used to estimate the relationship between the gaze map and the saliency map, as in [7, 20]. For example, the NSS between the gaze map and the mean saliency map of equation (16) is computed as

NSS(k) = ( mean(SV_m_G(x, y, k)) - mean(SV_m(x, y, k)) ) / σ_SV_m(x, y, k)          (19)

SV_m_G(x, y, k) = GV(x, y, k) × SV_m(x, y, k)          (20)

where GV(x, y, k) is the human eye gaze map normalized to unit mean, SV_m(x, y, k) is the saliency map from the detection model and σ_SV_m(x, y, k) is its standard deviation. NSS is a standard score that expresses the divergence of the experimental result from the mean saliency as a number of standard deviations of the model; the larger the score, the less probable it is that the experimental result is due to chance. mean(SV_m_G(x, y, k)) can be viewed as the mean of the saliency values at the eye positions and mean(SV_m(x, y, k)) as the mean of the saliency values over the whole frame. If NSS is positive, the eye positions tend to be on salient regions; if NSS is negative, the eye positions tend to be on non-salient regions; if NSS is zero, there is no link between eye position and saliency. In [22], the subjective gaze map is called the real eye movement gaze map; besides this gaze map, a randomized eye movement gaze map is also introduced for comparison. The randomized eye gaze map associates with each frame of the current video the eye movements of subjects recorded while they were looking at another video clip.
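A compact sketch of this criterion (equations (19)-(20)), under the assumption that both maps are given as arrays of the same shape, is:

import numpy as np

def nss(gaze_map, saliency_map):
    # Normalized Scanpath Saliency, equations (19)-(20).
    g = gaze_map / (gaze_map.mean() + 1e-9)                  # G_V normalized to unit mean
    s = saliency_map.astype(np.float64)
    weighted = g * s                                         # equation (20)
    return (weighted.mean() - s.mean()) / (s.std() + 1e-9)   # equation (19)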
Since NSS is used to compare the saliency map from our model with the gaze map, a model that predicts eye movements well should have a high NSS against real eye movements and, at the same time, a low NSS against randomized eye movements. Table 1 gives the NSS values with real and randomized eye movements for four saliency maps derived from the different fusion methods for the stationary and motion saliency maps.
Table 1. Gaze map and saliency map comparison

Criteria                        Smean     Smax      Smulti    SGaussian
NSS on real gaze map            0.367     0.312     0.150     1.066
NSS on randomized gaze map      0.020     0.032     -0.15     0.195
We also compare our result with other saliency detection algorithms, including Itti's model, frequency-tuned saliency detection and phase spectrum saliency, as shown in Table 2.
Table 2. Comparison among saliency models

Criteria                        IT        FT (Frequency Tuned)   Phase Spectrum   Proposed
NSS on real gaze map            0.123     0.160                   0.002           1.066
NSS on randomized gaze map      0.136     0.189                  -0.044           0.195
According to the above data, the NSS on the real gaze map obtained with our proposed model is higher than that of the other methods (Itti's model, frequency-tuned and phase spectrum saliency), while the NSS on the randomized gaze map is lower, and the result with Gaussian distance weighting is the best. The reason is that Itti's model, frequency-tuned and phase spectrum saliency detection address stationary saliency without motion information, whereas our proposed saliency perception model includes motion information with distance weighting. The results show that our proposed method generates saliency maps close to the subjective gaze map. They also show that motion is very important for visual perception in video and should be fully used, and that saliency perception in video differs greatly from image saliency perception because of motion.
5. Conclusion
In this paper, a new visual saliency detection algorithm oriented to surveillance video is proposed. With the knowledge of scene understanding in surveillance video, background generation and foreground object extraction are analyzed, and then multiple features, including faces as a high-level feature and color, orientation and intensity as low-level features, are used to construct the stationary feature conspicuity. The motion saliency map is based on motion vector analysis, and the motion and stationary saliency maps are fused with Gaussian distance weights. Comparing the saliency map with the gaze maps of surveillance videos from subjective experiments shows that the output of the multi-feature based video saliency detection model is close to the gaze map. Here we mainly consider surveillance video with a stable background; next we will focus on more complicated scenes, such as those where the background and foreground objects are both moving, where a more refined algorithm is necessary to obtain suitable foreground objects for saliency analysis. The current binary tree search for background pixels may also involve too much computation, so neighborhood information and multi-scale techniques will be investigated for optimization.
References
[1] L. Itti, C. Koch and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis", IEEE Trans. PAMI, vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[2] U. Rajashekar, I. van der Linde, A. C. Bovik and L. K. Cormack, "GAFFE: A gaze-attentive fixation finding engine", IEEE Trans. Image Processing, vol. 17, no. 4, pp. 564-573, 2008.
[3] R. Achanta, S. Hemami, F. Estrada and S. Susstrunk, "Frequency-tuned salient region detection", CVPR 2009.
[4] Qi Ma and Liming Zhang, "Saliency-based image quality assessment criterion", ICIC 2008, LNCS 5226, pp. 1124-1133, 2008.
[5] L.-J. Li and L. Fei-Fei, "What, where and who? Classifying events by scene and object recognition", IEEE International Conference on Computer Vision (ICCV), 2007.
[6] B. Scassellati, "Theory of mind for a humanoid robot", Autonomous Robots, vol. 12, no. 1, pp. 13-24, 2002.
[7] S. Marat and T. Ho Phuoc, "Spatio-temporal saliency model to predict eye movements in video free viewing", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, 2008.
[8] Y.-F. Ma and H.-J. Zhang, "A model of motion attention for video skimming", ICIP 2002.
[9] Shan Li and M. C. Lee, "Fast visual tracking using motion saliency in video", ICASSP 2007, vol. 1, pp. 1073-1076, 2007.
[10] L.-J. Li, R. Socher and L. Fei-Fei, "Towards total scene understanding: classification, annotation and segmentation in an automatic framework", CVPR 2009.
[11] Yazhou Liu, Hongxun Yao, Wen Gao, Xilin Chen and Debin Zhao, "Nonparametric background generation", Journal of Visual Communication and Image Representation, vol. 18, pp. 253-263, 2007.
[12] D. Sidibe and O. Strauss, "A fast and automatic background generation method from a video based on QCH", Journal of Visual Communication and Image Representation, April 2009.
[13] Hanzi Wang and David Suter, "A novel robust statistical method for background initialization and visual surveillance", ACCV 2006, LNCS 3851, pp. 328-337, 2006.
[14] M. Pinson and S. Wolf, "Comparing subjective video quality testing methodologies", SPIE Video Communications and Image Processing Conference, Lugano, Switzerland, Jul. 8-11, 2003.
[15] A. J. Lipton, N. Haering, M. C. Almen, P. L. Venetianer, T. E. Slowe and Zhong Zhang, "Video scene background maintenance using statistical pixel modeling", United States Patent Application Publication, Pub. No. US 2004/0126014 A1, Jul. 1, 2004.
[16] R. Desimone, T. D. Albright, C. G. Gross and C. Bruce, "Stimulus-selective properties of inferior temporal neurons in the macaque", Journal of Neuroscience, vol. 4, pp. 2051-2062, 1984.
[17] D. Walther and C. Koch, "Modeling attention to salient proto-objects", Neural Networks, vol. 19, pp. 1395-1407, 2006.
[18] P. Sharma, "Perceptual image difference metrics - saliency maps & eye tracking", Jan. 20, 2008.
[19] B. W. Tatler, R. J. Baddeley and I. D. Gilchrist, "Visual correlates of fixation selection: effects of scale and time", Vision Research, vol. 45, pp. 643-659, 2005.
[20] R. J. Peters, A. Iyer, L. Itti and C. Koch, "Components of bottom-up gaze allocation in natural images", Vision Research, vol. 45, pp. 2397-2416, 2005.
[21] TID2008 page: http://www.ponomarenko.info/tid2008.htm.
[22] S. Marat, T. Ho Phuoc, L. Granjon and N. Guyader, "Modelling spatio-temporal saliency to predict gaze direction for short videos", International Journal of Computer Vision, vol. 82, no. 3, pp. 231-243, 2009.