Visual Saliency Perception Oriented Surveillance Video

Yubing Tong

Abstract: A video sequence is more complicated than a single image because of its additional temporal correlation, and motion information clearly modifies how a video is perceived. Moving objects produce the differences between neighboring frames, and these differences usually attract attention. So far, most papers have addressed image saliency analysis and few have addressed video saliency analysis. Surveillance video is a conventional kind of video widely used in real life. This paper proposes a new visual saliency perception model oriented to surveillance video. First, a new method for background generation and foreground extraction is developed based on scene understanding in surveillance video. Then a stationary saliency model is set up to obtain a stationary saliency map from multiple features of the foreground. Every feature is analyzed with multi-scale Gaussian pyramids, and the feature conspicuity maps are combined with different weights. The stationary model integrates faces as a high-level feature that supplements low-level features such as color, intensity and orientation. The motion saliency map is calculated from the statistics of the motion vector field. Finally, the motion saliency map and the stationary saliency map are fused with distance weights that follow a 2D Gaussian distribution. Compared with the gaze maps of surveillance videos obtained from subjective experiments with an eye tracker, the output of the proposed visual saliency perception model is close to the gaze map.

Keywords: Visual saliency detection, motion saliency, background generation, face detection

1. Introduction

Under natural viewing conditions, humans tend to fixate on the specific parts of an image or a video that naturally evoke their interest. These regions carry most of the useful information needed to interpret the scene.
Video contains more information than a single image, and the perception of video also differs from that of a single image because of the additional temporal correlation. Itti’s attention model and GAFFE are two typical saliency analysis methods for stationary images [1, 2]. Both adopt the bottom-up visual attention mechanism. Low-level image features including intensity, orientation and color or contrast are used to construct feature conspicuity maps, which are then integrated into the final saliency map with the WTA (Winner-Take-All) and IOR (Inhibition of Return) principles of the visual nervous system. In [3], a simple frequency-tuned saliency detection model is proposed. The image saliency map is considered to contain a wide range of frequencies, and the outputs of several band-pass filters are combined via DoG (Difference of Gaussians) to compute the saliency map. In [4], the image saliency map is calculated from the image reconstructed using the phase spectrum and the inverse Fourier transform, which reflects the contours. This may not be enough, since the contours of an image are far from containing all the information in the image.

Usually a video is viewed as an image sequence displayed frame by frame at a certain frame rate. Through the displayed video, we obtain a vivid perception of the real scene through factors such as who, where and what [5]. Papers have contributed more to stationary image saliency maps than to video saliency maps. Video saliency involves more information and is more complicated than image saliency. Some papers mainly consider the motion saliency map [6, 7, 8, 9]. Motion is an important part of video, but what we perceive in videos is more than motion alone; other information such as distance and depth is also involved. Motion in a video usually helps to find objects or regions whose motion is salient or discontinuous in contrast to the context. In [6], the raw motion map is described using the difference of neighboring images.
The final saliency map is obtained by smoothing the raw motion map with a uniform 5x5 field. In [7], spatio-temporal saliency is used to predict gaze direction for short videos. Successive frames are preprocessed with a retina model and cortical-like filters; the stationary saliency map is then obtained from the interaction between filters, and motion saliency is obtained from the modulus of the motion vector derived from the optical flow equation. For a motion vector, besides the modulus, the angle (the motion direction) is also important and necessary for describing motion, but it is omitted in [7]. In [8], a motion attention model is given with motion intensity, a spatial coherence inductor and a temporal coherence inductor. The coherence inductors are calculated from the spatial phase histogram and the temporal phase histogram distributions. In [9], the continuous rank of the eigenvalues of the coefficient matrix derived from the neighborhood optical flow equation is viewed as a measure of motion saliency.

The above saliency map methods are based on the bottom-up model. Motion and stationary features such as color, orientation and intensity are viewed as low-level features at the bottom. Every feature is analyzed individually for feature conspicuity, and the results are finally combined into the final saliency map. In fact, human perception is more complicated, and both bottom-up and top-down frameworks should be involved. For example, just after the first several frames of a video, we may unconsciously search for similar objects in the following frames. Our eyes also catch such objects more easily, especially human faces. For surveillance videos, this unconscious search may be even more distinct. After the first several frames, some scene information, such as the background, has been confirmed. Foreground objects are then detected and will very probably be focused on in the following frames.
So in this paper, scene understanding is added to saliency analysis for surveillance video. Then, based on knowledge of the scene, we obtain a further perception of the video content, including the stationary and motion saliency maps, and finally the different saliency maps are fused with distance weights. The paper is arranged as follows. Background generation and foreground extraction based on scene understanding are proposed in section 2, and the perception model for saliency detection is proposed in section 3. Experiments comparing the proposed method with Itti’s model and other models are presented in section 4, and the conclusion is given in section 5.

2. Scene understanding & background generation in surveillance video

Scene understanding in video necessarily involves three factors: who, where and what. These factors are usually expressed as foreground objects, background, and motion and events [10]. For surveillance video, once the background has been confirmed, our attention focuses on the moving part of the foreground. So the background is generated first, followed by foreground extraction. Both are the basis for further saliency analysis in surveillance video. Here we focus on non-parametric background generation algorithms because of their simple and intuitive theoretical foundation. In [11], mean shift is used to obtain the background pixel value by iteratively searching all the pixel values that emerge in a video sequence. In [12], QCH is used instead of mean shift to find a suitable interval of the histogram. In [13], a sliding window is used to find the candidate subsequence with the smallest difference between the current subsequence and the average value of the previous frames; a value in this subsequence is then chosen subject to constraints on the standard square error and the number of occurrences. For every pixel in a video sequence, the average value of the previous frames must be updated, which involves many division operations.
Here we propose a new and intuitive scene background generation algorithm for surveillance video based on a sliding window and binary tree searching. In [14], experimental evidence is provided that human memory for time-varying quality estimation seems to be limited to about 15 seconds, which also means that video presented more than 15 seconds earlier has little impact on the perception of the current frame. So here the background is also considered stable within such a limited time. In a long video sequence the current background might need to be updated, but this is a separate topic that is not covered in this paper. Here the background is considered to be stable.

2.1 Background generation

For surveillance video, the background is considered to be stable, and foreground objects that emerge and change affect the current pixel value greatly but temporarily [12]. Several frames of a surveillance video are shown in figure 1. Figure 2 shows the intensity variation at the center of the dark circle shown in figure 1.

Figure 1. Four frames of a station surveillance video

Figure 2. Pixel intensity variation at the center of the dark circle

Background generation finds the most probable intensity and color value for every pixel. In [11], the observed value of every pixel depends on several effect component descriptions (ECD) as follows,

$$V_{obsv} = C + N_{sys} + M_{obj} + M_{bgd} + S_{illum} + D_{cam} \qquad (1)$$

where $V_{obsv}$ is the observed value of the scene, $C$ is the ideal background scene, $N_{sys}$ is the system noise, $M_{obj}$ is the moving object, $M_{bgd}$ is the moving background, $S_{illum}$ is the long-term illumination change and $D_{cam}$ is the camera displacement. Considering the scene understanding and background generation requirements of surveillance video mentioned above, we omit the camera displacement and the background movement. Both the image sensor noise $N_{sys}$ and the long-term illumination change are included in $N_{noise}$.
Then a simplified observation model for background generation and foreground extraction is given as follows,

$$V_{obsv} = V_{background} + N_{noise} + V_{foreground} \qquad (2)$$

$N_{noise}$ can be considered as the variable range used to confirm the background pixel intensity value. In [11], the mean shift algorithm is used to search for the most frequently emerging value of every pixel in [0, 255]. Because of the noise in scene understanding in surveillance video, the background value of a pixel should lie within a small range. A sliding window mechanism may also be closer to human perception of the background. For example, we may make a preliminary decision on whether the current pixel belongs to the background after the first frames of a video. As the video plays, just like moving a sliding window, if there is no change, or only a small change that lasts a very short time, we confirm our decision. So here we give a new and intuitive definition of the sliding window, and a binary tree searching algorithm drives the window instead of moving it step by step as the frame number increases, as is done in [13].

Figure 3. Sliding window for background generation (window $W_i$ with height $h_i$ and length $l_i$)

The red rectangular sliding window superimposed on the pixel intensity curve is shown in figure 3. Every background sliding window is defined by the following attributes: window length ($l_i$), height ($h_i$), mean value ($\mu_i$), standard deviation ($\sigma_i$) and the number of pixel values that fall within the window ($n_i$). Background generation is then equivalent to finding the best solution of the following constrained program,

$$\min_i \frac{\sigma_i}{n_i}, \quad i = 1, \dots, k \qquad (3)$$
$$\text{s.t. } \sigma_i \le \sigma_0, \; n_i \ge n_0, \; \sigma_0 \text{ and } n_0 = \text{const\_value}$$

where $k$ is the current window number and $\sigma_0$ and $n_0$ are constant values. Since the background is considered stable and emerges often in the video sequence, we assume that the background will definitely appear in the preceding part of the video, in the latter part, or in both.
A sliding window searching method is proposed here, but the window is moved by a binary tree searching algorithm in a ‘jumping’ mode rather than the ‘sliding’ mode of [13]. In [15], some seed points are first chosen, and sliding windows centered on those seed points are then constructed. Our proposed method is simpler and more intuitive, as shown in figure 4.

Figure 4. Binary tree searching (a range of $2^N$ frames is repeatedly split into left and right halves)

The searching mode can be described with the following pseudo code:

Step 1. Choose $2^N$ frames of the surveillance video for background generation.
Step 2. Divide the original search range into a left and a right half, each of length $2^{N-1}$; set scale = N-1.
Step 3. If scale is small enough (scale < 2), go to step 5; otherwise go to step 4.
Step 4. Compute $T_k = \sigma_k / n_k$ for $k \in \{left, right\}$. If $T_{left} < T_{right}$, the left frames become the total search range; otherwise the right frames become the total search range. Set scale = scale - 1 and go to step 3.
Step 5. Calculate the average value in the current sliding window.
Step 6. Stop background generation.

After all background pixels have been searched, a median filter is used to remove points that should belong to the background but appear as foreground. Some background pixel values can be estimated from only a few key frames to save calculation; for example, the pixels in the top-left corner of figure 5 are stable throughout the video sequence. Figure 5 shows the results of background generation. Figures 5(a) and 5(b) are two examples of the results from the above background generation algorithm. Each example shows three frames of a video, with the generated background at the bottom right.

(a) Background generation: example 1
(b) Background generation: example 2
Figure 5. Examples of background generation

2.2 Foreground extraction

Foreground can be extracted by comparing the background and the current frame.
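The two stages described so far can be sketched for a single pixel position: the halving search of section 2.1 keeps the calmer half of the frame range until a short window remains, and the comparison above then flags samples that differ from the estimated background. This is a minimal sketch under stated assumptions, not the exact procedure. The names `background_value`, `is_foreground`, `min_len` and `noise_tol` are hypothetical, the spread of each half stands in for the statistic compared in step 4 (the halves have equal length), and the tolerance of 10 gray levels is an arbitrary stand-in for the noise range.

```python
from statistics import fmean, pstdev


def background_value(samples, min_len=4):
    """Estimate one pixel's background value by repeatedly keeping the calmer half.

    samples: intensity of one pixel over a power-of-two number of frames.
    min_len: window length at which the search stops (assumed value).
    """
    window = list(samples)
    while len(window) > min_len:
        half = len(window) // 2
        left, right = window[:half], window[half:]
        # Assumption: the half with the smaller spread is the more stable one.
        window = left if pstdev(left) < pstdev(right) else right
    # Step 5 of the pseudo code: average over the final window.
    return fmean(window)


def is_foreground(value, background, noise_tol=10.0):
    """Flag a sample whose distance to the background exceeds the noise tolerance."""
    return abs(value - background) > noise_tol


# One pixel over 16 frames; an object passes through in frames 9 to 11.
trace = [100, 101, 99, 100, 100, 101, 99, 100,
         180, 182, 181, 100, 100, 99, 101, 100]
bg = background_value(trace)
print(bg, is_foreground(181, bg), is_foreground(101, bg))
```

In this trace the half that contains the passing object is discarded in the first split, so the estimate settles on the stable level and only the object samples are flagged.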
Taking the noise in scene understanding into account, the foreground can be extracted with the following equation derived from equation (2),

$$V_{foreground} = V_{obsv} - V_{background} - N_{noise} \qquad (4)$$

Figure 6 shows some results of foreground extraction. The foreground objects are extracted except for a few points inside them. Those points have been classified as background because their intensity and color are almost equal to those of the background; for example, in the region marked with a blue circle, some points of the foreground objects are lost. This case could be improved with other information, such as the motion vectors within the same object or the same macroblock. Since motion saliency based on motion vectors is analyzed later in this paper, the current foreground objects are acceptable for stationary saliency analysis.

Figure 6. Foreground extraction.

3. Multi-feature perception model for saliency detection

3.1 Framework and algorithms

Figure 7. Framework for saliency detection

Figure 7 shows the framework for saliency detection in surveillance video. Based on the results of background generation and foreground extraction, stationary saliency is computed from multi-feature conspicuity, including faces and low-level features such as color, intensity and orientation; motion saliency is calculated from motion analysis and the effect of distance on visual perception. The algorithms are described in the following three sections:
(1) Stationary saliency model with face as a high-level feature and intensity, color and orientation as low-level features;
(2) Motion saliency model with motion vector field measurement and distance weights in a Gaussian model;
(3) Fusion method for the stationary saliency map and the motion saliency map.

3.2 Multi-feature stationary saliency

Low-level features such as intensity, color and orientation contribute much to our attention in Itti’s bottom-up attention framework. Every feature is analyzed using a Gaussian pyramid at multiple scales.
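As a minimal sketch of the multi-scale analysis, the following builds a Gaussian pyramid for one feature map. The 5-tap binomial kernel, the border replication and the factor-2 decimation are common choices assumed here; the text does not specify them, and the function names are hypothetical.

```python
# Assumed 5-tap binomial approximation of a Gaussian; the weights sum to 1.
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)


def blur_1d(values):
    """Smooth a 1D sequence with KERNEL, replicating the border samples."""
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)  # clamp index at the borders
            acc += weight * values[j]
        out.append(acc)
    return out


def blur(image):
    """Separable 2D blur: smooth every row, then every column."""
    rows = [blur_1d(row) for row in image]
    cols = [blur_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def gaussian_pyramid(image, levels):
    """Return [level 0, level 1, ...], each level blurred and halved in size."""
    pyramid = [image]
    for _ in range(levels - 1):
        smoothed = blur(pyramid[-1])
        pyramid.append([row[::2] for row in smoothed[::2]])
    return pyramid


feature_map = [[float((x + y) % 7) for x in range(16)] for y in range(16)]
for level in gaussian_pyramid(feature_map, 4):
    print(len(level), "x", len(level[0]))
```

Feature maps at different scales can then be compared with each other, which is how fine detail and coarse structure both enter the conspicuity maps.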
Seven feature maps are generated: one intensity map, four orientation maps (at 0, 45, 90 and 135 degrees) and two color-opponent maps (red/green and blue/yellow). After a normalization step, these feature maps are summed into three conspicuity maps: the intensity conspicuity map $C_i$, the color conspicuity map $C_c$ and the orientation conspicuity map $C_o$. Finally the conspicuity maps are combined to obtain the saliency map according to the following equation,

$$S_{Itti} = \frac{1}{3} \sum_{k \in \{i, c, o\}} C_k \qquad (5)$$

Faces attract more attention than other features in many images. Psychological tests have shown that faces, heads or hands can be perceived before any other details [16]. So faces can be used as a high-level feature for the saliency map. One drawback of Itti’s visual attention model is that its saliency map is not well adapted to images with faces. Several studies in face recognition have shown that skin hue features can be used to extract face information. To detect heads and hands in images, we use the face recognition and location algorithm used by Walther et al. [17]. This algorithm is based on a Gaussian model of the skin hue distribution in the (r’, g’) color space as an independent feature. For a given color pixel (r’, g’), the model’s hue response is defined by the following equations,

$$h(r', g') = \exp\left[-\frac{1}{2}\left(\frac{(r' - \mu_r)^2}{\sigma_r^2} + \frac{(g' - \mu_g)^2}{\sigma_g^2} - \frac{\rho\,(r' - \mu_r)(g' - \mu_g)}{\sigma_r \sigma_g}\right)\right] \qquad (6)$$

$$r' = \frac{r}{r + g + b} \quad \text{and} \quad g' = \frac{g}{r + g + b} \qquad (7)$$

where $(\mu_r, \mu_g)$ is the average of the skin hue distribution, $\sigma_r^2$ and $\sigma_g^2$ are the variances of the r’ and g’ components, and $\rho$ is the correlation between the components r’ and g’. These parameters were statistically estimated from 1153 photographs that contained faces. The function $h(r', g')$ can be considered as a color variability function around a given hue.
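The hue response can be sketched for a single RGB pixel as follows. This is a minimal illustration of the Gaussian skin hue model described above: the function name `skin_hue_response` is hypothetical, the parameter values are illustrative placeholders rather than the values fitted on the 1153 photographs, and returning 0 for a black pixel is an added convention.

```python
import math


def skin_hue_response(r, g, b, mu_r, mu_g, sigma_r, sigma_g, rho):
    """Gaussian skin hue response of one RGB pixel in the (r', g') color space."""
    total = r + g + b
    if total == 0:
        return 0.0  # assumption: a black pixel has no defined hue
    # Normalized chromaticity coordinates r' and g'.
    rp, gp = r / total, g / total
    dr, dg = rp - mu_r, gp - mu_g
    quad = (dr * dr) / sigma_r ** 2 \
        + (dg * dg) / sigma_g ** 2 \
        - rho * dr * dg / (sigma_r * sigma_g)
    return math.exp(-0.5 * quad)


# Placeholder parameters for illustration only, not the fitted values.
PARAMS = dict(mu_r=0.43, mu_g=0.30, sigma_r=0.05, sigma_g=0.02, rho=0.5)

print(skin_hue_response(43, 30, 27, **PARAMS))   # pixel at the model mean
print(skin_hue_response(0, 0, 255, **PARAMS))    # saturated blue pixel
```

A pixel whose chromaticity matches the mean of the distribution gets the maximum response of 1, and the response falls off quickly for hues far from skin tones.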
Then the stationary saliency based on multi-feature conspicuity can be described as follows,

$$S_S = f(S_{Itti}, S_{Face}) \qquad (8)$$

Here we choose a weighted-sum model as follows, where $C_F$ is the face conspicuity map,

$$S_S = \frac{1}{8}\left(2C_i + 2C_c + C_o + 3C_F\right) \qquad (9)$$

For most images containing faces, heads or hands, the model with skin hue detection gives better results than Itti’s model, i.e. more accurate saliency maps. The example given in this paper shows the difference between Itti’s model and the stationary model for face images. ‘I18’ is the original reference image from TID2008 [21], which includes a face, eyes and hands, as shown in figure 8(a). Figure 8(b) shows the saliency map from the mixed model and figure 8(c) shows the saliency map from Itti’s model. The result from the mixed model seems more reliable. Apart from some regions common to both maps, figure 8(b) should be closer to real perception, since we usually focus on the neighborhood of the eyes because we tend to observe, find and understand the facial expression.

(a) I18 image in TID2008 (b) The saliency map of the mixed model (with skin hue detection) (c) The saliency map from Itti’s model (without skin hue detection)
Figure 8. Saliency region from Itti’s model and mixed model.

3.3 Motion saliency map and Gaussian weights model

Motion must be involved in the video saliency map, as it plays a very important role in surveillance video. Through motion perception, we know what is happening in the current scene. Some regions or objects are also less salient in video although they might be salient in images; for example, the rich texture detail of an object in an image is lost in a video with fast motion.
In this paper, the motion information of a video is analyzed through its motion vector field, which can be calculated using motion estimation with more than one reference image. Here we use full search and block matching, which are normally used in video compression, to find the best motion vectors. The following shows an example of motion estimation and the motion vector field.

(a) #62 frame in video sequence (b) motion vectors of #62 frame
Figure 9. Motion vectors field

Here frame #62 is viewed as the current frame, and its previous frame #61 is shown in figure (a). There is apparent motion in the blue circle because of light flicker or other noise, although this region should be still. Fortunately, the effect of such pseudo motion can be removed with background extraction. Based on the motion vector field, the intensity of the motion vector and the spatial and temporal coherence of the motion are used to describe the motion saliency map [8]. The intensity of the motion vector is computed with the following equation,

$$I = \sqrt{mv_x^2 + mv_y^2} \qquad (10)$$

Besides the intensity, the phase $\theta$ of the motion vector is also analyzed.

$$\theta = \arctan\left(\frac{mv_y}{mv_x}\right) \qquad (11)$$

$\theta$ is distributed in [0, 360] after normalization. Within the motion vector field, the motion vectors of the blocks neighboring the current block C, such as blocks A and B, are also analyzed. The distribution probability density $p_i$ is computed from the histogram of $\theta$ values within the overlapped neighboring field of size k x k; 7x7 is used here. Figure 10 shows an example of the motion vectors in a neighborhood and the corresponding histogram of phase values.

(a) Motion vector field in 7th frame (b) Histogram of the neighborhood of 145th MB
Figure 10. Motion vector field and phase value histogram. In figure (a), the central blue rectangle marks the current MB and the red rectangle marks the neighborhood; the histogram uses 9 bins.

The spatial motion saliency $C_s$ is calculated as follows,

$$C_s = -\sum_{i=1}^{N} p_i \lg p_i \qquad (12)$$

where $N$ is the number of histogram bins of $\theta$ values in the k x k field. The temporal saliency map $C_t$ can be defined in the same way. Then the motion saliency map is computed with the following equation,

$$S_M = I \times C_t \times (1 - I \times C_s) \qquad (13)$$

where $I$ is the motion intensity given by the magnitude of the motion vector, $C_t$ is the temporal coherence inductor and $C_s$ is the spatial coherence inductor; both inductors are based on the statistics of the temporal and spatial phase histograms.

3.4 Fuse model of stationary and motion saliency map

The stationary saliency map $S_S$ and the motion saliency map $S_M$ can be fused to obtain the final saliency map of every frame in a video. Since we focus more easily on objects that appear near the center of the observation window than on those far from the center, we propose a distance-weighted fusion model as follows,

$$S_{V\_G} = S_M \times w_{i\_mb} \times (1 + S_S) \qquad (14)$$

$$w_{i\_mb} = e^{-d/8}, \quad d = (x_i - x_c)^2 + (y_i - y_c)^2, \quad x_c = \frac{width\_mb}{2}, \; y_c = \frac{height\_mb}{2} \qquad (15)$$

where width_mb and height_mb are the frame width and height measured in macroblocks (MB, 16x16 pixels), and $w_{i\_mb}$ is the block distance weight following the 2D Gaussian distribution shown in figure 11.

(a) block position (b) 2-D Gaussian distance model
Figure 11. Block distance weights model

Besides the above fusion method, some other fusion modes are also designed for comparison, including mean, max and pixel multiplication fusion, as follows,

$$S_{V\_mean}(x, y, k) = \frac{S_M + S_S}{2} \qquad (16)$$

$$S_{V\_max}(x, y, k) = \max(S_M, S_S) \qquad (17)$$

$$S_{V\_multip}(x, y, k) = S_M \times S_S \qquad (18)$$

4. Experiments

In this section, several surveillance video sequences that we recorded ourselves are used in the subjective experiments and with our visual perception model. The goal of video saliency detection is for the video saliency map from our model to mimic the gaze map derived from the subjective experiments.
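Because the experiments compare several ways of fusing the stationary and motion maps, it may help to see the fusion modes of section 3.4 side by side. The sketch below works on one block value at a time. It assumes that equation (14) reads $S_M \cdot w_{i\_mb} \cdot (1 + S_S)$ and that $d$ in equation (15) is the squared distance of block $(x_i, y_i)$ from the frame center in macroblock units; the function names and mode strings are hypothetical.

```python
import math


def block_weight(xi, yi, width_mb, height_mb):
    """Distance weight of macroblock (xi, yi); equals 1 at the frame center."""
    xc, yc = width_mb / 2, height_mb / 2
    # Assumption: d is the squared block distance, which makes the weight Gaussian.
    d = (xi - xc) ** 2 + (yi - yc) ** 2
    return math.exp(-d / 8)


def fuse(s_m, s_s, mode, weight=1.0):
    """Fuse a motion saliency value s_m with a stationary saliency value s_s."""
    if mode == "mean":
        return (s_m + s_s) / 2
    if mode == "max":
        return max(s_m, s_s)
    if mode == "multip":
        return s_m * s_s
    if mode == "gaussian":
        # Assumed reading of the distance-weighted model, equation (14).
        return s_m * weight * (1 + s_s)
    raise ValueError(f"unknown fusion mode: {mode}")


# A 10 x 8 macroblock frame: compare a central block with an off-center one.
for xi, yi in [(5, 4), (9, 4)]:
    w = block_weight(xi, yi, width_mb=10, height_mb=8)
    print((xi, yi), [fuse(0.5, 0.8, m, w) for m in ("mean", "max", "multip", "gaussian")])
```

Only the Gaussian mode depends on the block position, so the same pair of saliency values contributes less when it lies far from the frame center.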
Twenty observers took part in the experiments with an SMI remote eye tracker [18]. First, the fixations corresponding to each image are recorded with the eye tracker and its software suite. Then the fixations are normalized and Gaussian filtered to obtain a gaze map. The final gaze map is obtained by taking the mean of the 20 gaze maps of the observers. An example of a gaze map obtained from an original image is given in figure 12. Figure (a) is the original image, figure (b) is the corresponding gaze map from the subjective experiments and figure (c) is the mixed image with the gaze map superimposed on the original image.

(a) Original image (b) Gaze map (c) Mixed image of gaze map and image
Figure 12. Original image, gaze map and mixed image with gaze map superimposed over original image

Figure 13 shows two frames of a surveillance video and the corresponding mixed images with gaze maps and saliency maps. Figures (a) and (d) are two original frames of the video; (b) and (e) are the mixed images of the gaze map and the original image; (c) and (f) are the mixed images with the saliency map superimposed on the original image.

(a) Original image (b) Gaze map (c) Saliency map (d) Original image (e) Gaze map (f) Saliency map
Figure 13. Frame image gaze map and saliency map

Besides subjective comparison, various criteria can be used for the comparison, such as distance measures and ROC [19]. Here NSS (Normalized Scanpath Saliency) is used to estimate the relationship between the gaze map and the saliency map, as in [7, 20]. For example, NSS can be used to compare the gaze map and the mean saliency map of equation (16) as follows,

$$NSS(k) = \frac{\overline{S_{V\_m\_G}(x, y, k)} - \overline{S_{V\_m}(x, y, k)}}{\sigma_{S_{V\_m}(x, y, k)}} \qquad (19)$$

$$S_{V\_m\_G}(x, y, k) = G_V(x, y, k) \times S_{V\_m}(x, y, k) \qquad (20)$$

where $G_V(x, y, k)$ is the human eye gaze map normalized to unit mean, $S_{V\_m}(x, y, k)$ is the saliency map from the detection model and $\sigma_{S_{V\_m}(x, y, k)}$ is its standard deviation.
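Equations (19) and (20) can be sketched for one frame as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function name `nss` is hypothetical, maps are plain nested lists, the population standard deviation is assumed, and returning 0 for an empty gaze map or a constant saliency map is an added convention.

```python
from statistics import fmean, pstdev


def nss(gaze_map, saliency_map):
    """Normalized Scanpath Saliency of one frame (maps given as lists of rows)."""
    gaze = [v for row in gaze_map for v in row]
    sal = [v for row in saliency_map for v in row]
    gaze_mean = fmean(gaze)
    sigma = pstdev(sal)  # assumption: population standard deviation
    if gaze_mean == 0 or sigma == 0:
        return 0.0  # assumption: score is undefined here, report no link
    # Equation (20): saliency weighted by the gaze map normalized to unit mean.
    weighted_mean = fmean(g / gaze_mean * s for g, s in zip(gaze, sal))
    # Equation (19): distance from the frame mean in units of standard deviation.
    return (weighted_mean - fmean(sal)) / sigma


saliency = [[0.0, 0.0], [0.0, 1.0]]
print(nss([[0, 0], [0, 1]], saliency))  # gaze on the salient pixel
print(nss([[1, 0], [0, 0]], saliency))  # gaze on a non-salient pixel
print(nss([[1, 1], [1, 1]], saliency))  # gaze spread uniformly
```

Normalizing the gaze map to unit mean makes the weighted mean directly comparable with the plain mean of the saliency map, so the sign of the score shows on which side of the frame average the fixated saliency lies.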
NSS is a standard score criterion that expresses the divergence of the experimental result from the mean saliency map as a number of standard deviations of the model. The larger the score, the less probable it is that the experimental result is due to chance. $\overline{S_{V\_m\_G}(x, y, k)}$ can be viewed as the mean of the saliency values at the eye positions, and $\overline{S_{V\_m}(x, y, k)}$ can be viewed as the mean of the saliency values over the whole frame. If NSS is positive, the eye positions tend to be on salient regions; if NSS is negative, the eye positions tend to be on non-salient regions; if NSS is zero, there is no link between eye position and saliency. In [22], the subjective gaze map is called the real eye movement gaze map, and another, randomized eye movement gaze map is also introduced for comparison. The randomized gaze map associates with a frame of the current video the eye movements of subjects recorded while they were looking at another video clip. Since NSS is used to compare the saliency map from our model with the gaze map, if our model predicts the eye movements well, the NSS between the real eye movements and the saliency map should be high and, at the same time, the NSS between the randomized eye movements and the saliency map should be low. Table 1 gives NSS values for real and randomized eye movements. Here we consider four saliency maps derived from different weighted fusion methods for the stationary saliency map and the motion saliency map.

Table 1. Gaze map and saliency map comparison

| Criteria | S_mean | S_max | S_multip | S_Gaussian |
|---|---|---|---|---|
| NSS on real gaze map | 0.367 | 0.312 | 0.150 | 1.066 |
| NSS on randomized gaze map | 0.020 | 0.032 | -0.15 | 0.195 |

We also compare our result with other saliency detection algorithms, including Itti’s model, frequency-tuned saliency detection and phase spectrum saliency, as shown in table 2.

Table 2. Comparison among saliency models

| Criteria | IT | FT (Frequency Tune) | Phase Spectrum | Proposed |
|---|---|---|---|---|
| NSS on real gaze map | 0.123 | 0.160 | 0.002 | 1.066 |
| NSS on randomized gaze map | 0.136 | 0.189 | -0.044 | 0.195 |

According to the above data, the NSS on the real gaze map obtained with our proposed model is much higher than that of the other methods (Itti, frequency-tuned and phase spectrum), and the gap between its NSS on the real gaze map and on the randomized gaze map is far larger than for the other methods. Among the fusion modes, Gaussian distance weighting gives the best results. The reason is that Itti’s model, frequency-tuned saliency detection and phase spectrum saliency detection compute stationary saliency without motion information, whereas our proposed saliency perception model includes motion information together with distance weights. The results show that our proposed method can generate results close to the subjective gaze map. The results also show that motion is very important for visual perception in video and should be fully used, and that saliency perception in video differs greatly from image saliency perception because of motion.

5. Conclusion

In this paper, a new visual saliency detection algorithm oriented to surveillance video is proposed. Using scene understanding in surveillance video, background generation and foreground object extraction are analyzed, and then multiple features, including a high-level feature (face) and low-level features (color, orientation and intensity), are used to construct the stationary feature conspicuity. The motion saliency map is based on motion vector analysis, and the motion saliency map and the stationary saliency map are fused with Gaussian distance weights. When the saliency map is compared with the gaze maps of surveillance videos from subjective experiments, the output of the multi-feature video saliency detection model is close to the gaze map.
Here we mainly consider surveillance video with a stable background. Next we will focus on more complicated scenes, such as those in which both the background and the foreground objects are moving, where a more refined algorithm is necessary to obtain suitable foreground objects for saliency analysis. The current binary tree search for background pixels might also involve too much calculation, so neighborhood information and multi-scale techniques will be studied for optimization.

References

[1] L. Itti, C. Koch and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis”, IEEE Trans. PAMI, vol. 20, No. 11, pp. 1254-1259, Nov. 1998.
[2] U. Rajashekar, I. van der Linde, A. C. Bovik and L. K. Cormack, “GAFFE: A gaze-attentive fixation finding engine”, IEEE Trans. Image Processing, vol. 17, No. 4, pp. 564-573, 2008.
[3] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada and Sabine Susstrunk, “Frequency-tuned salient region detection”, CVPR 2009.
[4] Qi Ma and Liming Zhang, “Saliency-Based Image Quality Assessment Criterion”, ICIC 2008, LNCS 5226, pp. 1124–1133, 2008.
[5] L.-J. Li and L. Fei-Fei, “What, where and who? Classifying event by scene and object recognition”, IEEE Intern. Conf. in Computer Vision (ICCV), 2007.
[6] Brian Scassellati, “Theory of Mind for a Humanoid Robot”, Autonomous Robots, vol. 12, No. 1, pp. 13-24, 2002.
[7] S. Marat and T. Ho Phuoc, “Spatio-temporal saliency model to predict eye movements in video free viewing”, 16th European Signal Processing Conference (EUSIPCO-2008), Lausanne, Switzerland, 2008.
[8] Yu-Fei Ma and Hong-Jiang Zhang, “A model of motion attention for video skimming”, ICIP 2002.
[9] Shan Li and M. C. Lee, “Fast Visual Tracking using Motion Saliency in Video”, ICASSP, vol. 1, pp. 1073-1076, 2007.
[10] L.-J. Li, R. Socher and L. Fei-Fei, “Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework”, Computer Vision and Pattern Recognition (CVPR), 2009.
[11] Yazhou Liu, Hongxun Yao, Wen Gao, Xilin Chen and Debin Zhao, “Nonparametric background generation”, Journal of Visual Communication and Image Representation, vol. 18, pp. 253-263, 2007.
[12] Desire Sidibe and Olivier Strauss, “A fast and automatic background generation method from a video based on QCH”, Journal of Visual Communication and Image Representation, April 2009.
[13] Hanzi Wang and David Suter, “A novel robust statistical method for background initialization and visual surveillance”, ACCV 2006, LNCS 3851, pp. 328-337, 2006.
[14] M. Pinson and S. Wolf, “Comparing subjective video quality testing methodologies”, SPIE Video Communications and Image Processing Conference, Lugano, Switzerland, Jul. 8-11, 2003.
[15] Alan J. Lipton, Niles Haering, Mark C. Almen, Peter L. Venetianer, Thomas E. Slowe and Zhong Zhang, “Video scene background maintenance using statistical pixel modeling”, United States Patent Application Publication, Pub. No. US 2004/0126014 A1, Jul. 1, 2004.
[16] R. Desimone, T. D. Albright, C. G. Gross and C. Bruce, “Stimulus-selective properties of inferior temporal neurons in the macaque”, Journal of Neuroscience, vol. 4, pp. 2051-2062, 1984.
[17] D. Walther and C. Koch, “Modeling Attention to Salient Proto-objects”, Neural Networks, vol. 19, pp. 1395–1407, 2006.
[18] Puneet Sharma, “Perceptual image difference metrics-saliency maps & eye tracking”, Jan. 20, 2008.
[19] B. W. Tatler, R. J. Baddeley and I. D. Gilchrist, “Visual correlates of fixation selection: effects of scale and time”, Vision Research, vol. 45, pp. 643-659, 2005.
[20] R. J. Peters, A. Iyer, L. Itti and C. Koch, “Components of bottom-up gaze allocation in natural images”, Vision Research, vol. 45, pp. 2397-2416, 2005.
[21] TID2008 page: http://www.ponomarenko.info/tid2008.htm.
[22] S. Marat, T. Ho Phuoc, L. Granjon and N. Guyader, “Modelling spatio-temporal saliency to predict gaze direction for short videos”, International Journal of Computer Vision, vol. 82, No. 3, pp. 231-243, 2009.