Perceptual image distortion metric based on a statistically-derived divisive normalization model

Roberto Valerio (1), Rafael Navarro (1) and Bart M. ter Haar Romeny (2)

(1) Instituto de Óptica “Daza de Valdés” - CSIC, Madrid, Spain, 28006
{r.valerio, r.navarro}@io.cfmac.csic.es
(2) Department of Biomedical Engineering - Eindhoven University of Technology, Eindhoven, The Netherlands, 5600 MB
B.M.terHaarRomeny@tue.nl

Abstract. We present a perceptual image distortion metric based on recent models of primate primary visual cortex (V1). The perceptual metric is similar to that proposed by Teo and Heeger (1994) and includes a linear filtering stage followed by a gain control mechanism, known as “divisive normalization”, that explains some of the non-linear behaviour of V1 neurons. The main difference is that in our case, following the latest V1 models, the divisive normalization is more general (it considers neighbouring responses not only in orientation but also in position and scale) and is also adapted to natural image statistics. In particular, the parameters of the divisive normalization are fixed using a novel statistically-derived model of minimum noticeable distortions in squared linear coefficients of natural images. The results show that the proposed metric fits empirical data obtained from contrast masking experiments very well (as well as the metric by Teo and Heeger, 1994).

Keywords: Perceptual quality metrics, non-linear models of V1 neurons, divisive normalization, natural image statistics.

1. Introduction

In many different image processing applications the properties of the human visual system (HVS) can be exploited to improve the performance from a visual quality point of view. The quality improvement that can be achieved using an HVS-based approach is significant and applies to a broad range of applications.
In the last three decades, a great deal of effort has gone into the development of quality assessment methods that take advantage of known characteristics of the HVS. Reviews on perceptual image quality assessment algorithms can be found in Eckert and Brandley (1998) and Pappas and Safranek (2000). The common element in these algorithms is always a computational model of human vision. In recent years, various authors have shown that the non-linear behaviour of V1 neurons in primate visual cortex can be modelled by including a gain control stage, known as “divisive normalization” (Bonds, 1989; Geisler and Albrecht, 1992; Heeger, 1992; Carandini, Heeger and Movshon, 1997), after a linear filtering step. Divisive normalization not only can be used to describe the non-linear response properties of neurons in visual cortex, but also yields image descriptors more relevant from a perceptual point of view (Foley, 1994; Teo and Heeger, 1994). Very recently, Simoncelli and coworkers (Simoncelli and Schwartz, 1999; Schwartz and Simoncelli, 2001; Wainwright et al., 2002) presented a statistically-derived divisive normalization model. In addition to its utility to characterize the non-linear response properties of neurons in sensory systems, and thus to demonstrate that early neural processing is well matched to the statistical properties of the stimuli, they showed empirically that the statistical normalization model strongly reduces pairwise statistical dependences between responses. In this paper, we present a perceptual image distortion metric similar to that proposed by Teo and Heeger (1994). 
Our perceptual metric is based on a statistically-derived divisive normalization model of V1 neuron responses and has two main differences with respect to that by Teo and Heeger (1994): first, the divisive normalization considers neighbouring responses not only in orientation but also in position and scale; second, the parameters of the divisive normalization are adapted to natural image statistics (through a novel statistically-derived model of minimum noticeable distortions in squared linear coefficients of natural images) instead of being fixed exclusively to fit psychophysical data. We show that the proposed metric fits psychophysical data describing masking of a Gabor function by sinusoidal gratings very well (as well as the metric by Teo and Heeger, 1994). The fit is much better than that of other simpler perceptual metrics.

2. Perceptual metric

The perceptual metric we propose here consists of the following three stages:

2.1. Linear stage

The linear stage is an approximately orthogonal four-level linear decomposition based on symmetric quadrature mirror filters (QMF) with 9 coefficients (Simoncelli and Adelson, 1990), which are closely related to wavelets (essentially, they are approximate wavelet filters). The basis functions of this linear transform are localized in space, orientation and spatial frequency. This gives rise to 12 subbands (horizontal, vertical and diagonal for each of the 4 scales considered here) plus an additional low-pass channel. Multiscale linear transforms like this are very popular for image representation.

2.2. Non-linear stage

The non-linear stage consists basically of a divisive normalization, in which the responses of the previous linear filtering stage, $c_i$, are squared and then divided by a weighted sum of squared neighbouring responses in space, orientation and scale, $\{c_j^2\}$, plus a constant, $d_i^2$ (Simoncelli and Schwartz, 1999):

$$r_i = \frac{c_i^2}{d_i^2 + \sum_{j \neq i} e_{ij} c_j^2} \tag{1}$$

If we arbitrarily set to one the threshold at which distortion is visible, then the minimum noticeable distortion $\Delta c_i^2$ is $d_i^2 + \sum_{j \neq i} e_{ij} c_j^2$ (with the neighbours held fixed, a change $\Delta c_i^2$ in $c_i^2$ changes $r_i$ by $\Delta c_i^2$ divided by the denominator of Eq. 1). Hence, the parameters of the divisive normalization (constant $d_i^2$ and weights $\{e_{ij}\}$) can be obtained from a model of minimum noticeable distortions $\Delta c_i^2$. We propose the following statistically-derived model for $\Delta c_i^2$ given the neighbouring coefficients $\{c_j^2\}$:

$$\Delta c_i^2 = \begin{cases} \sigma_{c_i}^2 \, z^2\!\left(\dfrac{p+1}{2}\right) - c_i^2 & \text{if } c_i^2 \le \Delta c_i^2 \\[2ex] y - c_i^2, \quad y = \sigma_{c_i}^2 \, z^2\!\left(\Phi\!\left(\dfrac{\sqrt{2 c_i^2 - y}}{\sigma_{c_i}}\right) + \dfrac{p}{2}\right) & \text{if } c_i^2 > \Delta c_i^2 \end{cases} \tag{2}$$

where $\Phi(\cdot)$ is the Gaussian cumulative distribution function with unity standard deviation and $z(\cdot)$ is its inverse function. Figure 1 shows a plot of this model.

[Figure 1. Minimum noticeable distortion model with $p = 0.5$ and $\sigma_{c_i}^2 = 1$. Horizontal axis: $c_i^2$ (0 to 4); vertical axis: $\Delta c_i^2$ (0 to 4).]

Eq. 2 gives for each value of $c_i^2$ the value $\Delta c_i^2$ that yields an error probability $p$ (that is, the probability that the random variable $c_i^2$ is in the interval $[c_i^2 - \Delta c_i^2,\; c_i^2 + \Delta c_i^2]$), assuming that the conditional probability $p(c_i \mid \{c_j^2\})$ is Gaussian with zero mean and variance $\sigma_{c_i}^2$. If we fix $p$ to 0.5, then the mean of $\Delta c_i^2$ over $c_i^2$ is $\langle \Delta c_i^2 \rangle \approx \sigma_{c_i}^2$. This means, according to what we discussed above, that a good choice for the parameter values of the divisive normalization (constant $d_i^2$ and weights $\{e_{ij}\}$) is the one that makes $d_i^2 + \sum_{j \neq i} e_{ij} c_j^2$ a good estimator of $\sigma_{c_i}^2$. It is important to note that other authors (Schwartz and Simoncelli, 2001; Wainwright et al., 2002) have used this choice of parameters ad hoc.

2.3. Error pooling

The final stage computes a Minkowski sum with exponent 2 of the differences $\Delta r_i$ (multiplied by constants $k_i$ that adjust the overall gain) between the non-linear outputs from the reference image and the non-linear outputs from the distorted image:

$$\Delta r = \sum_i k_i^2 \, \Delta r_i^2 \tag{3}$$

3. Results

To test the perceptual metric we have used empirical data from Teo and Heeger (1994) obtained from contrast masking experiments conducted by Foley and Boynton (1994). The task in these experiments is to detect a target pattern superimposed on a masker pattern. The maskers are 2 cycles per degree (cpd) sinusoidal gratings of several orientations (0, 11.25, 22.5, 45 and 90 degrees). The target is a vertically oriented 2 cpd Gabor patch with vertical and horizontal 1/e half-widths of 0.5 degrees. The target and the masker are presented simultaneously and viewed from a distance of 162 cm. We created the corresponding digital images using the program Discrim by Landy (2003).

To fix the parameters of the divisive normalization we used a “training set” of six black-and-white natural images of 512x512 pixels (“Boats”, “Elaine”, “Goldhill”, “Lena”, “Peppers” and “Sailboat”). We considered a 12-coefficient neighbourhood $\{c_j^2\}$ of squared coefficients adjacent to $c_i$ along the four dimensions (8 in a square box in the 2D space, 2 in orientation and 2 in scale), and we used maximum-likelihood (ML) estimation independently for each subband of the QMF pyramid. On the other hand, the gains $k_i$ (one constant for each subband of the QMF pyramid) in the error pooling stage were determined by fitting the metric outputs to the psychophysical data.

The left panel of Figure 2 shows the results for a masker orientation of 0 degrees. As we can see, the fit to the data is extremely good. Our metric yields much better results than simple perceptual metrics, such as the “single filter, uniform masking” (SFUM) model by Ahumada (1996) (see the right panel of Figure 2).
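The three formulas of Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the function names (`min_noticeable_distortion`, `normalize`, `pooled_distortion`) are hypothetical; coefficients are passed as flat lists rather than as a QMF pyramid; each neighbourhood is given as a list of indices; the minimum noticeable distortion is found by bisection directly from the probability statement behind Eq. 2 rather than from any closed form; the pooling of Eq. 3 is taken as a plain sum of squared weighted differences with no final square root; and all numerical values are toy values.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit standard deviation


def min_noticeable_distortion(c2, sigma2, p=0.5, iterations=200):
    """Find delta with P(c2 - delta <= X**2 <= c2 + delta) = p, X ~ N(0, sigma2).

    Bisection is an assumption of this sketch, used instead of a closed form.
    """
    sigma = sqrt(sigma2)

    def interval_probability(delta):
        upper = STD_NORMAL.cdf(sqrt(c2 + delta) / sigma)
        lower = STD_NORMAL.cdf(sqrt(max(c2 - delta, 0.0)) / sigma)
        return 2.0 * (upper - lower)  # factor 2: X**2 collects both signs of X

    lo = 0.0
    # Upper bracket: at this delta the interval probability is at least p.
    hi = c2 + sigma2 * STD_NORMAL.inv_cdf((1.0 + p) / 2.0) ** 2
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if interval_probability(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def normalize(c, d2, e, neighbours):
    """Eq. 1: r_i = c_i^2 / (d2_i + sum_j e_ij * c_j^2), j over neighbours[i]."""
    r = []
    for i, ci in enumerate(c):
        pool = sum(e[i][n] * c[j] ** 2 for n, j in enumerate(neighbours[i]))
        r.append(ci ** 2 / (d2[i] + pool))
    return r


def pooled_distortion(r_ref, r_dist, k):
    """Eq. 3 (as assumed here): sum over i of (k_i * (r_ref_i - r_dist_i))^2."""
    return sum((ki * (a - b)) ** 2 for ki, a, b in zip(k, r_ref, r_dist))


# Toy setting: three coefficients, each normalized by the other two.
NEIGHBOURS = [[1, 2], [0, 2], [0, 1]]
WEIGHTS = [[0.5, 0.5] for _ in NEIGHBOURS]  # hypothetical e_ij
CONSTANTS = [1.0, 1.0, 1.0]                 # hypothetical d_i^2
GAINS = [1.0, 1.0, 1.0]                     # hypothetical k_i

if __name__ == "__main__":
    reference = [1.0, 2.0, 0.5]
    distorted = [1.1, 2.0, 0.5]
    r_ref = normalize(reference, CONSTANTS, WEIGHTS, NEIGHBOURS)
    r_dist = normalize(distorted, CONSTANTS, WEIGHTS, NEIGHBOURS)
    print("pooled distortion:", pooled_distortion(r_ref, r_dist, GAINS))
    print("threshold at c2 = 0:", min_noticeable_distortion(0.0, 1.0))
```

In this toy setting, increasing the first squared coefficient by exactly the denominator of Eq. 1 raises its normalized response by one, which is the visibility threshold convention used in Section 2.2.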
[Figure 2: two panels, each titled “0-degree masker”. Horizontal axis: masker contrast (dB); vertical axis: target threshold contrast (dB).]

Figure 2. Results of fitting our metric (left) and the SFUM model (right) to empirical data. Empirical data are denoted by circles. Solid curves denote predicted target threshold contrasts.

While the fits to the 0-degree and 11.25-degree data are impressive, the fits to the other curves are not as good. This is caused by the relatively broad orientation bandwidth of the QMFs (see Teo and Heeger, 1994).

One important characteristic of the contrast masking data is the presence (or absence) of a “dipper” that indicates that, within that range of masker contrasts, the masker facilitates the detection of the target. The particular nonlinearity used in our metric (Eq. 1) allows the dipper to be fitted quite well, which is not the case for other non-linear functions (for example, if we simply take the square root in Eq. 1).

4. Summary and conclusions

We have presented a perceptual image distortion metric based on a statistically-derived divisive normalization model of V1 neuron responses. Parameters of the divisive normalization have been determined from natural image statistics using a novel statistically-derived model of minimum noticeable distortions in squared linear coefficients of natural images. The resulting statistical way of fixing the divisive normalization parameters is in complete agreement with the accepted hypothesis that sensory systems are adapted to the signals to which they are exposed, and has also been used ad hoc in the literature. An important difference with other similar schemes is that the neighbourhood considered in the divisive normalization contains image linear coefficients belonging to different positions, orientations and scales. This permits the model to implement many intraband and interband masking mechanisms.
Finally, the results show that the perceptual metric fits psychophysical data from classical contrast masking experiments very well, and we expect even better results in more realistic experiments with natural stimuli.

Acknowledgements

This research was supported by the Spanish Commission for Research and Technology (CICYT) under grant DPI2002-04370-C02-02. Roberto Valerio was supported by a Madrid Education Council and Social European Fund Scholarship for Training of Research Personnel, and by a City Hall of Madrid Scholarship for Researchers and Artists in the Residencia de Estudiantes.

References

Ahumada, A. J., Jr. 1996. Simplified vision models for image quality assessment. In SID International Symposium Digest of Technical Papers. Ed. Morreale, J. Society for Information Display, 27: 397-400.

Bonds, A. B. 1989. Role of inhibition in the specification of orientation selectivity of cells in the cat striate cortex. Visual Neuroscience, 2: 41-55.

Carandini, M., Heeger, D. J. and Movshon, J. A. 1997. Linearity and normalization in simple cells of the macaque primary visual cortex. J. Neuroscience, 17: 8621-8644.

Eckert, M. P. and Brandley, A. P. 1998. Perceptual quality metrics applied to still image compression. Signal Processing, 70: 177-200.

Foley, J. M. 1994. Human luminance pattern mechanisms: masking experiments require a new model. Journal of the Optical Society of America A, 11: 1710-1719.

Foley, J. M. and Boynton, G. M. 1994. A new model of human luminance pattern vision mechanisms: Analysis of the effects of pattern orientation, spatial phase, and temporal frequency. In Computational Vision Based on Neurobiology. Ed. Lawton, T. A. SPIE Proceedings, 2054.

Geisler, W. S. and Albrecht, D. G. 1992. Cortical neurons: Isolation of contrast gain control. Vision Research, 8: 1409-1410.

Heeger, D. J. 1992. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9: 181-198.

Landy, M. S. 2003. A tool for determining image discriminability, http://www.cns.nyu.edu/~msl/discrim/ /discrimpaper.pdf.

Pappas, T. N. and Safranek, R. J. 2000. Perceptual criteria for image quality evaluation. In Handbook of Image and Video Proc. Ed. Bovik, A. Academic Press.

Schwartz, O. and Simoncelli, E. P. 2001. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8): 819-825.

Simoncelli, E. P. and Adelson, E. H. 1990. Subband image coding. Subband Transforms. Ed. Woods, J. W. Kluwer Academic Publishers. Chapter 4: 143-192.

Simoncelli, E. P. and Schwartz, O. 1999. Modeling surround suppression in V1 neurons with a statistically-derived normalization model. Advances in Neural Information Processing Systems, 11: 153-159.

Teo, P. C. and Heeger, D. J. 1994. Perceptual image distortion. Human Vision, Visual Processing, and Digital Display V. B. Eds. Rogowitz, B. E. and Allebach, J. P. Proc. SPIE, 2179: 127-141.

Wainwright, M. J., Schwartz, O. and Simoncelli, E. P. 2002. Natural image statistics and divisive normalization: modeling nonlinearities and adaptation in cortical neurons. Statistical Theories of the Brain. Eds. Rao, R., Olshausen, R., and Lewicki, M. MIT Press. Chapter 10: 203-222.