A perceptual quantization strategy for HEVC based on a convolutional neural network trained on natural images

Md Mushfiqul Alam*a, Tuan D. Nguyena, Martin T. Hagana, and Damon M. Chandlerb
aSchool of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK, USA; bDept. of Electrical and Electronic Engineering, Shizuoka University, Hamamatsu, Japan
*mdma@okstate.edu; phone: 1 718 200-8502; vision.okstate.edu/masking

ABSTRACT

Fast prediction models of local distortion visibility and local quality can potentially make modern spatiotemporally adaptive coding schemes feasible for real-time applications. In this paper, a fast convolutional-neural-network-based quantization strategy for HEVC is proposed. Local artifact visibility is predicted via a network trained on data derived from our improved contrast gain control model. The contrast gain control model was trained on our recent database of local distortion visibility in natural scenes [Alam et al. JOV 2014]. Furthermore, a structural facilitation model was proposed to capture the effects of recognizable structures on distortion visibility via the contrast gain control model. Our results provide, on average, an 11% improvement in compression efficiency for the spatial luma channel of HEVC while requiring almost one hundredth of the computational time of an equivalent gain control model. Our work opens the doors for similar techniques which may work for forthcoming compression standards.

Keywords: HEVC, distortion visibility, convolutional neural network, natural images, visual masking

1. INTRODUCTION

Most digital videos have undergone some form of lossy coding, a technology which has enabled the transmission and storage of vast amounts of video at increasingly higher resolutions. A key attribute of modern lossy coding is the ability to trade off the resulting bit rate against the resulting visual quality. The latest coding standard, high efficiency video coding (HEVC),1 has emerged as an effective successor to H.264, particularly for high-quality videos. In this high-quality regime, an often-sought goal is to code each spatiotemporal region with the minimum number of bits required to keep the resulting distortions imperceptible. Achieving such a goal, however, requires the ability to accurately and efficiently predict the local visibility of coding artifacts, a goal which currently remains elusive.

The most common technique of estimating distortion visibility is to employ a model of just noticeable distortion (JND);2–4 such JND-type models have been used to design improved quantization strategies for use in image coding.5–8 However, these approaches are based largely on findings from highly controlled psychophysical studies using unnatural stimuli. For example, Peterson et al.9 performed an experiment to determine visual sensitivity to DCT basis functions in the RGB and YCrCb color spaces, and these results have been used to develop distortion visibility models.5,10–12 However, the content of a natural video can itself induce visual masking, thus changing the visual sensitivity to the DCT basis functions.

Several researchers have also employed distortion visibility models specifically for HEVC and H.264. Naccari and Pereira proposed a just-noticeable-difference (JND) model for H.264 by considering visual masking in the luminance, frequency, pattern, and temporal domains.13 However, that coding scheme was based on Foley and Boynton's14 contrast gain-control model, which did not consider the gain control in V1 neurons.
In another study, Naccari et al. proposed a quantization scheme for HEVC to capture luminance masking based on measurements of pixel intensities.15 Blasi et al. proposed a quantization scheme for HEVC which took visual masking into account by selectively discarding frequency components of the residual signal depending on the input patterns;16 however, the frequency-channel discarding scheme was not based on psychophysical measurements. Other researchers, for example Yu et al.17 and Li et al.,18 have used saliency information, rather than visual masking, to achieve higher compression with HEVC.

Thus, one potential area for improved coding is to develop and employ a visibility model which is not only designed for naturalistic masks (natural videos) but is also able to adapt to individual video characteristics (see, e.g., Refs. 19–23, which aim toward this adaptive approach for images). Another area for improvement is runtime performance. Operating full distortion visibility models in a block-based fashion, so as to achieve local predictions, is often impractical due to prohibitive computational and memory demands. Thus, predicting local distortion visibility in a manner which is simultaneously more accurate and more efficient than existing models remains an open research challenge.

Here, we propose a quantization scheme for HEVC built on a fast model of local artifact visibility designed specifically for natural videos. The model uses a convolutional-neural-network (CNN) architecture to predict the local artifact visibility and quality of each spatiotemporal region of a video. We trained the CNN model using our recently published database of local masking and quality in natural scenes.24,25 When coupled with an HEVC encoder, our method provided 11% more compression than baseline HEVC when matched in visual quality, while requiring one hundredth of the computational time of an equivalent gain control model. Our work opens the doors for similar techniques which may work for forthcoming compression standards.

2. DISTORTION VISIBILITY PREDICTION MODEL-1: CONTRAST GAIN CONTROL WITH STRUCTURAL FACILITATION (CGC+SF)

This paper describes two separate models for predicting distortion visibility: (1) a model of masking which operates by simulating V1 neural responses with contrast gain control (CGC); and (2) a model which employs a convolutional neural network (CNN). The training data of the CNN model are derived from the CGC model.

Contrast masking26 has been widely used for predicting distortion visibility in images and videos.5,11,12,27 Among the many existing models of contrast masking, those which simulate the contrast gain-control response properties of V1 neurons are the most widely used.
Although several gain control models have been proposed in previous studies (e.g., Refs. 26, 28–31), in most cases the model parameters are selected based on results obtained using either unnatural masks31 or only a very limited number of natural images.20,22 Here, we have used the contrast masking model proposed by Watson and Solomon.31 This model is well-suited for use as a generic distortion visibility predictor because: (1) the model mimics V1 neurons and is therefore theoretically distortion-type-agnostic; (2) the model parameters can be tuned to fall within biologically plausible ranges; (3) the model can directly take images as inputs, rather than features of images; and (4) the model can operate in a spatially localized fashion.

We improved upon the CGC model in two ways. First, we optimized the CGC model by using our recently developed dataset of local masking in natural scenes24 (currently, the largest dataset of its kind). Second, we incorporated into the CGC model a structural facilitation model which better captures the reduced masking observed in structured regions.

2.1 Watson-Solomon31 contrast gain control (CGC) model

This subsection briefly describes the Watson and Solomon contrast gain-control model.31 Figure 1 shows the model flow. Both the mask (image) and the mask+target (distorted image) go through the same stages: a spatial filter to simulate the contrast sensitivity function (CSF), a log-Gabor decomposition, excitatory and inhibitory nonlinearities, inhibitory pooling, and division.31 After division, the subtracted responses of the mask and mask+target are pooled via Minkowski pooling.31 The target contrast is changed iteratively until the Minkowski-pooled response equals a pre-defined "at threshold" decision (d), at which point the contrast of the target is deemed the final threshold.

Figure 1. Flow of the Watson and Solomon contrast gain control model.31 For a description of the model, see Section 2.1.

The Watson and Solomon model is a model of V1 simple-cell responses (after "Divide" in Figure 1). Let z(x0, y0, f0, θ0) denote the initial linear modeled neural response (log-Gabor filter output) at location (x0, y0), center radial frequency f0, and orientation θ0. In the model, the response of a neuron (after division) tuned to these parameters is given by

r(x_0, y_0, f_0, \theta_0) = g \cdot \frac{z(x_0, y_0, f_0, \theta_0)^p}{b^q + \sum_{(x,y,f,\theta) \in IN} z(x, y, f, \theta)^q},    (1)

where g is the output gain,22 p provides the point-wise excitatory nonlinearity, q provides the point-wise inhibitory nonlinearity from the neurons in the inhibitory pool, IN indicates the set of neurons included in the inhibitory pool, and b represents the semi-saturation constant. Watson and Solomon31 optimized their model parameters using the results of Foley and Boynton's14 masking experiment, which used sine-wave gratings and Gabor patterns.
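To make Equation (1) concrete, the following Python sketch computes the divisive gain-control response for a single filter channel. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and default parameter values are our choices (anticipating the optimized values listed in Table 1), the inhibitory pool is restricted to a spatial Gaussian neighborhood via scipy's gaussian_filter, and pooling over neighboring frequency and orientation channels is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gain_control_response(z, g=0.1, p=2.4, q=2.35, b=0.035, pool_sigma=0.5):
    """Divisive contrast gain control (Eq. 1), simplified to spatial pooling only.

    z : 2-D array of linear (log-Gabor) filter response magnitudes for one
        frequency/orientation channel. A full implementation would also pool
        over neighboring frequency and orientation channels (the set IN).
    """
    excitation = np.abs(z) ** p                        # point-wise excitatory nonlinearity
    inhibition = gaussian_filter(np.abs(z) ** q,       # inhibitory pool over space
                                 sigma=pool_sigma)
    return g * excitation / (b ** q + inhibition)      # divisive normalization

# In the full model, the mask and mask+target responses both pass through this
# equation; their difference is Minkowski-pooled and compared against the
# "at threshold" decision d while the target contrast is adjusted iteratively.
```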
However, as mentioned previously, we have recently developed a large dataset of local masking in natural scenes.24 Here, this dataset was used with a brute-force search to determine the optimum model parameters within biologically plausible ranges.22,32 The parameters of the optimized gain control model are given in Table 1.

Table 1. Parameters of the optimized gain control model. Parameters marked with (*) were varied; values of the unmarked parameters were taken from Watson and Solomon's31 work and Chandler et al.'s22 work. Detailed descriptions of the functions can be found in our previous work.24

Contrast sensitivity filter33,34
  fCSF        Peak frequency                                   6.0 c/deg
  BWCSF       log10 bandwidth                                  1.43
Log-Gabor decomposition
  nf*         Number of frequency bands                        6
  BWf*        Frequency channel bandwidth                      2.75 octaves
  f0G*        Center radial frequencies of the bands           0.3, 0.61, 1.35, 3.22, 7.83, 16.1 c/deg
  nθ          Number of orientation channels                   6
  BWθ         Orientation channel bandwidth                    30°
  θ0G         Center angles of orientation channels            0°, ±30°, ±60°, 90°
Divisive gain control
  p, q*       Excitatory and inhibitory exponents              2.4, 2.35
  b           Semi-saturation constant                         0.035
  g*          Output gain                                      0.1
  sx,y        Inhibitory pool kernel: space                    3 × 3 Gaussian, zero mean, 0.5 standard deviation24
  sf          Inhibitory pool kernel: frequency                ±0.7 octave bandwidth with equal weights24
  sθ          Inhibitory pool kernel: orientation              ±60° bandwidth with equal weights24
  βr, βf, βθ  Minkowski exponents: space, frequency, orientation   2.0, 1.5, 1.5
  d           "At threshold" decision                          1.0

2.2 Structural facilitation (SF) model

With the abovementioned parameter optimizations, the Watson and Solomon gain control model performed fairly well in predicting the detection thresholds from our local masking database; a Pearson correlation coefficient of 0.83 was observed between the ground-truth thresholds and the model predictions. However, the model consistently overestimated the thresholds for (i.e., underestimated the visibility of) distortions in regions containing recognizable structures. For example, Figure 2 shows how the optimized Watson and Solomon model overestimates the thresholds near the top of the gecko's body and in the child's face.

Figure 2. The optimized Watson and Solomon model overestimated detection thresholds for distortion in image regions containing recognizable structures (e.g., near the top of the gecko's body and in the child's face), shown for the geckos and child_swimming images. The distortion visibility maps show the contrast detection thresholds at each spatial location of the images in terms of RMS contrast;35 the colorbars denote the ranges of distortion visibilities for the corresponding image.

2.2.1 Facilitation via reduced inhibition

Figure 2 suggests that recognizable structures within the local regions of natural scenes facilitate (rather than mask) distortion visibility. This observation also held for several other images in our dataset (see Ref. 24 for a description of the observed facilitation).
Here, to model this "structural facilitation," we employ an inhibition multiplication factor (λs) in the gain control equation:

r(x_0, y_0, f_0, \theta_0) = g \cdot \frac{z(x_0, y_0, f_0, \theta_0)^p}{b^q + \lambda_s \sum_{(x,y,f,\theta) \in IN} z(x, y, f, \theta)^q}.    (2)

The value of λs varies depending on the strength of structure within an image. Figure 3 shows the change of inhibition in V1 simple cells depending on the strength of a structure. Recent cognitive studies36 have shown active feedback connections to V1 neurons coming from higher-level cortices. Our assumption is that recognition of a structure results in lower inhibition in V1 simple cells via increased feedback from higher levels.

Figure 3. The inhibition multiplier λs varies depending on structure strength: weak structure yields higher neuron inhibition (λs near 1), whereas strong structure gives rise to lower inhibition (smaller λs) to facilitate distortion visibility.

For the ith block of an image, λsi is calculated via the following sigmoid function:

\lambda_{s_i} = \begin{cases} 1 - \frac{1}{MN} \sum_{x,y=1,1}^{M,N} \frac{1}{1 + \exp\left(-\frac{S_i(x,y) - p(S,80)}{0.005}\right)}, & \max(S) > 0.04 \ \&\ \mathrm{Kurt}(S) > 3.5 \\ 1, & \text{otherwise}, \end{cases}    (3)

where S denotes a structure map of the image, Si denotes the ith block of S, M and N denote the block width and height, and p(S, 80) denotes the 80th percentile of S. The maximum value and kurtosis of the structure map S are used to determine whether there is localized strong structure within the image. The values of the parameters of Equation (3) were determined empirically.

2.2.2 Structure detection

The structure map S of an image is generated via the following equation, which combines several feature maps:

S = L_n \times Sh_n \times E_n \times (1 - D_{\mu_n})^2 \times (1 - D_{\sigma_n})^2.    (4)

Here, Ln denotes a map of local luminance, Shn denotes a map of local sharpness,37 En denotes a map of local Shannon (first-order) entropy, and Dµn and Dσn denote maps of, respectively, the average and standard deviation of fractal texture features38 computed for each local region. Each of these features was measured for 32 × 32-pixel blocks with 50% overlap between neighboring blocks. Each map was then globally normalized to bring all values within the range 0 to 1, and then resized via bicubic interpolation to match the dimensions of the input image. Examples of structure maps computed for two images are shown in Figure 4.

Figure 4. Structure maps of two example images (geckos and child_swimming). The colorbar denotes the structure strength at each spatial location of the structure map.
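The sketch below illustrates how the structure map of Equation (4) and the per-block inhibition multiplier of Equation (3) could be computed. It is a hedged approximation rather than the authors' code: the individual feature maps (luminance, sharpness,37 entropy, and fractal-texture statistics38) are assumed to be computed elsewhere and passed in, and the percentile, kurtosis, and averaging details follow Equation (3) as stated above.

```python
import numpy as np

def normalize01(m):
    """Globally normalize a feature map to the range [0, 1]."""
    m = m.astype(float)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def structure_map(L, Sh, E, Dmu, Dsigma):
    """Combine globally normalized feature maps into a structure map S (Eq. 4)."""
    Ln, Shn, En = normalize01(L), normalize01(Sh), normalize01(E)
    Dmun, Dsigman = normalize01(Dmu), normalize01(Dsigma)
    return Ln * Shn * En * (1 - Dmun) ** 2 * (1 - Dsigman) ** 2

def inhibition_multiplier(S, block, s0=0.005, pct=80,
                          max_thresh=0.04, kurt_thresh=3.5):
    """Per-block inhibition multiplier lambda_s (Eq. 3).

    S     : structure map of the whole image
    block : the i-th block S_i of the structure map
    """
    flat = S.ravel()
    # Pearson kurtosis (Gaussian = 3), compared against the 3.5 threshold of Eq. (3)
    kurt = np.mean((flat - flat.mean()) ** 4) / (flat.std() ** 4 + 1e-12)
    if S.max() <= max_thresh or kurt <= kurt_thresh:
        return 1.0
    p80 = np.percentile(S, pct)
    sigmoid = 1.0 / (1.0 + np.exp(-(block - p80) / s0))
    return 1.0 - sigmoid.mean()   # strong structure -> smaller multiplier
```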
2.2.3 Performance of the combined (CGC+SF) model

By incorporating the SF model into the optimized CGC model (as specified in Equation (2)), we were able to improve the model's predictions for regions of images containing recognizable structures, while not adversely affecting the predictions for other regions. Figure 5 shows some results of using the combined (CGC+SF) model. Observe that the structural facilitation improved the prediction performance in local regions of images containing recognizable structures. For example, in the geckos image, near the top gecko's skin, the predictions of the CGC+SF model better match the ground-truth thresholds than those of the CGC model alone. Furthermore, the Pearson correlation coefficients between the CGC+SF model predictions and the ground-truth thresholds also improved relative to the CGC model alone.

Figure 5. Structural facilitation improves the distortion visibility predictions in local regions of images containing recognizable structures. The Pearson correlation coefficient (PCC) of each predicted map with the experimental map improved from the optimized CGC model to the CGC+SF model: geckos 0.68 to 0.77, child_swimming 0.85 to 0.87, foxy 0.58 to 0.63, and couple 0.55 to 0.60.

3. DISTORTION VISIBILITY PREDICTION MODEL-2: CONVOLUTIONAL NEURAL NETWORK (CNN) MODEL

In terms of accuracy, the CGC+SF model performs quite well at predicting local distortion visibilities. However, this prediction accuracy comes at the cost of heavy computational and memory demands. In particular, the CGC+SF model uses a filterbank consisting of 20 log-Gabor filters yielding complex-valued responses to mimic on-center and off-center neural responses. This filtering is quite expensive in terms of memory requirements, which easily exceed the cache on modern processors.39 In addition, the computation of the CGC responses is expensive, particularly considering that an iterative search procedure is required for the CGC+SF model to predict the threshold for each block.

One way to overcome these issues, and the approach adopted in this paper, is to use a model based on a convolutional neural network40 (CNN). The key advantages of a CNN model are two-fold. First, a CNN model does not require the distorted (reconstructed) image as an input; the predictions can be made based solely on the original image. The distorted images are required only when training the CNN, and it is during this stage that the characteristics of particular compression artifacts are learned by the model. The second advantage is that a CNN model does not require a search procedure for its prediction.

3.1 CNN network architecture

In recent work, we proposed a CNN architecture (VNet-1)41 to predict local visibility thresholds for Gabor noise distortions in natural scenes. For VNet-1, the input patch dimension was 85 × 85 pixels (approximately 1.9° of visual angle), and the training data came from our aforementioned masking dataset.24 In this paper, we propose a modified CNN architecture (VNet-2) for predicting local visibility thresholds for HEVC distortions in individual frames of a video. Figure 6 shows the network architecture. The network contains three layers: C1, convolution; S2, subsampling; and F3, full connection. Each layer contains one feature map. The convolution kernel size of the C1 layer is 19 × 19, and the subsampling factor in layer S2 is 2. The dimensions of the input patch and feature maps of VNet-2 are M = N = 64, d1r = d1c = 46, and d2r = d2c = 23.

Figure 6. The network architecture (VNet-2) used to predict the detection thresholds: an M × N input patch, a d1r × d1c feature map after the C1 convolution layer, a d2r × d2c feature map after the S2 subsampling layer, and a single output (distortion visibility) from the F3 full-connection layer. For a description of the architecture, see Section 3.1.
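To make the layer dimensions concrete, a rough numpy sketch of a VNet-2-style forward pass is given below. The tanh activations and the LeNet-style subsampling (a 2 × 2 average scaled by a single trainable weight plus a bias) are assumptions on our part; the paper specifies only the layer shapes and parameter counts, and the actual networks were trained with the FIRST toolbox.42

```python
import numpy as np

def vnet2_forward(patch, k, b1, w2, b2, wf, bf, act=np.tanh):
    """Forward pass of a VNet-2-like network (Section 3.1), illustrative only.

    patch : 64x64 luma patch
    k     : 19x19 convolution kernel, b1 : scalar bias    (C1: 362 parameters)
    w2, b2: scalar weight and bias of subsampling layer   (S2:   2 parameters)
    wf, bf: 23x23 weights and scalar bias of output layer (F3: 530 parameters)
    """
    # C1: 'valid' convolution, 64 - 19 + 1 = 46
    c1 = np.zeros((46, 46))
    for i in range(46):
        for j in range(46):
            c1[i, j] = np.sum(patch[i:i + 19, j:j + 19] * k) + b1
    c1 = act(c1)

    # S2: 2x2 average pooling scaled by one trainable weight plus a bias, 46 -> 23
    s2 = act(c1.reshape(23, 2, 23, 2).mean(axis=(1, 3)) * w2 + b2)

    # F3: full connection to a single output (the predicted visibility, in dB)
    return float(np.sum(s2 * wf) + bf)
```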
3.2 CNN training parameters

VNet-2 contains a total of 894 trainable parameters: the C1 layer contains 362 trainable parameters (19 × 19 kernel + 1 bias), the S2 layer contains 2 trainable parameters (1 weight + 1 bias), and the F3 layer contains 530 trainable parameters (23 × 23 weights + 1 bias).

For training VNet-2, the ground-truth dataset was created using the outputs of the CGC+SF model described in Section 2. Specifically, the CGC+SF model was used to predict a visibility threshold for each 64 × 64 block of the 30 natural scenes from the CSIQ dataset,34 and these predicted thresholds were then used as ground truth for training the CNN model. Clearly, additional prediction error is introduced by using the CGC+SF model to train the CNN model. However, our masking dataset from Ref. 24 contained thresholds for 85 × 85 blocks, whereas thresholds for 64 × 64 blocks are needed for use in HEVC. Our CNN should thus be regarded as a fast approximation of the CGC+SF model.

Only the luma channel of each block was distorted, by changing the HEVC quantization parameter QP from 0 to 51. The visibility threshold (contrast detection threshold CT) and the corresponding quantization parameter threshold (QPT) were calculated for each of the 1920 patches using the CGC+SF model. Moreover, the number of patches was increased by a factor of four (to a total of 7680 patches) by horizontally and/or vertically flipping the original patches. Flipping a patch would not, in principle, change the threshold for visually detecting distortions within that patch; however, in terms of the network, a flipped patch provides a different input pattern and thus facilitates learning. We used 70% of the data for training, 15% for validation, and 15% for testing. The data were randomly selected from each image to avoid image bias. A committee of 10 networks was trained via the Flexible Image Recognition Software Toolbox42 using the resilient backpropagation algorithm;43 the results from the 10 networks were averaged to predict the thresholds.

3.3 Prediction performance of the CNN model

The CNN model provided reasonable predictions of the thresholds from the CGC+SF model. Figure 7 shows scatter plots of the CGC+SF thresholds versus the CNN predictions. The Pearson correlation coefficients were 0.95 and 0.93 for the training+validation and testing data, respectively. From Figure 7, observe that the scatter plots saturate at both ends of the visibility threshold range. Below −40 dB the patches are very smooth, and above 0 dB the patches are very dark. In both cases, the patches are devoid of visible content, and thus the coding artifacts would be imperceptible if multiple such blocks are coded similarly.

Figure 7. Scatter plots between the CGC+SF thresholds (ground-truth distortion visibilities, dB) and the CNN predictions (dB) for (a) training + validation data (PCC: 0.95) and (b) testing data (PCC: 0.93). Example patches at various levels of distortion visibility are also shown using arrows.

3.4 QP prediction using neural networks

The CNN model predicts the visibility threshold (detection threshold CT) for each patch. To use the model in an HEVC encoder, we thus need to predict the HEVC quantization parameter QP such that the resulting distortions exhibit a contrast at or below CT. Note that the quantization step (Qstep) in HEVC relates to QP via44

Q_{step} = \left(2^{1/6}\right)^{QP - 4}.    (5)
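Equation (5) is the standard HEVC relation in which Qstep doubles for every increase of 6 in QP. A small helper like the following (an illustrative sketch, not part of the proposed method) makes the mapping explicit.

```python
def qstep_from_qp(qp: int) -> float:
    """HEVC quantization step size as a function of QP (Eq. 5)."""
    return 2.0 ** ((qp - 4) / 6.0)

# Qstep doubles every 6 QP steps, e.g. QP = 22 -> Qstep = 8, QP = 28 -> Qstep = 16.
```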
Figure 8 shows plots of (a) Qstep and (b) log(Qstep) as a function of the contrast (C) of the induced HEVC distortion. Note that Qstep shows an exponential relationship with C, and log(Qstep) shows a quadratic relationship with C. Thus, we use the following equation to predict log(Qstep) from C:

\log(Q_{step}) = \alpha C^2 + \beta C + \gamma,    (6)

where α, β, and γ are model coefficients which change depending on the patch features.

Figure 8. (a) The quantization step (Qstep) has an exponential relationship with the RMS contrast (C) of the HEVC distortion, shown for three example patches; (b) log(Qstep) has a quadratic relationship with C.

The coefficients α, β, and γ can be reasonably well modeled by using four simple features of a patch: (i) the Michelson luminance contrast, (ii) the RMS luminance contrast, (iii) the standard deviation of luminance, and (iv) the y-axis intercept of a plot of the patch's log-magnitude spectrum. Descriptions of these features are available in the "Analysis" section of our masking dataset.24

To predict α, β, and γ from the patch features, we designed three separate committees of neural networks. Each committee consisted of five two-layer feed-forward networks, and each network contained 10 neurons. Each network was trained using a scaled conjugate gradient algorithm with 70% training, 15% validation, and 15% test data.

Figure 9 shows the scatter plots of α, β, and γ: the x-axis shows the values of α, β, and γ found by fitting Equation (6), and the y-axis shows the α, β, and γ predictions from the feed-forward networks. Observe that the patch features predict α, β, and γ reasonably well, yielding Pearson correlation coefficients of 0.96, 0.97, and 0.97, respectively.

Figure 9. Scatter plots of (a) α (PCC: 0.96), (b) β (PCC: 0.97), and (c) γ (PCC: 0.97). The x-axis shows α, β, and γ found by fitting Equation (6); the y-axis shows the α, β, and γ predictions from the feed-forward networks.

The prediction process of the quantization parameter QP for an input image patch is thus summarized as follows:

• First, predict the distortion visibility (CT) using the committee of VNet-2 networks.
• Second, calculate the four features of the patch: (i) Michelson contrast, (ii) RMS contrast, (iii) standard deviation of patch luminance, and (iv) y-axis intercept of the log-magnitude spectrum.
• Third, using the four features as inputs, predict α, β, and γ from the committees of feed-forward networks.
• Fourth, calculate log(Qstep) using Equation (6).
• Finally, calculate QP using

QP = \max\left(\min\left(\mathrm{round}\left[\frac{\log(Q_{step})}{\log(2^{1/6})}\right] + 4,\ 51\right),\ 0\right).    (7)

Furthermore, we have noticed that when CT < −40 dB the patches are very smooth and can therefore be coded with a higher QP. Thus, for both the CGC+SF model and the CNN model, QP was set to 30 for patches with CT < −40 dB.
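The following sketch strings the above steps together. It is illustrative only: predict_ct, patch_features, and predict_abc are hypothetical placeholders standing in for the trained VNet-2 committee, the four-feature extraction, and the feed-forward committees described above, which are not reproduced here.

```python
import math

def qp_from_patch(patch, predict_ct, patch_features, predict_abc,
                  smooth_threshold_db=-40.0, smooth_qp=30):
    """Predict the HEVC QP for a 64x64 luma patch (Section 3.4).

    predict_ct     : callable, patch -> visibility threshold CT in dB (VNet-2 committee)
    patch_features : callable, patch -> (Michelson contrast, RMS contrast,
                     std. of luminance, y-intercept of log-magnitude spectrum)
    predict_abc    : callable, features -> (alpha, beta, gamma) for Eq. (6)
    """
    ct = predict_ct(patch)                          # Step 1: distortion visibility (dB)
    if ct < smooth_threshold_db:                    # very smooth patch: fixed higher QP
        return smooth_qp

    feats = patch_features(patch)                   # Step 2: four simple patch features
    alpha, beta, gamma = predict_abc(feats)         # Step 3: quadratic coefficients
    log_qstep = alpha * ct**2 + beta * ct + gamma   # Step 4: Eq. (6) evaluated at C = CT

    # Step 5: Eq. (7) -- convert Qstep to QP and clip to the valid range [0, 51]
    qp = round(log_qstep / math.log(2 ** (1.0 / 6.0))) + 4
    return max(min(qp, 51), 0)
```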
4. RESULTS

In this section, we briefly present our preliminary results.

4.1 QP maps

Figure 10 shows the QP maps for two 512 × 512 images, generated from the CGC+SF and CNN models. Note that the QP maps show how the bits are allocated during the encoding process. For example, the dark regions in the redwood tree (in the redwood image) and the bush near the wood log (in the log_seaside image) show higher QP values, denoting that those regions will be coded more coarsely than other regions.

Figure 10. The quantization parameter (QP) maps for two images (redwood and log_seaside) obtained from the CGC+SF and CNN models.

4.2 Compression performance

Both the CGC+SF and CNN models take a 64 × 64-pixel image patch as input and predict the distortion visibility (CT) and the corresponding threshold QP. As a preliminary test, we used the QP maps to evaluate the compression performance on sample images from the CSIQ dataset.34 While coding an HEVC coding tree unit, we searched for the minimum QP around the spatial location of that unit within the image and assigned that minimum QP to the unit. Figure 11 shows example images displaying the compression performance. Note that coding using a QP map essentially adds only near- or below-threshold distortions, and thus judging the quality of the images is quite difficult. Therefore, to compare against the reference HEVC, a visual matching experiment was performed by two experienced subjects. The purpose of the experiment was to find the QP at which the reference-software-coded image matches the quality of the image coded using the CGC+SF model. After the visually equivalent reference-coded image was found, the corresponding bits-per-pixel (bpp) value was recorded.

In Figure 11, the first column shows the reference images. The second column shows the reference-software-coded images which are visually equivalent to the images coded using the CGC+SF QP map. The third column shows the images coded using the CGC+SF QP map, and the fourth column shows the images coded using the CNN QP map. Although subjective experiments are more reliable for assessing the quality of near/below-threshold distortions, we also report the structural similarity index (SSIM)45 between the reference images and the coded images. Note that the qualities of the images are very close. The bpp values and compression gains are also shown below the images. On average, using the QP maps from the CGC+SF and CNN models yielded 15% and 11% bit-rate savings, respectively. It should be noted that we generated the QP maps mainly from contrast masking and structural facilitation. Thus, if an image does not contain enough local regions to mask the distortions, using the QP map yields no gain.
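To make the CTU-level assignment described above concrete, the following sketch shows one way the block-level QP map could be mapped onto coding tree units by taking the minimum QP in each unit's spatial neighborhood. The 3 × 3 neighborhood is our reading of "the minimum QP around the spatial location of that unit," and the assumption that the 64 × 64 block grid and CTU grid coincide is ours; this is not the exact encoder integration.

```python
import numpy as np

def ctu_qp_assignment(qp_map):
    """Assign a QP to each CTU from a block-level QP map (Section 4.2).

    qp_map : 2-D array with one predicted QP per 64x64 block; since the CTU
    size is also 64x64, the block grid and CTU grid are assumed to coincide.
    Each CTU takes the minimum QP found in its 3x3 neighborhood of blocks.
    """
    rows, cols = qp_map.shape
    ctu_qp = np.empty_like(qp_map)
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(r - 1, 0), min(r + 2, rows)
            c0, c1 = max(c - 1, 0), min(c + 2, cols)
            ctu_qp[r, c] = qp_map[r0:r1, c0:c1].min()
    return ctu_qp
```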
4.3 Execution time of QP map prediction

One of the reasons for introducing a CNN model after the CGC+SF model is that the CGC+SF model is too slow to be considered for real-time applications. The execution times of the two models were measured on a computer with an Intel(R) Core(TM)2 Quad CPU Q9400 2.67 GHz processor, 12 GB of RAM, and an NVIDIA Quadro K5000 graphics card, running Matlab 2014a. The execution times were measured for generating QP maps for 512 × 512-pixel images with 64 × 64 blocks. The execution times for the CGC+SF model and the CNN model were 239.83 ± 30.564 sec and 2.24 ± 0.29 sec, respectively. Thus, on average, the CNN model was 106 times faster than the CGC+SF model.

Figure 11. Compression performance using the QP maps generated from the CGC+SF and CNN models. For each test image, the columns show the reference image, the visually equivalent reference-software-coded image, the image coded using the CGC+SF QP map, and the image coded using the CNN QP map. The SSIM, bpp, and bit-rate gains for the five test images were:

  Visually equivalent reference coding    Coded using CGC+SF QP map          Coded using CNN QP map
  SSIM 0.94, bpp 3.01                     SSIM 0.94, bpp 2.60, gain 13.7%    SSIM 0.94, bpp 2.58, gain 14.3%
  SSIM 0.89, bpp 2.13                     SSIM 0.88, bpp 1.82, gain 14.9%    SSIM 0.88, bpp 2.12, gain 0.5%
  SSIM 0.96, bpp 2.43                     SSIM 0.94, bpp 1.69, gain 18.3%    SSIM 0.92, bpp 1.35, gain 35.2%
  SSIM 0.96, bpp 2.49                     SSIM 0.94, bpp 2.25, gain 9.6%     SSIM 0.94, bpp 2.39, gain 3.9%
  SSIM 0.82, bpp 2.12                     SSIM 0.82, bpp 1.69, gain 20.2%    SSIM 0.84, bpp 2.03, gain 4.3%

5. CONCLUSIONS AND FUTURE WORK

This paper described a fast convolutional-neural-network-based quantization strategy for HEVC. Local artifact visibility was predicted via a network trained on data derived from an improved contrast gain control model. The contrast gain control model was trained on our recent database of local distortion visibility in natural scenes.24 Furthermore, to incorporate the effects of recognizable structures on distortion visibility, a structural facilitation model was proposed and incorporated into the contrast gain control model. Our results provide, on average, an 11% improvement in compression efficiency for HEVC using the QP map derived from the CNN model, while requiring almost one hundredth of the computational time of the CGC+SF model. Future avenues of our work include computational-time analysis for predicting distortion visibility in real-time videos.

REFERENCES

[1] "Recommendation and final draft international standard of joint video specification (ITU-T Rec. H.265 | ISO/IEC 23008-2 HEVC)," (Nov. 2013).
[2] Lubin, J., "Sarnoff JND vision model: Algorithm description and testing," Presentation by Jeffrey Lubin of Sarnoff to T1A1.5 (1997).
[3] Winkler, S., "Vision models and quality metrics for image processing applications," Ph.D. thesis, École Polytechnique Fédérale de Lausanne (2000).
[4] Watson, A. B., Hu, J., and McGowan, J. F., "Digital video quality metric based on human vision," Journal of Electronic Imaging 10(1), 20–29 (2001).
[5] Watson, A. B., "DCTune: A technique for visual optimization of DCT quantization matrices for individual images," in [SID International Symposium Digest of Technical Papers], 24, 946–949, Society for Information Display (1993).
[6] Chou, C.-H. and Li, Y.-C., "A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile," IEEE Transactions on Circuits and Systems for Video Technology 5(6), 467–476 (1995).
[7] Höntsch, I. and Karam, L. J., "Adaptive image coding with perceptual distortion control," IEEE Transactions on Image Processing 11(3), 213–222 (2002).
[8] Chou, C.-H., Chen, C.-W., et al., "A perceptually optimized 3-D subband codec for video communication over wireless channels," IEEE Transactions on Circuits and Systems for Video Technology 6(2), 143–156 (1996).
[9] Peterson, H. A., Peng, H., Morgan, J., and Pennebaker, W. B., "Quantization of color image components in the DCT domain," in [Electronic Imaging '91, San Jose, CA], 210–222, International Society for Optics and Photonics (1991).
[10] Ahumada Jr., A. J. and Peterson, H. A., "Luminance-model-based DCT quantization for color image compression," in [SPIE/IS&T 1992 Symposium on Electronic Imaging: Science and Technology], 365–374, International Society for Optics and Photonics (1992).
[11] Zhang, X., Lin, W., and Xue, P., "Improved estimation for just-noticeable visual distortion," Signal Processing 85(4), 795–808 (2005).
[12] Jia, Y., Lin, W., Kassim, A., et al., "Estimating just-noticeable distortion for video," IEEE Transactions on Circuits and Systems for Video Technology 16(7), 820–829 (2006).
[13] Naccari, M. and Pereira, F., "Advanced H.264/AVC-based perceptual video coding: Architecture, tools, and assessment," IEEE Transactions on Circuits and Systems for Video Technology 21(6), 766–782 (2011).
[14] Foley, J. M. and Boynton, G. M., "New model of human luminance pattern vision mechanisms: analysis of the effects of pattern orientation, spatial phase, and temporal frequency," in [Proceedings of SPIE], 2054, 32 (1994).
[15] Naccari, M. and Mrak, M., "Intensity dependent spatial quantization with application in HEVC," in [Multimedia and Expo (ICME), 2013 IEEE International Conference on], 1–6, IEEE (2013).
[16] Blasi, S. G., Mrak, M., and Izquierdo, E., "Masking of transformed intra-predicted blocks for high quality image and video coding," in [Image Processing (ICIP), 2014 IEEE International Conference on], 3160–3164, IEEE (2014).
[17] Yu, Y., Zhou, Y., Wang, H., Tu, Q., and Men, A., "A perceptual quality metric based rate-quality optimization of H.265/HEVC," in [Wireless Personal Multimedia Communications (WPMC), 2014 International Symposium on], 12–17, IEEE (2014).
[18] Li, Y., Liao, W., Huang, J., He, D., and Chen, Z., "Saliency based perceptual HEVC," in [Multimedia and Expo Workshops (ICMEW), 2014 IEEE International Conference on], 1–5, IEEE (2014).
[19] Caelli, T. and Moraglia, G., "On the detection of signals embedded in natural scenes," Perception and Psychophysics 39, 87–95 (1986).
[20] Nadenau, M. J., Reichel, J., and Kunt, M., "Performance comparison of masking models based on a new psychovisual test method with natural scenery stimuli," Signal Processing: Image Communication 17(10), 807–823 (2002).
[21] Watson, A. B., Borthwick, R., and Taylor, M., "Image quality and entropy masking," Proceedings of SPIE 3016 (1997).
[22] Chandler, D. M., Gaubatz, M. D., and Hemami, S. S., "A patch-based structural masking model with an application to compression," J. Image Video Process. 2009, 1–22 (2009).
[23] Chandler, D. M., Alam, M. M., and Phan, T. D., "Seven challenges for image quality research," in [IS&T/SPIE Electronic Imaging], 901402, International Society for Optics and Photonics (2014).
[24] Alam, M. M., Vilankar, K. P., Field, D. J., and Chandler, D. M., "Local masking in natural images: A database and analysis," Journal of Vision 14(8), 22 (2014). Database available at: http://vision.okstate.edu/masking/.
[25] Alam, M. M., Vilankar, K. P., and Chandler, D. M., "A database of local masking thresholds in natural images," in [IS&T/SPIE Electronic Imaging], 86510G, International Society for Optics and Photonics (2013).
[26] Legge, G. E. and Foley, J. M., "Contrast masking in human vision," J. of Opt. Soc. Am. 70, 1458–1470 (1980).
[27] Wei, Z. and Ngan, K. N., "Spatio-temporal just noticeable distortion profile for grey scale image/video in DCT domain," IEEE Transactions on Circuits and Systems for Video Technology 19(3), 337–346 (2009).
[28] Teo, P. C. and Heeger, D. J., "Perceptual image distortion," Proceedings of SPIE 2179, 127–141 (1994).
[29] Foley, J. M., "Human luminance pattern-vision mechanisms: Masking experiments require a new model," J. of Opt. Soc. Am. A 11(6), 1710–1719 (1994).
[30] Lubin, J., "A visual discrimination model for imaging system design and evaluation," in [Vision Models for Target Detection and Recognition], Peli, E., ed., 245–283, World Scientific (1995).
[31] Watson, A. B. and Solomon, J. A., "A model of visual contrast gain control and pattern masking," J. of Opt. Soc. Am. A 14(9), 2379–2391 (1997).
[32] DeValois, R. L. and DeValois, K. K., [Spatial Vision], Oxford University Press (1990).
[33] Daly, S. J., "Visible differences predictor: an algorithm for the assessment of image fidelity," in [Digital Images and Human Vision], Watson, A. B., ed., 179–206 (1993).
[34] Larson, E. C. and Chandler, D. M., "Most apparent distortion: full-reference image quality assessment and the role of strategy," Journal of Electronic Imaging 19(1), 011006 (2010).
[35] Peli, E., "Contrast in complex images," J. of Opt. Soc. Am. A 7, 2032–2040 (1990).
[36] Juan, C.-H. and Walsh, V., "Feedback to V1: a reverse hierarchy in vision," Experimental Brain Research 150(2), 259–263 (2003).
[37] Vu, C. T., Phan, T. D., and Chandler, D. M., "S3: A spectral and spatial measure of local perceived sharpness in natural images," IEEE Transactions on Image Processing 21(3), 934–945 (2012).
[38] Costa, A. F., Humpire-Mamani, G., and Traina, A. J. M., "An efficient algorithm for fractal analysis of textures," in [Graphics, Patterns and Images (SIBGRAPI), 2012 25th SIBGRAPI Conference on], 39–46, IEEE (2012).
[39] Phan, T. D., Shah, S. K., Chandler, D. M., and Sohoni, S., "Microarchitectural analysis of image quality assessment algorithms," Journal of Electronic Imaging 23(1), 013030 (2014).
[40] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., "Gradient-based learning applied to document recognition," Proceedings of the IEEE 86(11), 2278–2324 (1998).
[41] Alam, M. M., Patil, P., Hagan, M. T., and Chandler, D. M., "A computational model for predicting local distortion visibility via convolutional neural network trained on natural scenes," in [IEEE International Conference on Image Processing (ICIP)], accepted for presentation, IEEE (2015).
[42] Patil, P., "Flexible image recognition software toolbox (FIRST)," MS thesis, Oklahoma State University, Stillwater, OK (July 2013).
[43] Riedmiller, M. and Braun, H., "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in [Neural Networks, 1993, IEEE International Conference on], 586–591, IEEE (1993).
[44] Sze, V., Budagavi, M., and Sullivan, G. J., [High Efficiency Video Coding (HEVC)], Springer (2014).
[45] Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P., "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing 13(4), 600–612 (2004).