A perceptual quantization strategy for HEVC based on a
convolutional neural network trained on natural images
Md Mushfiqul Alam∗a, Tuan D. Nguyena, Martin T. Hagana, and Damon M. Chandlerb
aSchool of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK, USA;
bDept. of Electrical and Electronic Engineering, Shizuoka University, Hamamatsu, Japan
ABSTRACT
Fast prediction models of local distortion visibility and local quality can potentially make modern spatiotemporally adaptive coding schemes feasible for real-time applications. In this paper, a fast convolutional-neural-network-based quantization strategy for HEVC is proposed. Local artifact visibility is predicted via a network trained on data derived from our improved contrast gain control model. The contrast gain control model was trained on our recent database of local distortion visibility in natural scenes [Alam et al. JOV 2014]. Furthermore, a structural facilitation model was proposed to capture the effects of recognizable structures on distortion visibility via the contrast gain control model. Our results provide on average an 11% improvement in compression efficiency for the spatial luma channel of HEVC while requiring almost one hundredth of the computational time of an equivalent gain control model. Our work opens the doors for similar techniques which may work for forthcoming compression standards.
Keywords: HEVC, distortion visibility, convolutional neural network, natural images, visual masking
1. INTRODUCTION
Most digital videos have undergone some form of lossy coding, a technology which has enabled the transmission
and storage of vast amounts of videos with increasingly higher resolutions. A key attribute of modern lossy coding
is the ability to trade off the resulting bit-rate against the resulting visual quality. The latest coding standard, high
efficiency video coding (HEVC),1 has emerged as an effective successor to H.264, particularly for high-quality
videos. In this high-quality regime, an often-sought goal is to code each spatiotemporal region with the minimum
number of bits required to keep the resulting distortions imperceptible. Achieving such a goal, however, requires
the ability to accurately and efficiently predict the local visibility of coding artifacts, a goal which currently
remains elusive.
The most common technique of estimating distortion visibility is to employ a model of just noticeable distortion (JND);2–4 and such JND-type models have been used to design improved quantization strategies for use in
image coding.5–8 However, these approaches are based largely on findings from highly controlled psychophysical
studies using unnatural videos. For example, Peterson et al.9 performed an experiment to determine visual
sensitivity to DCT basis functions in the RGB and YCrCb color spaces; and, these results have been used to
develop distortion visibility models.5, 10–12 However, the content of a natural image or video can induce visual masking, thus changing the visual sensitivity to the DCT basis functions.
Several researchers have also employed distortion visibility models specifically for HEVC and H.264. Naccari and Pereira proposed a just-noticeable-difference (JND) model for H.264 by considering visual masking in
luminance, frequency, pattern, and temporal domains.13 However, the coding scheme was based on Foley and
Boynton’s14 contrast gain-control model which did not consider the gain control in V1 neurons. In another
study, Naccari et al. proposed a quantization scheme for HEVC to capture luminance masking based on measurements of pixel intensities.15 Blasi et al. proposed a quantization scheme for HEVC which took into account
visual masking by selectively discarding frequency components of the residual signal depending on the input patterns.16 However, the scheme of frequency-channel discarding was not based on psychophysical measurements.
Other researchers, for example Yu et al.17 and Li et al.,18 have used saliency information, rather than visual masking, to achieve higher compression for HEVC.
*mdma@okstate.edu; phone: 1 718 200-8502; vision.okstate.edu/masking
Applications of Digital Image Processing XXXVIII, edited by Andrew G. Tescher,
Proc. of SPIE Vol. 9599, 959918 · © 2015 SPIE · CCC code: 0277-786X/15/$18
doi: 10.1117/12.2188913
Thus, one potential area for improved coding is to develop and employ a visibility model which is not just
designed for naturalistic masks (natural videos), but also designed to adapt to individual video characteristics
(see, e.g., Refs. 19–23, which aim toward this adaptive approach for images). Another area for improvement is
in regard to runtime performance. Operating full-scale distortion visibility models in a block-based fashion, so as
to achieve local predictions, is often impractical due to prohibitive computational and memory demands. Thus,
predicting local distortion visibility in a manner which is simultaneously more accurate and more efficient than
existing models remains an open research challenge.
Here, we propose an HEVC-based quantization scheme based on a fast model of local artifact visibility
designed specifically for natural videos. The model uses a convolutional-neural-network (CNN) architecture for
predicting local artifact visibility and quality of each spatiotemporal region of a video. We trained the CNN
model using our recently published database of local masking and quality in natural scenes.24, 25 When coupled
with an HEVC encoder, our results provided 11% more compression over baseline HEVC when matched in visual
quality, while requiring one hundredth the computational time of an equivalent gain control model. Our work
opens the doors for similar techniques which may work for different forthcoming compression standards.
2. DISTORTION VISIBILITY PREDICTION MODEL-1: CONTRAST GAIN CONTROL WITH STRUCTURAL FACILITATION (CGC+SF)
This paper describes two separate models for predicting distortion visibility: (1) A model of masking which
operates by simulating V1 neural responses with contrast gain control (CGC); and (2) a model which employs a
convolutional neural network (CNN). The training data of the CNN model are derived from the CGC model.
Contrast masking26 has been widely used for predicting distortion visibility in images and videos.5, 11, 12, 27
Among the many existing models of contrast masking, those which simulate the contrast gain-control response
properties of V1 neurons are most widely used. Although several gain control models have been proposed
in previous studies (e.g., Refs. 26, 28–31), in most cases, the model parameters are selected based on results
obtained using either unnatural masks31 or only a very limited number of natural images.20, 22 Here, we have
used the contrast masking model proposed by Watson and Solomon.31 This model is well-suited for use as a generic distortion visibility predictor because: (1) the model mimics V1 neurons and is therefore theoretically distortion-type-agnostic; (2) the model parameters can be tuned to fall within biologically plausible ranges; (3) the model can directly take images as inputs, rather than features of images as inputs; and (4) the model can operate in a
spatially localized fashion.
We improved upon the CGC model in two ways. First, we optimized the CGC model by using our recently
developed dataset of local masking in natural scenes24 (currently, the largest dataset of its kind). Second, we
incorporated into the CGC model a structural facilitation model which better captures the reduced masking
observed in structured regions.
2.1 Watson-Solomon31 contrast gain control (CGC) model
This subsection briefly describes the Watson and Solomon contrast gain-control model.31 Figure 1 shows the
Watson and Solomon model flow. Both the mask (image) and mask+target (distorted image) go through the same
stages: a spatial filter to simulate the contrast sensitivity function (CSF), a log-Gabor decomposition, excitatory
and inhibitory nonlinearities, inhibitory pooling, and division.31 After division, the subtracted responses of the
mask and mask+target are pooled via Minkowski pooling.31 The target contrast is changed iteratively until the
Minkowski-pooled response equals a pre-defined “at threshold” decision (d), at which point, the contrast of the
target is deemed the final threshold.
The Watson and Solomon model is a model of V1 simple-cell responses (after “Divide” in Figure 1). Let
z(x0 , y0 , f0 , θ0 ) denote the initial linear modeled neural response (log-Gabor filter output) at location (x0 , y0 ),
center radial frequency f0 , and orientation θ0 . In the model, the response of a neuron (after division) tuned to
these parameters is given by
$$ r(x_0, y_0, f_0, \theta_0) = g \cdot \frac{z(x_0, y_0, f_0, \theta_0)^p}{b^q + \sum_{(x,y,f,\theta) \in IN} z(x, y, f, \theta)^q}, \qquad (1) $$
[Figure 1 block diagram: the mask (image) and the mask+target (distorted image) each pass through a CSF filter, a log-Gabor decomposition, excitatory and inhibitory nonlinearities, inhibitory pooling, and division; the two responses are subtracted and Minkowski-pooled, and the target contrast is changed until the pooled response d̂ reaches the decision value d, at which point the final threshold is calculated.]
Figure 1. Flow of the Watson and Solomon contrast gain control model.31 For a description of the model, please see Section 2.1.
where g is the output gain,22 p provides the point-wise excitatory nonlinearity, q provides the point-wise inhibitory
nonlinearity from the neurons in the inhibitory pool, IN indicates the set of neurons which are included in the
inhibitory pool, and b represents the semi-saturation constant.
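To make the computation concrete, the following is a minimal NumPy sketch of the divisive normalization in Equation (1). It assumes the log-Gabor responses z have already been computed; the array layout and the optional pooling callable are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def gain_control_response(z, g=0.1, p=2.4, q=2.35, b=0.035, pool=None):
    """Divisive gain control of Eq. (1): r = g * z^p / (b^q + pooled z^q).

    z    : array of linear log-Gabor responses, e.g. shape (n_theta, n_f, H, W).
    pool : optional callable implementing the inhibitory pooling over space,
           frequency, and orientation; identity pooling is used if None.
    """
    z = np.abs(z)                    # magnitude of the (complex) filter output
    excitatory = g * z ** p          # point-wise excitatory nonlinearity
    inhibitory = z ** q              # point-wise inhibitory nonlinearity
    if pool is not None:
        inhibitory = pool(inhibitory)  # sum over the neurons in the pool IN
    return excitatory / (b ** q + inhibitory)
```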
Watson and Solomon31 optimized their model parameters using results of Foley and Boynton’s14 masking
experiment using sine-wave gratings and Gabor patterns. However, as mentioned previously, we have recently
developed a large dataset of local masking in natural scenes.24 Here, this dataset was used with a brute-force
search to determine the optimum model parameters within biologically plausible ranges.22, 32 The parameters of
the optimized gain control model are given in Table 1.
Table 1. Parameters of the optimized gain control model. Parameters marked with (*) were varied. Values of the parameters without (*) were taken from Watson and Solomon's31 work and Chandler et al.'s22 work. Detailed descriptions of the functions can be found in our previous work.24

Symbol | Description | Value
Contrast sensitivity filter33,34
fCSF | Peak frequency | 6.0 c/deg
BWCSF | log10 bandwidth | 1.43
Log-Gabor decomposition
nf* | Number of frequency bands | 6
BWf* | Frequency channel bandwidth | 2.75 octaves
f0G* | Center radial frequencies of the bands | 0.3, 0.61, 1.35, 3.22, 7.83, 16.1 c/deg
nθ | Number of orientation channels | 6
BWθ | Orientation channel bandwidth | 30°
θ0G | Center angles of orientation channels | 0°, ±30°, ±60°, 90°
Divisive gain control
p, q* | Excitatory and inhibitory exponents | 2.4, 2.35
b | Semi-saturation constant | 0.035
g* | Output gain | 0.1
sx,y | Inhibitory pool kernel: space | 3 × 3 Gaussian, zero mean, 0.5 standard deviation24
sf | Inhibitory pool kernel: frequency | ±0.7 octave bandwidth with equal weights24
sθ | Inhibitory pool kernel: orientation | ±60° bandwidth with equal weights24
βr, βf, βθ | Minkowski exponents: space, frequency, and orientation | 2.0, 1.5, 1.5
d | "At threshold" decision | 1.0
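As an illustration of the brute-force parameter search described above, the sketch below evaluates candidate parameter sets against the ground-truth masking thresholds via Pearson correlation. The parameter grids and the `predict_thresholds` callable are hypothetical placeholders; the paper does not specify the actual search grid or evaluation procedure.

```python
import itertools
import numpy as np
from scipy.stats import pearsonr

# Hypothetical grids for the starred parameters in Table 1 (illustrative values
# within biologically plausible ranges; the actual grids are not given here).
param_grid = {
    "p": [2.2, 2.3, 2.4, 2.5],
    "q": [2.25, 2.35, 2.45],
    "g": [0.05, 0.1, 0.2],
    "bw_f": [2.0, 2.75, 3.5],   # frequency-channel bandwidth (octaves)
}

def brute_force_search(images, gt_thresholds, predict_thresholds):
    """Pick the parameter set whose CGC predictions best correlate (Pearson)
    with the ground-truth local masking thresholds."""
    best_params, best_pcc = None, -np.inf
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        preds = np.concatenate([predict_thresholds(im, params) for im in images])
        pcc, _ = pearsonr(preds, gt_thresholds)
        if pcc > best_pcc:
            best_params, best_pcc = params, pcc
    return best_params, best_pcc
```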
2.2 Structural facilitation (SF) model
With the abovementioned parameter optimizations, the Watson and Solomon gain control model performed fairly well in predicting the detection thresholds from our local masking database; a Pearson correlation coefficient of 0.83 was observed between the ground-truth thresholds and the model predictions. However, the model consistently overestimated the thresholds for (underestimated the visibilities of) distortions in regions containing recognizable structures. For example, Figure 2 shows how the optimized Watson and Solomon model overestimates the thresholds near the top of the gecko's body and in the child's face.
[Figure 2 panels: experiment distortion visibility maps vs. optimized contrast gain control predictions for the geckos and child_swimming images; regions of wrong predictions are marked; colorbar ranges approximately −45 to −20 dB.]
Figure 2. The optimized Watson and Solomon model overestimated detection thresholds for distortions in image regions containing recognizable structures (e.g., near the top of the gecko's body and in the child's face). The distortion visibility maps show the contrast detection thresholds at each spatial location of the images in terms of RMS contrast.35 The colorbars on the right denote the ranges of distortion visibilities for the corresponding image.
2.2.1 Facilitation via reduced inhibition
Figure 2 suggests that recognizable structures within the local regions of natural scenes facilitate (rather than
mask) the distortion visibility. This observation also occurred in several other images in our dataset (see Ref.
24 for a description of the observed facilitation). Here, to model this “structural facilitation,” we employ an
inhibition multiplication factor (λs ) in the gain control equation:
$$ r(x_0, y_0, f_0, \theta_0) = g \cdot \frac{z(x_0, y_0, f_0, \theta_0)^p}{b^q + \lambda_s \sum_{(x,y,f,\theta) \in IN} z(x, y, f, \theta)^q}. \qquad (2) $$
The value of λs varies depending on the strength of structure within an image. Figure 3 shows the change of inhibition in V1 simple cells depending on the strength of a structure.
[Figure 3 plot: inhibition multiplier λs (0 to 1) as a function of structure strength S (0 to 0.08); weak structure corresponds to higher neuron inhibition, strong structure to lower neuron inhibition.]
Figure 3. The inhibition multiplier λs varies depending on structure strength. Strong structures give rise to lower inhibition, facilitating distortion visibility.
Recent cognitive studies36 have shown
active feedback connections to V1 neurons coming from higher level cortices. Our assumption is that recognition
of a structure results in lower inhibition in V1 simple cells via increased feedback from higher levels. For the ith
block of an image, λsi is calculated via the following sigmoid function:
$$ \lambda_{s_i} = \begin{cases} 1 - \dfrac{1}{80} \displaystyle\sum_{x,y=1,1}^{M,N} \frac{1}{1 + \exp\left( -\dfrac{S_i(x,y) - p(S,80)}{0.005} \right)}, & \max(S) > 0.04 \ \text{and} \ \mathrm{Kurt}(S) > 3.5 \\ 1, & \text{otherwise}, \end{cases} \qquad (3) $$
where S denotes a structure map of an image, Si denotes the ith block of S, M and N denote the block width and
height, and p(S, 80) denotes the 80th percentile of S. The maximum value and kurtosis of the structure map S
are used to determine whether there is localized strong structure within the image. The values of the parameters
of equation 3 were determined empirically.
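A small sketch of the block-wise inhibition multiplier in Equation (3) is given below; the 1/80 normalization of the summed sigmoid and the clipping to [0, 1] are our assumptions about details not fully spelled out in the text.

```python
import numpy as np
from scipy.stats import kurtosis

def inhibition_multiplier(S_i, S, max_thresh=0.04, kurt_thresh=3.5,
                          percentile=80, slope=0.005):
    """Block-wise inhibition multiplier lambda_s of Eq. (3).

    S_i : M x N block of the structure map.
    S   : full structure map of the image.
    """
    # Facilitation is applied only when the image contains localized strong structure.
    if S.max() <= max_thresh or kurtosis(S, axis=None, fisher=False) <= kurt_thresh:
        return 1.0
    p80 = np.percentile(S, percentile)
    sig = 1.0 / (1.0 + np.exp(-(S_i - p80) / slope))  # per-pixel sigmoid
    lam = 1.0 - sig.sum() / 80.0                      # 1/80 normalization: our assumption
    return float(np.clip(lam, 0.0, 1.0))              # keep in [0, 1], as in Figure 3
```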
2.2.2 Structure detection
The structure map S of an image is generated via the following equation, which combines several feature maps:

$$ S = L_n \times Sh_n \times E_n \times (1 - D_{\mu n})^2 \times (1 - D_{\sigma n})^2. \qquad (4) $$
L_n denotes a map of local luminance. Sh_n denotes a map of local sharpness.37 E_n denotes a map of local Shannon (first-order) entropy. D_{μn} and D_{σn} denote maps of, respectively, the average and the standard deviation of fractal texture features38 computed for each local region. Each of these features was measured for 32 × 32-pixel blocks with 50% overlap between neighboring blocks. Each map was then globally normalized to bring all values within the range 0 to 1, and then resized via bicubic interpolation to match the dimensions of the input image. Examples of structure maps computed for two images are shown in Figure 4.
[Figure 4 panels: structure maps of the geckos and child_swimming images; colorbar range 0 to 0.09.]
Figure 4. Structure maps of two example images. The colorbar at right denotes the structure strength at each spatial location of the structure map.
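The following sketch combines the normalized feature maps into the structure map of Equation (4). The feature maps themselves (luminance, sharpness, entropy, fractal texture statistics) are assumed to be computed elsewhere, and scipy's cubic spline zoom stands in for the bicubic interpolation mentioned in the text.

```python
import numpy as np
from scipy.ndimage import zoom

def normalize(m):
    """Globally normalize a feature map to the range [0, 1]."""
    m = m.astype(float)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def structure_map(L, Sh, E, D_mu, D_sigma, out_shape):
    """Combine block-wise feature maps into the structure map S of Eq. (4)
    and resize it to the input-image dimensions."""
    Ln, Shn, En = normalize(L), normalize(Sh), normalize(E)
    Dmun, Dsign = normalize(D_mu), normalize(D_sigma)
    S = Ln * Shn * En * (1.0 - Dmun) ** 2 * (1.0 - Dsign) ** 2
    factors = (out_shape[0] / S.shape[0], out_shape[1] / S.shape[1])
    return zoom(S, factors, order=3)  # cubic interpolation as a stand-in for bicubic
```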
2.2.3 Performance of the combined (CGC+SF) model
By incorporating the SF model with the optimized CGC model (as specified in Equation (2)), we were able to
improve the model’s predictions for regions of images containing recognizable structures, while not adversely
affecting the predictions for other regions. Figure 5 shows some results of using the combined (CGC+SF)
model. Observe that the structural facilitation improved the prediction performance in local regions of images
containing recognizable structures. For example, in the geckos image, near the top gecko’s skin, the predictions
using the CGC+SF model better match the ground-truth thresholds as compared to using only the CGC model.
Furthermore, the Pearson correlation coefficients between the CGC+SF model predictions and ground-truth
thresholds also improved as compared to using the CGC model alone.
3. DISTORTION VISIBILITY PREDICTION MODEL-2: CONVOLUTIONAL NEURAL NETWORK (CNN) MODEL
In terms of accuracy, the CGC+SF model performs quite well at predicting local distortion visibilities. However,
this prediction accuracy comes at the cost of heavy computational and memory demands. In particular, the
[Figure 5 panels: experiment distortion visibility maps vs. predictions of the optimized CGC model and the CGC+SF model for four images. PCC with the experiment map: geckos 0.68 (CGC) / 0.77 (CGC+SF); child_swimming 0.85 / 0.87; foxy 0.58 / 0.63; couple 0.55 / 0.60.]
Figure 5. Structural facilitation improves the distortion visibility predictions in local regions of images containing recognizable structures. The Pearson correlation coefficient (PCC) of each map with the experiment map is shown below the map.
CGC+SF model uses a filterbank consisting of 20 log-Gabor filters yielding complex-valued responses to mimic
on-center and off-center neural responses. This filtering is quite expensive in terms of memory requirements,
which easily exceed the cache on modern processors.39 In addition, the computation of the CGC responses is
expensive, particularly considering that an iterative search procedure is required for the CGC+SF model to
predict the thresholds for each block.
One way to overcome these issues, which is the approach adopted in this paper, is to use a model based on a
convolutional neural network40 (CNN). The key advantages of a CNN model are two-fold. First, a CNN model
does not require the distorted (reconstructed) image as an input; the predictions can be made based solely on
the original image. The distorted images are required only when training the CNN, and it is during this stage
that characteristics of particular compression artifacts are learned by the model. The second advantage is that
a CNN model does not require a search procedure for its prediction.
3.1 CNN Network architecture
In a recent work we have proposed a CNN architecture (VNet-1)41 to predict local visibility thresholds for Gabor
noise distortions in natural scenes. For VNet-1 the input patch dimension was 85 × 85 pixels (approximately
1.9◦ of visual angle), and the training data came from our aforementioned masking dataset.24
In this paper, we propose a modified CNN architecture (VNet-2) for predicting local visibility thresholds
for HEVC distortions for individual frames of a video. Figure 6 shows the network architecture. The network
contains three layers, C1: Convolution, S2: Subsampling, and F3: Full connection. Each layer contains one
feature map. The convolution kernel size of the C1 layer is 19 × 19. The subsampling factor in layer S2 is 2. The
dimensions of the input patch and feature maps of VNet-2 are M = N = 64, d1r = d1c = 46, d2r = d2c = 23.
[Figure 6 diagram: input patch (M × N) → Layer C1: convolution → feature map (d1r × d1c) → Layer S2: subsampling → feature map (d2r × d2c) → Layer F3: full connection → output (distortion visibility).]
Figure 6. The network architecture (VNet-2) used to predict the detection thresholds. For a description of the architecture, please see Section 3.1.
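A minimal PyTorch sketch of an architecture with these dimensions is shown below. The choice of average pooling with a single trainable scale and bias for S2, and the tanh activations, are assumptions consistent with classic LeNet-style subsampling rather than details stated in the paper.

```python
import torch
import torch.nn as nn

class VNet2(nn.Module):
    """Sketch of VNet-2: one 19x19 convolution (C1), one 2x2 subsampling layer
    with a single trainable weight and bias (S2), and a full connection to a
    single output (F3)."""

    def __init__(self):
        super().__init__()
        self.c1 = nn.Conv2d(1, 1, kernel_size=19)      # 64 -> 46, 361 weights + 1 bias
        self.pool = nn.AvgPool2d(2)                    # 46 -> 23
        self.s2_weight = nn.Parameter(torch.ones(1))   # 1 trainable weight
        self.s2_bias = nn.Parameter(torch.zeros(1))    # 1 trainable bias
        self.f3 = nn.Linear(23 * 23, 1)                # 529 weights + 1 bias

    def forward(self, x):                              # x: (batch, 1, 64, 64)
        x = torch.tanh(self.c1(x))
        x = torch.tanh(self.pool(x) * self.s2_weight + self.s2_bias)
        return self.f3(x.flatten(1))                   # predicted threshold (dB)

model = VNet2()
# Sanity check: total trainable parameters match the 894 reported in Section 3.2.
assert sum(p.numel() for p in model.parameters()) == 894
```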
3.2 CNN training parameters
VNet-2 contains a total of 894 trainable parameters: the C1 layer contains 362 trainable parameters (19 × 19 kernel weights + 1 bias), the S2 layer contains 2 trainable parameters (1 weight + 1 bias), and the F3 layer contains 530 trainable parameters (23 × 23 weights + 1 bias).
For training VNet-2, the ground-truth dataset was created using the outputs of the CGC+SF model described
in Section 2. Specifically, the CGC+SF model was used to predict a visibility threshold for each 64 × 64 block of
the 30 natural scenes from the CSIQ dataset,34 and these predicted thresholds were then used as ground-truth
for training the CNN model. Clearly, additional prediction error is imposed by using the CGC+SF model to
train the CNN model. However, our masking dataset from Ref. 24 contained thresholds for 85 × 85 blocks,
whereas thresholds for 64 × 64 blocks are needed for use in HEVC. Our CNN should thus be regarded as a fast
approximation of the CGC+SF model.
Only the luma channel of each block was distorted by changing the HEVC quantization parameter QP from 0
to 51. The visibility threshold (contrast detection threshold CT ) and the corresponding quantization parameter
threshold (QPT ) were calculated for each of the 1920 patches using the CGC+SF model. Moreover, the number of
patches was increased by a factor of four (to a total of 7680 patches) by horizontally and/or vertically flipping the
original patches. Flipping a patch would not, in principle, change the threshold for visually detecting distortions
within that patch; however, in terms of the network, flipping a patch provides a different input pattern, and thus
facilitates learning.
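A sketch of the four-fold augmentation by flipping, assuming patches are stored as NumPy arrays:

```python
import numpy as np

def augment_by_flipping(patch, threshold):
    """Expand one 64x64 patch into four training examples by horizontal and/or
    vertical flipping; the visibility threshold is unchanged by flipping."""
    variants = [patch,
                np.fliplr(patch),
                np.flipud(patch),
                np.flipud(np.fliplr(patch))]
    return [(v.copy(), threshold) for v in variants]
```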
We used 70% of the data for training, 15% for validation, and 15% for testing. The data were randomly selected from each image to avoid image bias. A committee of 10 networks was trained via the Flexible Image Recognition Software Toolbox42 using the resilient backpropagation algorithm;43 the results from the 10 networks were averaged to predict the thresholds.
3.3 Prediction performance of CNN model
The CNN model provided reasonable predictions of the thresholds from the CGC+SF model. Figure 7 shows the scatter plots between the CGC+SF thresholds and the CNN predictions. The Pearson correlation coefficients were 0.95 and 0.93 for the training+validation and testing data, respectively. From Figure 7, observe that the scatter plots saturate at both ends of the visibility threshold range. Below −40 dB the patches are very smooth, and above 0 dB the patches are very dark. In both cases, the patches are devoid of visible content, and thus the coding artifacts would be imperceptible if multiple blocks are coded similarly.
3.4 QP prediction using neural network
The CNN model predicts the visibility threshold (detection threshold CT ) for each patch. To use the model
in an HEVC encoder, we thus need to predict the HEVC quantization parameter QP such that the resulting
distortions exhibit a contrast at or below CT. Note that the quantization step (Qstep) in HEVC relates to QP via44

$$ Q_{\mathrm{step}} = \left( 2^{1/6} \right)^{QP - 4}. \qquad (5) $$
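For example, under this relation QP = 22 gives Qstep = (2^{1/6})^{18} = 2^3 = 8; more generally, every increase of 6 in QP doubles the quantization step.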
[Figure 7 scatter plots: (a) Training + Validation, PCC: 0.95; (b) Testing, PCC: 0.93. Axes: ground-truth distortion visibilities (outputs of the CGC+SF model, dB) vs. CNN predictions (dB), roughly −60 to 20 dB.]
Figure 7. Scatter plots between CGC+SF thresholds and CNN predictions. Example patches at various levels of distortion visibility are also indicated with arrows.
[Figure 8 plots: (a) Qstep and (b) log(Qstep) as functions of the RMS contrast C of the HEVC distortion (dB), for three example patches.]
Figure 8. (a) The quantization step (Qstep) has an exponential relationship with the contrast (C) of the HEVC distortion; (b) log(Qstep) has a quadratic relationship with C.
Figure 8 shows plots of (a) Qstep and (b) log(Qstep) as a function of the contrast (C) of the induced HEVC distortion. Note that Qstep shows an exponential relationship with C, and log(Qstep) shows a quadratic relationship with C. Thus, we use the following equation to predict log(Qstep) from C:

$$ \log(Q_{\mathrm{step}}) = \alpha C^2 + \beta C + \gamma, \qquad (6) $$
where α, β, and γ are model coefficients which change depending on patch features. The coefficients α, β, and γ
can be reasonably well-modeled by using four simple features of a patch: (i) the Michelson luminance contrast,
(ii) the RMS luminance contrast, (iii) the standard deviation of luminance, and (iv) the y-axis intercept of a plot
of the patch’s log-magnitude spectrum. Descriptions of these features are available in the “Analysis” section of
our masking dataset.24
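For one patch, fitting the quadratic of Equation (6) amounts to an ordinary least-squares polynomial fit; a sketch, assuming measured (C, Qstep) pairs obtained by sweeping QP, is:

```python
import numpy as np

def fit_qstep_curve(C, Qstep):
    """Fit log(Qstep) = alpha*C^2 + beta*C + gamma (Eq. 6) for one patch,
    given the measured distortion contrasts C (dB) at several QP settings."""
    alpha, beta, gamma = np.polyfit(C, np.log(Qstep), deg=2)
    return alpha, beta, gamma
```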
To predict α, β, and γ from the patch features, we designed three separate committees of neural networks.
Each committee had a total of five two-layered feed-forward networks, and each network contained 10 neurons.
Each of the networks was trained by using a scaled conjugate gradient algorithm with 70% training, 15% validation, and 15% test data.
Figure 9 shows the scatter plots of α, β, and γ: the x-axis shows the values of α, β, and γ found by fitting Equation (6), and the y-axis shows the α, β, and γ predictions from the feed-forward networks. Observe that the patch features predict α, β, and γ reasonably well, yielding Pearson correlation coefficients of 0.96, 0.97, and 0.97, respectively.
[Figure 9 scatter plots: (a) α (PCC: 0.96), (b) β (PCC: 0.97), (c) γ (PCC: 0.97); fitted values on the x-axis vs. feed-forward network predictions on the y-axis.]
Figure 9. Scatter plots of (a) α, (b) β, and (c) γ. The x-axis shows α, β, and γ found by fitting Equation (6); the y-axis shows the α, β, and γ predictions from the feed-forward networks.
The prediction process of the quantization parameter QP for an input image patch is thus summarized as follows:
• First, predict the distortion visibility (CT) using the committee of VNet-2 networks.
• Second, calculate four features of the patch: (i) Michelson contrast, (ii) RMS contrast, (iii) standard deviation of patch luminance, and (iv) y-axis intercept of the log-magnitude spectrum.
• Third, using the four features as inputs, predict α, β, and γ from the committees of feed-forward networks.
• Fourth, calculate log(Qstep) using Equation (6).
• Finally, calculate QP using

$$ QP = \max\left( \min\left( \mathrm{round}\left[ \frac{\log(Q_{\mathrm{step}})}{\log(2^{1/6})} + 4 \right], 51 \right), 0 \right). \qquad (7) $$

Furthermore, we have noticed that when CT < −40 dB the patches are very smooth, and thus can be coded with a higher QP. Thus, for both the CGC+SF model and the CNN model, QP was set to 30 for patches with CT < −40 dB.
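Putting the steps together, a sketch of the per-patch QP prediction is shown below. The committee objects and the feature extractor are placeholders for the trained predictors described above, not actual interfaces from the paper.

```python
import numpy as np

def predict_qp(patch, cnn_committee, coeff_committees, extract_features):
    """End-to-end QP prediction for one 64x64 luma patch (Section 3.4)."""
    # Step 1: distortion visibility threshold C_T (dB), averaged over the committee.
    C_T = float(np.mean([net(patch) for net in cnn_committee]))
    if C_T < -40.0:
        return 30  # very smooth patch: fixed QP, as described above
    # Steps 2-3: patch features -> quadratic coefficients alpha, beta, gamma.
    feats = extract_features(patch)
    alpha, beta, gamma = (float(np.mean([net(feats) for net in committee]))
                          for committee in coeff_committees)
    # Step 4: Eq. (6).
    log_qstep = alpha * C_T ** 2 + beta * C_T + gamma
    # Step 5: Eq. (7), clipped to the valid HEVC range [0, 51].
    qp = round(log_qstep / np.log(2 ** (1 / 6)) + 4)
    return int(max(min(qp, 51), 0))
```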
4. RESULTS
In this section we briefly present our preliminary results.
4.1 QP maps
Figure 10 shows the QP maps for two 512 × 512 images, generated from the CGC+SF and CNN models. Note that the QP maps show how bits are allocated during the encoding process. For example, the dark regions in the redwood tree (in the redwood image) and the bush near the wood log (in the log_seaside image) show higher QP values, denoting that those regions will be coded more coarsely than other regions.
[Figure 10 panels: QP maps from the CGC+SF model and the CNN model for the redwood and log_seaside images; colorbar range 0 to 50.]
Figure 10. The quantization parameter (QP) maps for two images, generated from the CGC+SF and CNN models.
4.2 Compression performance
Both the CGC+SF and CNN models take a 64 × 64-pixel image patch as input and predict the distortion visibility (CT) and the corresponding threshold QP. As a preliminary evaluation, we have used the QP maps to test the compression performance on sample images from the CSIQ dataset.34
While coding an HEVC coding tree unit, we searched for the minimum QP around the spatial location of that unit within the image and assigned that minimum QP to the unit. Figure 11 shows example images displaying the compression performance. Note that coding with a QP map essentially adds only near- or below-threshold distortions, and thus judging the quality of the images is quite difficult. However, to compare against reference HEVC, a visual matching experiment was performed by two experienced subjects. The purpose of the experiment was to find the QP at which the image coded by the reference software matches the quality of the image coded using the CGC+SF model. After the visually equivalent reference-coded image was found, the corresponding bits-per-pixel (bpp) value was recorded. In Figure 11, the first column shows the reference images. The second column shows the reference-software-coded images which are visually equivalent to the images coded using the CGC+SF map. The third column shows the images coded using the CGC+SF QP map, and the fourth column shows the images coded using the CNN QP map.
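A sketch of the minimum-QP assignment for a coding tree unit follows; the size of the spatial neighborhood searched around the unit is an assumption, since the paper does not specify it.

```python
import numpy as np

def ctu_qp(qp_map, row, col, radius=1):
    """Return the minimum QP in the QP map around the block location (row, col)
    of a coding tree unit. radius=1 searches a 3x3 block neighborhood; the
    neighborhood size actually used in the paper is not specified."""
    r0, r1 = max(row - radius, 0), min(row + radius + 1, qp_map.shape[0])
    c0, c1 = max(col - radius, 0), min(col + radius + 1, qp_map.shape[1])
    return int(qp_map[r0:r1, c0:c1].min())
```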
Although subjective experiments are more reliable for assessing quality when distortions are near or below threshold, we have reported the structural similarity index (SSIM)45 between the reference images and the coded images. Note that the qualities of the images are very close. The bpp values and compression gains are also shown below the images. On average, using the QP maps from the CGC+SF and CNN models yielded 15% and 11% gains, respectively. It should be noted that we generated the QP maps mainly from contrast masking and structural facilitation. Thus, if an image does not contain enough local regions to mask the distortions, using the QP map yields no gain.
4.3 Execution-time of QP map prediction
One of the reasons for introducing a CNN model after the CGC+SF model was that the CGC+SF model is too slow to be considered for real-time applications. The execution times of the two models were measured using a computer with an Intel(R) Core(TM)2 Quad CPU Q9400 2.67 GHz processor, 12 GB of RAM, and an NVIDIA Quadro K5000 graphics card, running Matlab 2014a. The execution times of the two models were measured while generating QP maps for 512 × 512-pixel images with 64 × 64 blocks. The execution times for the CGC+SF model and the CNN model were 239.83 ± 30.564 sec and 2.24 ± 0.29 sec, respectively. Thus, on average the CNN model was 106 times faster than the CGC+SF model.
[Figure 11 image grid: reference image | visually equivalent HEVC-coded image | coded using the QP map from the CGC+SF model | coded using the QP map from the CNN model. The SSIM, bpp, and compression gains reported below the images are:
Image 1: visually equivalent reference SSIM 0.94, bpp 3.01; CGC+SF SSIM 0.94, bpp 2.60, gain 13.7%; CNN SSIM 0.94, bpp 2.58, gain 14.3%.
Image 2: visually equivalent reference SSIM 0.89, bpp 2.13; CGC+SF SSIM 0.88, bpp 1.82, gain 14.9%; CNN SSIM 0.88, bpp 2.12, gain 0.5%.
Image 3: visually equivalent reference SSIM 0.96, bpp 2.43; CGC+SF SSIM 0.94, bpp 1.69, gain 18.3%; CNN SSIM 0.92, bpp 1.35, gain 35.2%.
Image 4: visually equivalent reference SSIM 0.96, bpp 2.49; CGC+SF SSIM 0.94, bpp 2.25, gain 9.6%; CNN SSIM 0.94, bpp 2.39, gain 3.9%.
Image 5: visually equivalent reference SSIM 0.82, bpp 2.12; CGC+SF SSIM 0.82, bpp 1.69, gain 20.2%; CNN SSIM 0.84, bpp 2.03, gain 4.3%.]
Figure 11. Compression performance obtained by using the QP maps generated from the CGC+SF and CNN models.
5. CONCLUSIONS AND FUTURE WORKS
This paper described a fast convolutional-neural-network based quantization strategy for HEVC. Local artifact
visibility was predicted via a network trained on data derived from an improved contrast gain control model. The
contrast gain control model was trained on our recent database of local distortion visibility in natural scenes.24
Furthermore, to capture the effects of recognizable structures on distortion visibility, a structural facilitation model was proposed and incorporated into the contrast gain control model. Our results provide on average an 11% improvement in compression efficiency for HEVC using the QP map derived from the CNN model, while requiring almost one hundredth of the computational time of the CGC+SF model. Future avenues of our work include computational-time analysis for predicting distortion visibility in real-time videos.
REFERENCES
[1] ITU-T Rec. H.265 | ISO/IEC 23008-2, "High efficiency video coding," recommendation and final draft international standard of joint video specification (Nov. 2013).
[2] Lubin, J., “Sarnoff jnd vision model: Algorithm description and testing,” Presentation by Jeffrey Lubin of
Sarnoff to T1A1 5 (1997).
[3] Winkler, S., “Vision models and quality metrics for image processing applications,” These de doctorat, École
Polytechnique Fédérale de Lausanne (2000).
[4] Watson, A. B., Hu, J., and McGowan, J. F., “Digital video quality metric based on human vision,” Journal
of Electronic imaging 10(1), 20–29 (2001).
[5] Watson, A. B., “Dctune: A technique for visual optimization of dct quantization matrices for individual
images,” in [Sid International Symposium Digest of Technical Papers], 24, 946–949, Society for Information
Display (1993).
[6] Chou, C.-H. and Li, Y.-C., "A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile," Circuits and Systems for Video Technology, IEEE Transactions on 5(6), 467–
476 (1995).
[7] Höntsch, I. and Karam, L. J., “Adaptive image coding with perceptual distortion control,” Image Processing,
IEEE Transactions on 11(3), 213–222 (2002).
[8] Chou, C.-H., Chen, C.-W., et al., “A perceptually optimized 3-d subband codec for video communication
over wireless channels,” Circuits and Systems for Video Technology, IEEE Transactions on 6(2), 143–156
(1996).
[9] Peterson, H. A., Peng, H., Morgan, J., and Pennebaker, W. B., “Quantization of color image components in
the dct domain,” in [Electronic Imaging’91, San Jose, CA], 210–222, International Society for Optics and
Photonics (1991).
[10] Ahumada Jr, A. J. and Peterson, H. A., “Luminance-model-based dct quantization for color image compression,” in [SPIE/IS&T 1992 Symposium on Electronic Imaging: Science and Technology], 365–374,
International Society for Optics and Photonics (1992).
[11] Zhang, X., Lin, W., and Xue, P., “Improved estimation for just-noticeable visual distortion,” Signal Processing 85(4), 795–808 (2005).
[12] Jia, Y., Lin, W., Kassim, A., et al., “Estimating just-noticeable distortion for video,” Circuits and Systems
for Video Technology, IEEE Transactions on 16(7), 820–829 (2006).
[13] Naccari, M. and Pereira, F., “Advanced h. 264/avc-based perceptual video coding: Architecture, tools, and
assessment,” Circuits and Systems for Video Technology, IEEE Transactions on 21(6), 766–782 (2011).
[14] Foley, J. M. and Boynton, G. M., “New model of human luminance pattern vision mechanisms: analysis of
the effects of pattern orientation, spatial phase, and temporal frequency,” in [Proceedings of SPIE], 2054,
32 (1994).
[15] Naccari, M. and Mrak, M., “Intensity dependent spatial quantization with application in hevc,” in [Multimedia and Expo (ICME), 2013 IEEE International Conference on], 1–6, IEEE (2013).
[16] Blasi, S. G., Mrak, M., and Izquierdo, E., “Masking of transformed intra-predicted blocks for high quality
image and video coding,” in [Image Processing (ICIP), 2014 IEEE International Conference on], 3160–3164,
IEEE (2014).
[17] Yu, Y., Zhou, Y., Wang, H., Tu, Q., and Men, A., “A perceptual quality metric based rate-quality optimization of h. 265/hevc,” in [Wireless Personal Multimedia Communications (WPMC), 2014 International
Symposium on], 12–17, IEEE (2014).
[18] Li, Y., Liao, W., Huang, J., He, D., and Chen, Z., “Saliency based perceptual hevc,” in [Multimedia and
Expo Workshops (ICMEW), 2014 IEEE International Conference on], 1–5, IEEE (2014).
[19] Caelli, T. and Moraglia, G., “On the detection of signals embedded in natural scenes,” Perception and
Psychophysics 39, 87–95 (1986).
[20] Nadenau, M. J., Reichel, J., and Kunt, M., “Performance comparison of masking models based on a new
psychovisual test method with natural scenery stimuli,” Signal processing: Image communication 17(10),
807–823 (2002).
[21] Watson, A. B., Borthwick, R., and Taylor, M., “Image quality and entropy masking,” Proceedings of
SPIE 3016 (1997).
[22] Chandler, D. M., Gaubatz, M. D., and Hemami, S. S., “A patch-based structural masking model with an
application to compression,” J. Image Video Process. 2009, 1–22 (2009).
[23] Chandler, D. M., Alam, M. M., and Phan, T. D., “Seven challenges for image quality research,” in
[IS&T/SPIE Electronic Imaging], 901402–901402, International Society for Optics and Photonics (2014).
[24] Alam, M. M., Vilankar, K. P., Field, D. J., and Chandler, D. M., "Local masking in natural images: A database and analysis," Journal of vision 14(8), 22 (2014). Database available at: http://vision.okstate.edu/masking/.
[25] Alam, M. M., Vilankar, K. P., and Chandler, D. M., “A database of local masking thresholds in natural images,” in [IS&T/SPIE Electronic Imaging ], 86510G–86510G, International Society for Optics and Photonics
(2013).
[26] Legge, G. E. and Foley, J. M., “Contrast masking in human vision,” J. of Opt. Soc. Am. 70, 1458–1470
(1980).
[27] Wei, Z. and Ngan, K. N., “Spatio-temporal just noticeable distortion profile for grey scale image/video in
dct domain,” Circuits and Systems for Video Technology, IEEE Transactions on 19(3), 337–346 (2009).
[28] Teo, P. C. and Heeger, D. J., “Perceptual image distortion,” Proceedings of SPIE 2179, 127–141 (1994).
[29] Foley, J. M., “Human luminance pattern-vision mechanisms: Masking experiments require a new model,”
J. of Opt. Soc. Am. A 11(6), 1710–1719 (1994).
[30] Lubin, J., “A visual discrimination model for imaging system design and evaluation,” in [Vision Models for
Target Detection and Recognition], Peli, E., ed., 245–283, World Scientific (1995).
[31] Watson, A. B. and Solomon, J. A., “A model of visual contrast gain control and pattern masking,” J. of
Opt. Soc. Am. A 14(9), 2379–2391 (1997).
[32] DeValois, R. L. and DeValois, K. K., [Spatial Vision ], Oxford University Press (1990).
[33] Daly, S. J., “Visible differences predictor: an algorithm for the assessment of image fidelity,” in [Digital
Images and Human Vision ], Watson, A. B., ed., 179–206 (1993).
[34] Larson, E. C. and Chandler, D. M., “Most apparent distortion: full-reference image quality assessment and
the role of strategy,” Journal of Electronic Imaging 19(1), 011006 (2010).
[35] Peli, E., “Contrast in complex images,” J. of Opt. Soc. Am. A 7, 2032–2040 (1990).
[36] Juan, C.-H. and Walsh, V., “Feedback to v1: a reverse hierarchy in vision,” Experimental Brain Research 150(2), 259–263 (2003).
[37] Vu, C. T., Phan, T. D., and Chandler, D. M., “S3: A spectral and spatial measure of local perceived
sharpness in natural images,” IEEE Transactions on Image Processing 21(3), 934–945 (2012).
[38] Costa, A. F., Humpire-Mamani, G., and Traina, A. J. M., “An efficient algorithm for fractal analysis of
textures,” in [Graphics, Patterns and Images (SIBGRAPI), 2012 25th SIBGRAPI Conference on], 39–46,
IEEE (2012).
[39] Phan, T. D., Shah, S. K., Chandler, D. M., and Sohoni, S., “Microarchitectural analysis of image quality
assessment algorithms,” Journal of Electronic Imaging 23(1), 013030 (2014).
[40] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE 86(11), 2278–2324 (1998).
[41] Alam, M. M., Patil, P., Hagan, M. T., and Chandler, D. M., “A computational model for predicting local
distortion visibility via convolutional neural network trained on natural scenes,” in [IEEE International
Conference on Image Processing (ICIP)], Accepted to be presented, IEEE (2015).
[42] Patil, P., “Flexible image recognition software toolbox (first),” in [MS Thesis ], Oklahoma State University,
Stillwater, OK (July, 2013).
[43] Riedmiller, M. and Braun, H., “A direct adaptive method for faster backpropagation learning: The rprop
algorithm,” in [Neural Networks, 1993., IEEE International Conference on], 586–591, IEEE (1993).
[44] Sze, V., Budagavi, M., and Sullivan, G. J., [High Efficiency Video Coding (HEVC)], Springer (2014).
[45] Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P., “Image quality assessment: from error visibility
to structural similarity,” Image Processing, IEEE Transactions on 13(4), 600–612 (2004).