Smooth Quality Streaming of Live Internet Video

Dimitrios Miras and Graham Knight
Dept. of Computer Science, University College London
Gower St., London WC1E 6BT
Email: {d.miras, g.knight}@cs.ucl.ac.uk
February 2004
Abstract
A live video stream, when encoded and transmitted over a congestion-controlled IP flow, experiences wide variations in quality due to changes in video content activity and bandwidth availability. As a
consequence, the perceived quality of the video can suffer frequent oscillations which are particularly
disturbing to the viewer. We therefore tackle the problem of accommodating the mismatch between the
available transmission rate and the encoding rate required for stable perceived quality. By utilizing a
reliable metric of perceived quality, we develop a technique for source rate control of real-time live video
that maintains a more uniform quality. The method comprises an artificial neural network to generate
predictions of the on-going quality and a fuzzy rate-quality controller that considers properties of human
perception of quality in order to provide smooth streaming quality. Experimental results indicate that
in the presence of sufficient buffering, the proposed adaptation technique can improve quality stability,
while maintaining TCP-friendly transmission.
1 Introduction
Encoding and transmitting live video over the Internet is subject to significant variations in quality. This
is attributed to the video content’s inherently varying spatio-temporal complexity: video scenes with low
spatial activity and motion are easier to encode with good quality, while on the other hand, complex
visual content and motion increase the distortion introduced by the encoder. Furthermore, in order to
avoid congestion in network resources and be fair to compliant flows, media streams need to employ
congestion control to determine their fair-share of bandwidth and adapt their transmission rate to match
it [3]. Unfortunately, confining the source rate to the TCP-friendly rate of the stream results in frequent
fluctuations in video quality. Such variability in quality is extremely annoying to the human viewer; users
prefer video with medium but stable quality to a video that oscillates between high and low quality [5].
Most work on smooth quality video in the literature concerns streaming of stored media rather than live video material. In such cases, the system has future frames at its disposal to perform encoding
optimizations like multiple-pass coding or efficient packet scheduling. Furthermore, measurement of the
video quality performance is usually limited to a few objective metrics like mean square error (MSE) and
peak signal-to-noise ratio (PSNR) (e.g., [27]). Work in [15] proposes smoothness criteria for layered streams
based on layer runs, defined as the number of consecutive frames in a layer. Notwithstanding the fact that
frequent oscillations in the number of transmitted layers result in variations in quality, the assumption that
layer smoothness coincides with quality smoothness cannot be substantiated. This problem is aggravated
by the fact that the algorithm presented works on layered CBR streams; it is known that CBR video
exhibits high quality variation. Kim and Ammar [13] extend this work to solve the problem of TCP-friendly streaming of layered FGS MPEG-4 video with minimum quality variation. Again, layer runs are used as an indication of smoothness, which retains the aforementioned disadvantages. Furthermore, the experiments simulated the transmission of high-rate streams (4 Mbps); such rates can support high quality video anyway and are not typical of what the majority of Internet users experience today.

Figure 1: Definition of the spatio-temporal (S-T) region: a block of horizontal width × vertical width × temporal width, spanning frames Fn to Fn+5.
This work presents a rate-quality adaptation method for live video that reduces fluctuations of quality.
Our method differs from other adaptation techniques that rely on inaccurate metrics to represent quality,
in its use of a realistic quality metric. Internal representations of the on-going quality are obtained
by an objective video quality metric [25], which has been shown to provide ratings highly correlated to human
judgements of quality. The basic implementation concepts of this metric are introduced in the next section.
In section 3 we introduce the proposed system architecture, describe related terminology and present the
main challenges that arise. We firstly recognise that applying an objective quality metric places a computational burden on the streaming server. Therefore, we develop a method for on-line estimation
of the ongoing quality, based on an artificial neural network (section 4), that yields accurate predictions
of the objective quality. We then present (section 5) a rate-quality controller based on the principles of
fuzzy logic that utilises predictions from the neural network and manipulates the encoding rate in order
to provide a smooth evolution of quality. Experimental results of the approach are presented therein.
2 Description of the objective quality metric
An important issue that arises with video adaptation is to understand its impact on video quality and
how to measure it. Pixel-error metrics (like MSE and PSNR) are widely used for this purpose, however,
they suffer a major drawback: they do not always correspond well with human judgements of quality [6],
especially at low-to-modest bit-rates (up to a few hundred Kbps). The main issue with MSE and PSNR
is that they cannot discriminate between impairments that humans can and cannot see or impairments
that are less or more annoying. Until recently, the most regarded method of measuring the quality of
digital video was by means of subjective quality assessment [11]. This method however, requires costly
and complex setups and therefore, it is not suitable for on-line quality monitoring. The answer to this
problem is the recent emergence of objective video quality metrics (VQM) (e.g., [8, 19, 25, 22, 24, 18]).
These are computational models that measure video quality in a way that preserves high correlation with human ratings of quality, by accounting for the type and magnitude of perceived distortions in the video
signal. Current research on objective quality metrics is at a considerably mature level and several models
are under evaluation and approaching standardisation [20].
We implemented and used the ITS VQM [25], proposed by the Institute for Telecommunication Sciences, to measure perceived quality. This metric attained significant performance during a recent evaluation by the Video Quality Experts Group (VQEG) [21].

Figure 2: Components of the smooth quality adaptation framework (TCP-friendly congestion control, content feature extraction, ANN quality predictor, fuzzy rate-quality controller/smoother, and send/receive buffer monitor) and the interactions between the involved modules.

Its algorithm is based on the extraction
and statistical summarization of scalar, spatio-temporal features from the original and degraded video
frames, to obtain a single measure of perceived distortion. Summarization of these features occurs within
spatio-temporal (S-T) regions, usually 8 × 8 pixels × 6 frames (Figure 1). For each S-T region, features
from both the original and distorted frames are compared using functions that resemble human perception
and visual masking, to obtain measures that quantify the level of perceptual distortions present (tiling,
blurring, motion jerkiness, etc.). The calculated measures of perceptual impairments from each S-T region
are then pooled for the temporal duration of the S-T region (6 frames) and summarised by averaging the
worst 5% of measured distortions (this reflects the fact that subjective quality is primarily determined
by the worst quality during the observation period). A single score of perceptual distortion for the whole
video sequence (typically 8-10 sec long) is obtained by averaging the measured impairments of every 6-frame evaluation period (which we call the S-T period) over the duration of the clip. Score values are in the
[0, 1] range, with zero corresponding to a sequence with imperceptible impairments and one to a heavily
distorted video. If D(t) is the S-T period perceptual distortion value produced by the ITS VQM at time
index t, then the value 1 − D(t) can be considered as a reasonable representation of the instantaneous
quality for the purposes of realtime quality monitoring. We call this value S-T period quality. S-T period
quality scores are scaled to the [0,100] range (with 0 representing unacceptable quality and 100 perfect
quality).
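As a rough illustration of this pooling, the following Python sketch (our own simplification with illustrative inputs; it is not the ITS VQM implementation) maps the per-region distortion measures of one S-T period onto the [0, 100] quality scale:

```python
import numpy as np

def st_period_quality(region_distortions):
    """Pool the perceptual distortion measures of all S-T regions in one
    6-frame period: average the worst 5% (largest) distortions, then map
    the resulting D in [0, 1] to a quality score 100 * (1 - D)."""
    d = np.sort(np.asarray(region_distortions, dtype=float))
    cut = int(np.ceil(0.95 * len(d)))
    worst = d[cut:] if cut < len(d) else d[-1:]   # worst 5% of the regions
    return 100.0 * (1.0 - worst.mean())           # 0 = unacceptable, 100 = perfect

# Example: 120 S-T regions with distortions drawn in [0, 0.4]
rng = np.random.default_rng(0)
print(st_period_quality(rng.uniform(0.0, 0.4, size=120)))
```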
3 Problem formulation and proposed architecture
Achieving real-time video streaming with consistent quality requires a method that manipulates the source
rate so that more bits are allocated to scenes or frames with high spatial and temporal energy. In other
words, the problem is how to control the encoding rate, denoted Renc , in the presence of a variable and
unknown available transmission rate (the TCP-friendly rate of the stream, denoted Rtcpf ), so that the
resulting target quality Qtarget , is a smoothed alternative of the quality that the encoder would have
produced if the video rate was set to Rtcpf (denoted Qtcpf ). To enable a quality-based rate adaptation,
a method that associates an encoding bit-rate with the resulting encoding quality in real time is required.
Then, appropriate target quality values can be continuously chosen for successive S-T periods. A smoother
quality would mean that at times Qtarget is higher than Qtcpf , while at other times it is lower. Consequently,
a similar relationship would occur between Renc and Rtcpf. Note that the sender is always transmitting
at its TCP-friendly rate, therefore, mismatches between the two rates are accommodated using a sender
and a receiver buffer. Hence, the system has to maintain buffer stability at the same time.
The ITS VQM can be used to obtain continuous S-T period quality scores as described in section 2. Doing so, however, requires encoding and decoding of the S-T period frames at several candidate bit-rates, and the subsequent application of the metric. This approach is prohibitive in terms of real-time
performance, a strict requirement of live video coding. In order to bypass this time-consuming process,
our system utilises an artificial neural network (ANN) to automatically generate accurate predictions of
the continuous S-T period quality scores when presented with details of the content features of the input
frames and a target encoding rate. S-T quality scores obtained from our implementation of the ITS VQM
are used to train the ANN. The details of the ANN quality predictor are presented in section 4.
Figure 2 illustrates the architecture and the components of the proposed system. A companion congestion control module is periodically sampled to elicit the nominal transmission bit-rate of the stream
(Rtcpf ). Although the proposed system is not bound to a specific transmission control policy, we assume
TCP-friendly congestion control [4]. The video encoder receives video frames from a live video source
(camera, satellite feed, etc.) with the task of producing a compressed bitstream. Every S-T period t,
summary content statistics of video features are extracted (cf. section 4) from a small number (six) of
consecutive frames. Based on content features statistics, that reflect the complexity of the underlying
visual content, and the current nominal transmission rate Rtcpf (t), the neural network generates a prediction of the resulting quality, Qtcpf (t). The sampling of the TCP-friendly rate and the estimation of the
continuous quality scores are therefore carried out at a period equal to the duration of the S-T period (i.e.,
every 6 frames, or 200 ms for a 30 frames-per-second input video). This period is an effective tradeoff between a suitable granularity of network adaptation (the network adaptation timescale is dictated by the frequency of incoming acknowledgements of the TCP-friendly protocol) and the duration of the quality evaluation period of
the ITS VQM. It also minimises the additional delay and buffering requirement at the sender. Finally, a
fuzzy rate-quality controller receives successive values of Qtcpf and an estimate of the sender and receiver
buffer sizes to determine a value for Qtarget that achieves the desired encoding quality and maintains the
stability of both send and receive buffers. The function of the controller is to locate, by further invocations
of the ANN, the encoding bit-rate, Renc , that approximates Qtarget . We discuss why a controller based
on the principles of fuzzy logic is a good choice for our system, and present how it determines the target
quality Qtarget , in section 5.
4 Neural network quality predictor
An Artificial Neural Network (ANN) is a general, practical form of machine learning that provides a robust approach to approximating real, discrete or vector target functions, and learns to interpret complex real-world data. When suitably trained, ANNs can provide accurate estimation of the output(s) based on a
selection of inputs, efficiently predict non-linear relationships among multidimensional data and support
a general paradigm to deal with complex mathematical functions. Extensive research in this area has
resulted in a multitude of approaches to neural network computing; we limit our discussion to the very basic principles that govern an ANN and to the most popular type of ANN, the multi-layer perceptron with
error back-propagation [7, 17], which is the one used in this work.
The basic building block of a neural network is an elementary neuron, or perceptron (Figure 3).

Figure 3: The structure of the basic ANN component – a neuron or perceptron – and a feedforward neural network with n inputs, one hidden layer with m neurons, and one output layer. WL is the weights matrix for layer L.
Each input vector xT = [x1, x2, ..., xn] is weighted with an appropriate weight wi, which defines the contribution of input xi to the perceptron's output α. The sum of the weighted inputs together with a bias b is passed through a differentiable transfer function f to produce the output of the neuron, or activation:

α = f( ∑_{i=1}^{n} wi xi + b )
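The activation and the layered structure of Figure 3 translate directly into code; the sketch below (illustrative, with random parameters) computes a single perceptron output and the forward pass of a one-hidden-layer network with a tangent-sigmoid hidden layer and a linear output, the architecture used later in section 4.1:

```python
import numpy as np

def perceptron(x, w, b, f=np.tanh):
    """Single neuron: alpha = f(sum_i w_i * x_i + b)."""
    return f(np.dot(w, x) + b)

def forward(x, W1, b1, W2, b2):
    """Feedforward pass: tangent-sigmoid hidden layer, linear output layer."""
    hidden = np.tanh(W1 @ x + b1)   # activations of the m hidden neurons
    return W2 @ hidden + b2         # predicted response ~y

# Example with n = 4 inputs, m = 3 hidden neurons and one output
rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)
print(perceptron(x, w=np.ones(4), b=0.0))
print(forward(x, W1, b1, W2, b2))
```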
Layers of several perceptrons can be combined to form a multi-layer feedforward network (Figure 3).
Feedforward networks often have one or more hidden layers of non-linear neurons. Multiple layers of
perceptrons with non-linear transfer functions, such as the log-sigmoid, logsig(x) = 1/(1 + βe^{−x}), or the tangent-sigmoid, tansig(x) = (e^x − e^{−x})/(e^x + e^{−x}), allow the network to learn linear and non-linear relationships between inputs and output(s), without a priori assumption
of a specific model form. The function of a neural network is to determine suitable values for a set of
adjustable parameters, like the weights and biases at every layer and neuron, by performing an iterative
procedure, called training or learning, on the set of train samples. These adjustable parameters are given
random initial values, and the training process consists of two steps per iteration. For a set of training
input vectors with a known response y, a forward pass calculates all the activations at every neuron to
generate a predicted response ỹ. Then, a backpropagation step is used to adjust all the weights of the
neural network based on the magnitude of the error between the predicted and actual output
E^k = ½ (y^k − ỹ^k)²    (1)

E = ∑_{k=1}^{K} E^k    (2)
where E^k is the network error for training pattern x^k and K is the total number of training patterns. The error cost measure in expression (1) is commonly used for its simplicity; it represents the
deviation of the network’s output from the ideal. The task of the training is to find the weights and biases
that minimise E. This iterative procedure with newly optimised parameters is repeated until an acceptably low error is achieved. Several algorithms have been proposed to adjust the weights at every iteration of the training phase, and gradient descent is probably the most popular [2]. Essentially, this method
performs iterative steps in the weight space, proportional to the negative gradient of the cost function E
to update the weights
wij = wij + ∆wij,    ∆wij = −η ∂E/∂wij,    ∂E/∂wij = ∑_{∀k} ∂E^k/∂wij
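For concreteness, a minimal sketch of one such update for the two-layer network is given below (a single training pattern, tanh hidden units and a linear output are assumed; this illustrates the rule above and is not our exact training code):

```python
import numpy as np

def train_step(x, y, W1, b1, W2, b2, eta=0.01):
    """One gradient-descent step w <- w - eta * dE/dw for the two-layer
    network, with E = 0.5 * (y - ~y)^2 for a single pattern (equation 1)."""
    h = np.tanh(W1 @ x + b1)             # forward pass: hidden activations
    y_hat = W2 @ h + b2                  # linear output ~y
    err = y_hat - y                      # dE/d(~y) = -(y - ~y)
    dW2, db2 = np.outer(err, h), err     # output-layer gradients
    dh = (W2.T @ err) * (1.0 - h**2)     # backpropagate through tanh'
    dW1, db1 = np.outer(dh, x), dh       # hidden-layer gradients
    return W1 - eta*dW1, b1 - eta*db1, W2 - eta*dW2, b2 - eta*db2
```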
where η is the step size parameter, usually called the learning rate. An ANN is therefore an optimisation
technique that attempts to locate the minimum of a multidimensional error surface, which usually includes
several local minima. A neural network might not always find the absolute minimum, but an acceptable
local minimum close to it. After the training phase, the ANN can be validated for its generalisation
capability, by comparing its output with the actual (expected) values, where the input data come from
a set of (unknown during the training phase) samples, called the test set. A usual problem that occurs
during the training process is over-fitting. The error on the training set may be reduced to a very small
value, but when presented with new, unknown test patterns, the network has poor performance (large
prediction error) because it has almost memorized the training samples. The tendency for over-fitting
increases with the network size, but the best network size is difficult to determine beforehand. Early stopping is a technique very often used to stop the training process before the
network starts to over-fit. In this method, the available data for training can be split into a training set
and a monitoring set, and the error of the monitoring set is also inspected during training. While at
the beginning both the training and monitoring errors decrease, when the network begins to over-fit the
training data, the monitoring error will start to increase. If this increase continues for a specific number
of iterations, the training process is stopped and the ANN parameters (weights and biases) that presented
the minimum monitoring error are retained.
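In outline, early stopping amounts to the following loop (a sketch; train_one_epoch and monitoring_error are assumed helper functions, not part of our system):

```python
def train_with_early_stopping(train_one_epoch, monitoring_error, params,
                              patience=50, max_epochs=10000):
    """Stop training once the monitoring-set error has not improved for
    `patience` consecutive epochs; return the parameters (weights and
    biases) that achieved the minimum monitoring error."""
    best_err, best_params, since_best = float("inf"), params, 0
    for _ in range(max_epochs):
        params = train_one_epoch(params)     # one pass over the training set
        err = monitoring_error(params)       # error on the monitoring set
        if err < best_err:
            best_err, best_params, since_best = err, params, 0
        else:
            since_best += 1
            if since_best >= patience:       # monitoring error keeps rising
                break
    return best_params
```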
The motivation behind the use of neural networks is that the encoding quality of video depends primarily on the source rate and the level of spatial activity and motion in the video scene (under the reasonable assumptions of a fixed picture size (resolution) and video codec). Therefore,
the ANN model operates on visual content descriptors that are extracted from the input video frames
during the encoding process, on an S-T period basis, and directly yields objective quality scores that are highly correlated with those the ITS VQM would have produced. The function that
maps content feature vectors into objective quality ratings is learned by training the neural network. For
the (off-line) training process, continuous objective quality scores are obtained by directly using the ITS
VQM. The ANN method does not rely on the availability of the distorted version of video frames during
its real-time operation. Quality predictions are sought based only on features that are extracted from the
original input frames. The main challenge of the ANN design is the extraction of appropriate features from
the visual content. These features should (i) adequately represent most of the spatio-temporal activity of
video content and (ii), since realtime performance is important, they should be calculated as part of the
normal operation of the encoder, so that no significant overhead occurs.
Keeping in mind the requirement of real-time processing, a set of content features, summarised in
Table 1, are extracted from every original frame within the S-T period. Four of these features measure
texture complexity: the pixel activity, PelAct, defined as the standard deviation of luminance pixels in
each (8 × 8) block averaged over the number of blocks in the frame, and the spatial spread of pixel
activity, PelActSpread, defined as the deviation of block-PelAct values over the frame. Similar features
are calculated to measure the edge activity within a frame. Edges convey significant visual information,
reveal texture, and are more susceptible to certain encoding impairments in comparison to flat regions
of the image (e.g., blurring distorts the intensity of edges). From a human visual system point of view,
spatial and texture masking are sensitive to the intensity of areas with edge activity. To determine the edge
activity, we calculate the magnitude of pixel gradients in each block, by applying a Sobel filter (gradient
operator) at each pixel value:
magn(∇pi,j ) = |pi−1,j−1 + 2pi−1,j + pi−1,j+1 − pi+1,j−1 − 2pi+1,j − pi+1,j+1 | + |pi−1,j−1 + 2pi,j−1 +
pi+1,j−1 − pi−1,j+1 − 2pi,j+1 − pi+1,j+1 |,
where pi,j is the luminance value of the pixel at row i and column j in the frame’s pixel grid. The
edge activity, EdgeAct is the standard deviation of magn(∇pi,j ) values in every block, averaged over the
number of frame blocks. The spread of edge activity, EdgeActSpread, is calculated similarly to PelActSpread.
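These texture features can be computed with a few array operations; the sketch below (an illustration under our reading of the feature definitions; frame borders are cropped for simplicity) derives EdgeAct and EdgeActSpread from a luminance frame:

```python
import numpy as np

def edge_activity(frame, block=8):
    """Sobel gradient magnitude |Gx| + |Gy| per pixel (as in the formula
    above), standard deviation of the magnitudes per 8x8 block, then the
    mean (EdgeAct) and deviation (EdgeActSpread) of the block values."""
    p = frame.astype(float)
    gx = np.abs(p[:-2, :-2] + 2*p[:-2, 1:-1] + p[:-2, 2:]
                - p[2:, :-2] - 2*p[2:, 1:-1] - p[2:, 2:])
    gy = np.abs(p[:-2, :-2] + 2*p[1:-1, :-2] + p[2:, :-2]
                - p[:-2, 2:] - 2*p[1:-1, 2:] - p[2:, 2:])
    magn = gx + gy
    h = (magn.shape[0] // block) * block
    w = (magn.shape[1] // block) * block
    blocks = magn[:h, :w].reshape(h // block, block, w // block, block)
    per_block = blocks.std(axis=(1, 3))       # deviation of magnitudes per block
    return per_block.mean(), per_block.std()  # EdgeAct, EdgeActSpread

frame = np.random.default_rng(2).integers(0, 256, size=(288, 352))  # CIF luma
print(edge_activity(frame))
```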
Motion related features are also extracted with the aim of covering the range of motion attributes. The sum
of absolute pixel differences, soad, is a measure of pixel change between the current (motion-estimated)
frame and its reference frame. The average magnitude of the motion vectors (MV) over the whole frame,
MVMagn, and the spatial variance of the MVs magnitude, MVMagnVar, are also calculated. To locate
frames where strong motion in portions of the image may lead to localised impairments, the average
magnitude of MVs is also measured for each of the four spatial quadrants of the frame, resulting in four
additional features, MVMagnUL, MVMagnUR, MVMagnLL, and MVMagnLR. The ratio of the motion
estimated macroblocks (MB) over the total number of MBs, MERatio, is also calculated as a representative
measure of the coding efficiency of the motion estimation process. Motion complexity, MotCompl, is
calculated as follows: motion vectors are classified according to the dominant axis of the vector (up,
down, left, right, none), and the variance of this five-bin histogram is taken. A uniform histogram of the
directional MVs reveals more complex motion throughout the frame. MotDirChange represents changes
in the motion direction, and is formed by subtracting the MVs of successive motion estimated blocks, and
averaging over the number of macroblocks in the frame:
MotDirChange = (1/M) ∑_i ‖mvF(i) − mvF′(i)‖,

where F′ represents the reference frame of frame F used for motion estimation. MotAccel captures the change in motion speed (acceleration), again averaged over the number of MBs:

MotAccel = (1/M) ∑_i (‖mvF(i)‖ − ‖mvF′(i)‖).
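Given the per-macroblock motion vectors, these descriptors reduce to simple vector statistics, as the sketch below illustrates (the dominant-axis binning convention is our assumption; the text only specifies the five classes up, down, left, right, none):

```python
import numpy as np

def motion_features(mv_cur, mv_ref):
    """Motion descriptors from (M, 2) arrays of (dx, dy) motion vectors of
    the current frame and its reference frame."""
    mag_cur = np.linalg.norm(mv_cur, axis=1)
    mv_magn, mv_magn_var = mag_cur.mean(), mag_cur.var()  # MVMagn, MVMagnVar
    # MotCompl: five-bin histogram of dominant MV directions, then variance
    bins = np.zeros(5)
    for dx, dy in mv_cur:
        if dx == 0 and dy == 0:
            bins[4] += 1                       # none
        elif abs(dx) >= abs(dy):
            bins[0 if dx > 0 else 1] += 1      # right / left
        else:
            bins[2 if dy > 0 else 3] += 1      # down / up
    mot_compl = bins.var()
    # MotDirChange and MotAccel, averaged over the M macroblocks
    mot_dir_change = np.linalg.norm(mv_cur - mv_ref, axis=1).mean()
    mot_accel = (mag_cur - np.linalg.norm(mv_ref, axis=1)).mean()
    return mv_magn, mv_magn_var, mot_compl, mot_dir_change, mot_accel

rng = np.random.default_rng(3)
mv_c, mv_r = rng.integers(-4, 5, (396, 2)), rng.integers(-4, 5, (396, 2))  # CIF: 396 MBs
print(motion_features(mv_c, mv_r))
```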
Descriptive statistics of these features are then calculated over the 6-frame period to obtain content
feature descriptors. These summary statistics are the mean, median, standard deviation, minimum and maximum values, and the 5th, 25th, 75th and 95th percentiles. In total, one hundred and thirty-five (135) content
feature descriptors are gathered per S-T activity period. We modified an H.263+ video codec to perform
the feature extraction process and the calculation of their descriptive statistics (this process can be applied,
with minimal modifications, to any other hybrid DCT-based codec that employs motion estimation).
Table 1: Content features extracted from the original video frames

  Content Feature                 Description
  PelAct                          Pixel activity averaged over all blocks
  PelActSpread                    Deviation of pixel activity over all blocks
  EdgeAct                         Edge activity averaged over all blocks
  EdgeActSpread                   Deviation of edge activity over all blocks
  soad                            Sum of abs. pixel differences between adjacent frames
  MVMagn                          Magnitude of motion vectors
  MVMagnVar                       Spatial variance of motion vector magnitudes
  MVMagnLL, MVMagnLR,             Magnitude of motion vectors per quadrant – lower left & right,
  MVMagnUL, MVMagnUR              upper left & right
  MERatio                         The ratio of motion-estimated MBs in the frame
  MotCompl                        Motion complexity (variance of the directional motion vector histogram)
  MotDirChange                    Change of motion direction between adjacent frames
  MotAccel                        Acceleration of motion between adjacent frames

4.1 Neural network architecture and prediction performance

The ANN architecture comprises a two-layer feedforward network with backpropagation, with one hidden layer of nh neurons and a non-linear (tangent-sigmoid) transfer function, and a linear output layer. In
order to reduce the size of the input vectors that train the neural network, remove redundancies present
among the original 135 inputs, and retain those variables that are relevant to the model (thus improving both training time and generalisation performance), a data dimensionality reduction process precedes the
ANN training process. The first step applies Principal Component Analysis (PCA) on the input data
matrix. Principal Component Analysis [12] is a data dimensionality reduction scheme, which is very often
used in neural networks. This data compression technique extracts characteristic features from the data
whilst minimizing the information loss. The basic principle of PCA is the representation of the data by a
reduced set of orthonormal unit vectors (eigenvectors). The eigenvectors are aligned with the directions of greatest data variance, so that the residual distances from the data points to the vectors' axes are minimized across the full data set. PCA is applied to the training input vectors (the calibration matrix). Therefore, if Fn×m
is the calibration matrix and Pm×m is the principal component transformation matrix, the transformed
calibration set of patterns is the (n × m) matrix F′ = F × P. Note that the same transformation has to be applied to the set of test patterns as well, using the same transformation matrix P derived from the calibration matrix. Usually, most of the data variance can be explained using the first few principal components (PCs) of F′. While it is difficult to assess the significance of the original input variables to
the model, it becomes much easier to do so when the input data are preconditioned with PCA. Input
features that are relevant to the model can be derived through a stepwise, trial and error method. For
example, with stepwise addition, one may start with a small initial set of inputs (the first few PCs), and add one new variable at a time until a satisfactory monitoring or prediction error is achieved. This carries the risk
that the method may stop with selected input variables P C1 , ..., P Cm , but some important information
to the model may also be contained in input P Cn , n > m. With stepwise elimination, a deliberately
large subset of initial variables is chosen, and variables are subsequently removed until the monitoring
or prediction error improves no longer. The selection process of the appropriate input variables can be
improved if the relevance of each variable to the model, called its sensitivity, can be estimated. We use a
two-variance-based approach for variable sensitivity determination, proposed in [1]. This method is based
on the estimation of the individual contribution of each input variable to the variance of the predicted
response of the neural network (which can be derived from the trained neural network when all the input variables, except the one under consideration, are set to zero). Once all sensitivities are estimated,
the variable with the lowest sensitivity is tentatively removed and the ANN is retrained. If the monitoring
error decreases, the variable is deemed irrelevant to the model and is removed; otherwise, it is restored and the process continues with the next variable. At the end of this process, a subset of the initial input features forms the new input feature set. For the data set used here, this stepwise elimination retained a total of eighteen (18) input variables.

Figure 4: A series of neural network predictions together with the actual values obtained from the ITS VQM. The prediction residual (absolute error) is also shown.
A large collection of video scenes was selected to test the proposed neural network model, featuring
a wide range of content, camera actions (static, panning, zooming, fades, etc.), and various levels of
scene activity. Video frames were extracted from action movies (The Matrix, Terminator, X-Men), sports
(extended football clips from the English Premiership) and also several short video clips from the VQEG
web site [20]. In total, the test video library contained approximately 39,000 frames (6,500 S-T periods).
From the set of 6,500 patterns, 80% were randomly chosen as the training set and the remaining 20% as the validation (test) set. One fourth of the training samples formed the monitoring set, and the rest constituted the actual training set. All sequences in the video library were encoded at several rates, ranging from 100 Kbps to 2 Mbps (with a step of 100 Kbps). Multiple neural networks were then trained, each corresponding
to a distinct encoding rate in the selected range. We performed the sensitivity analysis and the stepwise
elimination of input variables for various configurations of hidden layer neurons. This analysis showed that the value of nh does not significantly affect the prediction performance of the ANN; nevertheless, for the
specific data set, a network with nh = 18 produced the smallest monitoring error.
We investigated the ANN prediction ability when presented with unknown input patterns.

Figure 5: Actual quality scores vs. ANN predictions for the test set (400 Kbps).

To clearly
visualise that the neural network predictions closely follow the actual S-T quality scores, we plot in
Figure 4 a 100-sample subset of the ANN outputs together with the corresponding expected scores.
The bottom line on the same graph corresponds to the absolute error between the actual score and the
ANN prediction. Figure 5 shows, for the test set of features (approx. 1300 input patterns, encoding rate: 400 Kbps), the actual objective S-T quality scores obtained from the ITS VQM plotted against the corresponding outputs of the neural network. The neural network achieved significant prediction accuracy: the Pearson correlation between the predicted and expected responses was as high as 0.901, and the mean of
the absolute residual error was 4.20 with a standard deviation of 3.54. Similar generalisation performance
was obtained for the various encoding bit-rates in the 100-2000 Kbps range used in our experiments, as shown in Figure 6 (the prediction error of the ANN is even smaller at higher bit-rates, because quality values were usually at the high end of the scale, allowing the neural network to learn better).
4.1.1 Examination of additional overhead
The on-line quality predictor introduces two additional processing modules in the live-streaming system:
the extraction and statistical manipulation of content features inside the video codec, and the invocation
of the neural network quality predictor.
The overhead that the feature extraction process and the statistical manipulation of the data add to the video encoder is not significant. Most of the chosen features, like pixel activity, soad and motion vectors, apart from the edge energy, are calculated as part of the encoding process, namely during motion estimation, so no additional
delay occurs. Features like complexity of motion, acceleration and direction of motion are computed from
the values of the motion vectors using simple statistics. Calculating the edge activity in the frame adds a slight overhead (the Sobel gradient involves twelve additions per pixel; moreover, this feature can be removed from the feature vector with only a small loss in the prediction accuracy of the ANN). The rest of the processing cost involves the statistical summarisation of both the frame-level features (mean and standard deviation over the frame) and the S-T period level content descriptors. In total, the additional overhead is on average less than 15 ms per 6-frame activity period for CIF-size frames on a 2.2 GHz processor, and therefore does not affect real-time performance. Similarly, the overhead of the neural network is negligible: an ANN may require a significant amount of time to train, but computing a response involves only a small number of operations on the input vector.

Figure 6: ANN prediction error at different encoding bit-rates. Error bars extend to the 5 and 95 percentiles.
5 Estimation of encoding rate
We can achieve a more stable target quality Qtarget by smoothing out transient increases and especially
drops of Qtcpf . At the same time, Qtarget has to be responsive to consistent changes in Qtcpf . As Qtarget
deviates from Qtcpf , so does the encoding rate Renc in relation to the transmission rate Rtcpf . While
mismatches between the source and channel rates can be alleviated by the sender and receiver buffers in
the short term, Qtarget has to follow the ’trend’ of Qtcpf in the longer-term. The basis of the approach is
to calculate the target quality value as a moving average (MA) of Qtcpf:

Qtarget(t) = α · Qtarget(t − 1) + (1 − α) · Qtcpf(t),    α ∈ [0, 1]
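For reference, the recursion behaves as follows for a fixed α (a toy illustration; in the actual controller α is re-chosen every S-T period, as described next):

```python
def smooth_quality(q_tcpf_series, alpha=0.9):
    """Qtarget(t) = alpha * Qtarget(t-1) + (1 - alpha) * Qtcpf(t)."""
    q_target = q_tcpf_series[0]                # initialise from the first score
    smoothed = []
    for q_tcpf in q_tcpf_series:
        q_target = alpha * q_target + (1 - alpha) * q_tcpf
        smoothed.append(round(q_target, 1))
    return smoothed

# Transient drops and spikes in Qtcpf are heavily damped in Qtarget
print(smooth_quality([60, 40, 70, 65, 30, 55]))
```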
for successive S-T periods t. MA predictors are quite simple but the main design difficulty is the choice
of the weight α. Given that in practice the variation of Qtcpf is unknown, setting α to a high value successfully eliminates large variations but lacks responsiveness and compromises the stability of the buffers, while a small value fails to reduce variations. The desired approach is to determine α on-line,
according to changes in Qtcpf and the status of the two buffers. We introduce a fuzzy logic controller [14]
to dynamically calculate appropriate values for α. Fuzzy logic was introduced by Zadeh [26] to describe
vagueness in system behaviour, where variables or parameters do not exhibit exclusive set membership,
but a gradual transition between states (or, a grade of membership). In our case, a fuzzy controller is
useful because, while it is difficult to describe the system's behaviour analytically (the output rate of the encoder can hardly be characterised accurately and the transmission rate is unknown beforehand), we know qualitatively what the behaviour of the system should be.
Figure 7: Membership functions of all fuzzy sets for the controller's inputs (error and buflev) and output (parameter α).
With respect to the sender and receiver buffer sizes, the difference between the encoding and transmission rates has the following effects: if Renc is higher than Rtcpf then the sender buffer size increases, if
equal, it remains unchanged, otherwise it decreases. Similarly, the receiver buffer fills, remains unchanged
or empties, when the transmission rate is higher, equal or lower than the video data playout rate. There
are two inputs to the fuzzy controller: error and buflev, and one output, the EWMA parameter α.
The input error is associated with the change in the value of Qtcpf (the quality error) between successive S-T periods: ∆Qtcpf = Qtcpf(t) − Qtcpf(t − 1). This change represents the level of short-term variation in Qtcpf that we would like to curtail (error is scaled to the [−1, 1] range using error = ∆Qtcpf/20, since the majority of ∆Qtcpf values are confined to the [−20, 20] range, as established through quality experiments with numerous video clips). The input buflev ∈ [0, 1] reveals how buffered data is distributed between
the send and receive buffers, and is defined as buflev = Br/(Br + Bs), where Bs, Br are the sender and
receiver buffer sizes respectively. The value of this variable is a convenient way to establish whether the
sender or receiver buffer level is low. If the receive buffer runs low (Br → 0) then buflev → 0, while if the send buffer approaches underflow levels (Bs → 0), buflev → 1. Therefore, the system can determine at
any time how video data is distributed between the two buffers, monitor if any buffer runs at low levels
and react accordingly. Note that, for low packet loss rates, the sender can quite accurately track the size
of the receiver buffer at any time, by continuously updating a buffer_sz variable:

buffer_sz(t) = buffer_sz(t − 1) + (Rtcpf(t) − Renc(t)) · T

where T is the duration of an adaptation period. Packet loss can, however, contaminate the accuracy of
this estimate. Alternatively, a method that lets the receiver feedback this information back to the sender
is preferable.
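In code, this bookkeeping is a one-line recursion per adaptation period (a sketch with our own variable names):

```python
def update_receiver_buffer(buffer_sz, r_tcpf, r_enc, T=0.2):
    """Sender-side estimate of the receiver buffer occupancy (bits):
    buffer_sz(t) = buffer_sz(t-1) + (Rtcpf(t) - Renc(t)) * T, with T the
    adaptation period (0.2 s for 6 frames at 30 fps). Packet loss biases
    this estimate, which is why receiver feedback is preferable."""
    return max(0.0, buffer_sz + (r_tcpf - r_enc) * T)

def buflev(B_r, B_s):
    """Distribution of buffered data: values near 0 mean the receiver
    buffer runs low, values near 1 mean the sender buffer does."""
    return B_r / (B_r + B_s)
```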
We define five gradations for the fuzzy input error (linguistic values that error takes on): large negative
(neglarge), small negative (negsmall), zero, positive small (possmall), and positive large (poslarge).
The buflev variable takes on three linguistic values: low, medium and high. These gradations are
enough to describe the different states of both buffers; adding further gradations does not present any
obvious advantage and introduces unnecessary complexity. A fuzzy value 'low' means that the receiver buffer runs low, a 'high' value that the sender buffer's level is low, while a 'medium' value means that there is enough data distributed, more or less evenly, between the two buffers. Finally, the gradations of the
controller's output α are represented by three linguistic variables: small, medium and large. We use
standard trapezoidal fuzzy sets and the corresponding membership functions are shown in Figure 7.
The last step in the design of the fuzzy controller is the definition of the rules that govern its operation.
The controller favours a fuzzy large α in order to preserve a stable evolution of Qtarget. However, when buflev is fuzzy low and there is a fuzzy negative error, α needs to take a smaller value to avoid a receiver buffer underflow. Qtarget then gets close to Qtcpf, thus avoiding a Renc value that is much higher
than the transmission rate Rtcpf. Analogous rules are employed when error is fuzzy positive and buflev
is high, to avert a sender buffer underflow. With this approach to quality control, preference is given to preventing the target quality from dropping to low values. This is in accordance with subjective experiments that reveal that quality during a time interval is primarily determined by the worst impairment observed, and that drops in quality have a greater negative impact than the positive effect of an equal-sized quality
increase [25, 9]. Using the above guidelines, the complete set of control rules of the fuzzy controller is defined as follows (a sketch of how such rules can be evaluated is given after the list):
1. if error is neglarge and buflev is low then α is small
2. if error is neglarge and buflev is medium then α is large
3. if error is neglarge and buflev is high then α is large
4. if error is negsmall and buflev is low then α is medium
5. if error is negsmall and buflev is medium then α is large
6. if error is negsmall and buflev is high then α is large
7. if error is zero and buflev is low then α is large
8. if error is zero and buflev is medium then α is large
9. if error is zero and buflev is high then α is large
10. if error is possmall and buflev is low then α is large
11. if error is possmall and buflev is medium then α is large
12. if error is possmall and buflev is high then α is medium
13. if error is poslarge and buflev is low then α is large
14. if error is poslarge and buflev is medium then α is large
15. if error is poslarge and buflev is high then α is small
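The sketch below shows how such a rule base can be evaluated (a standard Mamdani-style evaluation with min as the fuzzy AND and weighted-average defuzzification; the membership breakpoints are illustrative and do not reproduce the exact sets of Figure 7):

```python
def trap(x, a, b, c, d):
    """Trapezoidal membership function with plateau [b, c] and feet a, d."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Illustrative fuzzy sets; error is the scaled quality change in [-1, 1]
ERROR_SETS = {
    "neglarge": (-1.0, -1.0, -0.5, -0.2), "negsmall": (-0.5, -0.3, -0.1, 0.0),
    "zero": (-0.2, 0.0, 0.0, 0.2), "possmall": (0.0, 0.1, 0.3, 0.5),
    "poslarge": (0.2, 0.5, 1.0, 1.0)}
BUFLEV_SETS = {
    "low": (0.0, 0.0, 0.2, 0.4), "medium": (0.2, 0.4, 0.6, 0.8),
    "high": (0.6, 0.8, 1.0, 1.0)}
ALPHA_VALUE = {"small": 0.4, "medium": 0.7, "large": 0.92}  # output centroids

# The fifteen control rules listed above: (error set, buflev set) -> alpha set
RULES = {
    ("neglarge", "low"): "small",    ("neglarge", "medium"): "large",
    ("neglarge", "high"): "large",   ("negsmall", "low"): "medium",
    ("negsmall", "medium"): "large", ("negsmall", "high"): "large",
    ("zero", "low"): "large",        ("zero", "medium"): "large",
    ("zero", "high"): "large",       ("possmall", "low"): "large",
    ("possmall", "medium"): "large", ("possmall", "high"): "medium",
    ("poslarge", "low"): "large",    ("poslarge", "medium"): "large",
    ("poslarge", "high"): "small"}

def fuzzy_alpha(error, buflev):
    """Fire every rule (min as the fuzzy AND) and defuzzify alpha as the
    firing-strength-weighted average of the output set centroids."""
    num = den = 0.0
    for (e_set, b_set), a_set in RULES.items():
        w = min(trap(error, *ERROR_SETS[e_set]), trap(buflev, *BUFLEV_SETS[b_set]))
        num += w * ALPHA_VALUE[a_set]
        den += w
    return num / den if den > 0 else ALPHA_VALUE["large"]

# Large quality drop while the receiver buffer runs low -> small alpha
print(fuzzy_alpha(error=-0.8, buflev=0.1))
```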
5.1 Performance of the quality controller
As discussed in section 4.1, the ANN is trained to provide predictions at discrete operating bit-rates R0, R1, ..., RN. By performing a small number of invocations of the ANN, the rate controller performs the simple task of finding i ∈ {0, ..., N − 1} such that QRi ≤ Qtarget ≤ QRi+1. The overhead
of this process is insignificant. Assuming that QR is an increasing function of R, Renc is then found by
interpolating between Ri and Ri+1 . To avoid Renc getting much higher than Rtcpf , causing the receiver
buffer to drain quickly, we allow it to increase relative to the instantaneous receiver buffer occupancy, i.e.,
up to ratio = 1 + buflev times Rtcpf during the same period (so ratio ∈ [1, 2]).
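A simplified sketch of this rate selection is given below (for clarity it evaluates the ANN at every operating rate, whereas the controller needs only a few invocations; ann_quality is an assumed stand-in for the trained per-rate predictors):

```python
def encoding_rate(q_target, r_tcpf, buflev, ann_quality, rates):
    """Interpolate Renc between the two operating bit-rates whose predicted
    qualities bracket Qtarget, then cap it at ratio = 1 + buflev times the
    TCP-friendly rate (so ratio is in [1, 2])."""
    q = [ann_quality(r) for r in rates]          # assumed increasing in R
    if q_target <= q[0]:
        r_enc = rates[0]
    elif q_target >= q[-1]:
        r_enc = rates[-1]
    else:
        i = next(k for k in range(len(q) - 1) if q[k] <= q_target <= q[k + 1])
        frac = (q_target - q[i]) / (q[i + 1] - q[i])
        r_enc = rates[i] + frac * (rates[i + 1] - rates[i])
    return min(r_enc, (1.0 + buflev) * r_tcpf)

rates = list(range(100, 2001, 100))              # Kbps, as in section 4.1
print(encoding_rate(72.0, r_tcpf=450.0, buflev=0.5,
                    ann_quality=lambda r: 100 * r / (r + 300), rates=rates))
```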
We investigated the ability of the fuzzy controller to (i) provide a smooth encoding quality and (ii)
avoid starvation of the sender and receiver buffers. In a simulated transmission scenario using ns-2 [16],
8000 video frames (≈ 260 sec) from an action movie (The Matrix ) were transmitted using a TCP-friendly
flow (TFRC) [4]. The test video sequence contained scenes with various levels of content activity. The
simulation topology was a typical dumbbell network with bottleneck bandwidth set to 10Mbps and delay
to 20 ms. To create realistic bandwidth variation, a number of background ON/OFF CBR flows, with ON and OFF times drawn from a Pareto distribution [23], also traversed the bottleneck link. The mean
ON and OFF times were 1 sec and 2 sec respectively. The number of CBR flows was chosen so that the bandwidth available to the TFRC sender was in the range of the encoding bit-rates R0 and RN. The parameters of the experiment were as follows: R0 = 100 Kbps, RN = 2000 Kbps, and the initial buffer build-up delay was 10 sec (i.e., the first 5 sec are used to fill the sender buffer and the remaining 5 sec the receiver buffer). Note that the sender buffer delay is usually not visible to the receiver application, since a 'broadcast delay' is commonly used in this kind of service.

Figure 8: Improvement of quality stability. The proposed system achieves a smoother ongoing quality (Qtarget) than encoding at the TCP-friendly rate (Qtcpf).
Figure 8 illustrates the continuous quality for the two approaches: (i) when the encoding rate for each
6-frame region is determined by the TCP-friendly bandwidth share of the flow (Qtcpf), and (ii) when
the encoding rate is obtained from the proposed fuzzy quality controller (Qtarget ). We observe that the
target quality determined by the fuzzy controller follows the trend of Qtcpf , a necessary condition for
transmission stability, but at the same time, it exhibits significantly less variation as the controller avoids
driving Qtarget to considerably low and high values.
We define the magnitude of quality change between successive S-T periods, ∆Q = |Q(t) − Q(t − 1)|,
as a metric of quality smoothness. High ∆Q values indicate significant quality variation, while low ∆Qs
suggest a stable on-going quality. Figure 9 shows the distributions of |∆Qtcpf |, |∆Qtarget | and |∆Qactual |.
Qactual represents the quality that is finally achieved by the encoder. The side-by-side box-plots extend to
the minimum and maximum values of the observations, with the horizontal lines indicating the 0.25, 0.5
and 0.75 quantiles. We observe that the proposed system attains a more stable quality, while respecting its
TCP-friendly transmission rate (readers are encouraged to verify these results by viewing the reconstructed
video sequences as well as further results with other test sequences in [10]). Notice that the distributions
of |∆Qactual | are slightly ’flatter’ than those of |∆Qtarget |. Since the encoding rate is always restricted to
at most ratio times the available transmission rate, the encoder cannot always achieve as high a quality as targeted by the controller, hence at times Qactual ≠ Qtarget.
Section 5 described how the fuzzy controller is designed to avoid buffer underflow situations. Sender
buffer underflows are caused by the buffer fill rate (Renc ) being consistently lower than the buffer drain rate
(the transmission rate Rtcpf ). Given that uncompressed frames are produced at a constant rate (e.g., 30
fps), the encoder cannot retrieve frames faster than this rate. On the other hand, receiver buffer underflow
happens when the receive buffer fill rate (Rtcpf ) is lower than the rate at which video data is consumed by
the decoder (this rate is Renc(t − P), where P is the playout delay).

Figure 9: Distribution of |∆Qtcpf |, |∆Qtarget | and |∆Qactual | for different initial buffering times. Notice the difference in the range of the y-axes.

Figure 10 plots the evolution of sender and receiver buffer sizes for different initial buffering
times. These results indicate that the system is resilient to buffer underflows; under reasonable initial
buffering delays there are no buffer under-runs. Occasional receiver buffer underflows occur when the
initial buffering delay is very low (e.g., at around time 90 sec for 3 sec of initial receiver buffering). As
expected, the smaller the initial delay, the weaker is the capacity of both buffers to accommodate the
mismatches between the encoding and transmission rates. As a result, the controller reacts by decreasing
the control parameter α, reducing at the same time the smoothness of the encoding quality, as evidenced by higher ∆Qactual values in Figure 9. However, the quality smoothing gain is significant even for low initial
buffer sizes.
6 Conclusion
We presented a method for smooth quality source rate adaptation of streamed live video. The method
uses a realistic metric of perceived quality and relies on the generalisation properties of an artificial neural
network to obtain accurate quality scores based on the candidate bit-rate and content features of the
video scene. A fuzzy rate-quality controller is proposed that manipulates the encoding rate based on
the quality of the recent past and the state of the sender and playout buffer to achieve stable streaming
quality. Experimental results, as well as viewing of the produced video sequences [10], demonstrate that the proposed method reduces variation in quality while adhering to the constraints imposed by the TCP-friendly rate. The proposed method tackles source rate-quality control of live video; error resilience
techniques (FEC, re-transmission, etc.) can be incrementally built on top to provide protection from
packet loss.
Figure 10: Evolution of sender (top) and receiver (bottom) buffer sizes for different initial buffering times.
References
[1] F. Despagne and D.L. Massart, Variable selection for neural networks in multivariate calibration,
Chemometrics & Intelligent Laboratory Systems 40 (1998), 145–163.
[2] R. Fletcher, Practical methods of optimization, Unconstrained Optimization, vol. 1, John Wiley &
Sons, New York, 1980.
[3] S. Floyd and K. Fall, Promoting the use of end-to-end congestion control in the internet, IEEE/ACM
Transactions on Networking 7 (1999), no. 4, 458–472.
[4] S. Floyd, M. Handley, J. Padhye, and J. Widmer, Equation-based congestion control for unicast
applications, ACM SIGCOMM ’00 (Stockholm, Sweden), August 2000, pp. 43–56.
[5] B. Girod, Psychovisual aspects of image communication, Signal Processing 28 (1992), no. 3, 239–251.
[6] B. Girod, What's wrong with mean-squared error?, Digital Images and Human Vision (A.B. Watson, ed.), MIT Press, Cambridge, MA, USA, 1993, pp. 207–220.
[7] M. T. Hagan, H. B. Demuth, and M. H. Beale, Neural network design, PWS Publishing Company,
Boston, MA, 1996.
[8] T. Hamada, S. Miyaji, and S. Matsumoto, Picture quality assessment system by three-layered bottom-up noise weighting considering human visual perception, SMPTE Journal 108 (1999), no. 1, 20–26.
[9] D. Hands and S.E. Avons, Recency and duration neglect in subjective assessment of television picture
quality, Applied Cognitive Psychology 15 (2001), 639–657.
[10] http://www.cs.ucl.ac.uk/staff/d.miras/QAVideo/, 2004.
[11] International Telecommunications Union, ITU-T Recommendation BT.500-11, Methodology for the
subjective assessment of the quality of television pictures, 2002.
[12] I.T. Jolliffe, Principal component analysis, Springer-Verlag New York Inc., October 2002.
[13] T. Kim and M. H. Ammar, Optimal quality adaptation for MPEG-4 fine-grained scalable video, Proc.
of IEEE Infocom 2003 (San Francisco, CA, USA), March 2003.
[14] J. M. Mendel, Fuzzy logic systems for engineering: A tutorial, Proceedings of IEEE 83 (1995), no. 3,
345–377.
[15] S. Nelakuditi, R. R. Harinath, E. Kusmierek, and Z.-L. Zhang, Providing smoother quality layered
video stream, 10th Intl. Workshop on Network and Operating System Support for Digital Audio and
Video (NOSSDAV) (Chapel Hill, North Carolina, USA), June 2000.
[16] ns–2 Network Simulator, 1998, http://www-mash.cs.berkeley.edu/ns.
[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error
propagation, Parallel Data Processing (D. Rumelhart and J. McClelland, eds.), vol. 1, MIT Press,
Cambridge, MA, 1986, pp. 318–362.
[18] K. T. Tan and M. Ghanbari, A multi-metric objective picture-quality measurement model for MPEG
video, IEEE Transactions on Circuits and Systems for Video Technology 10 (2000), no. 7, 1208–1213.
[19] The Alliance for Telecommunications Industry Solutions (ATIS), Objective perceptual quality measurement using a JND-based full reference technique, Tech. Report T1.TR.PP.75-2001, October 2001.
[20] The Video Quality Experts Group, VQEG, http://www.VQEG.org.
[21] The Video Quality Experts Group, Draft final report from the Video Quality Experts Group on the validation of objective models of video quality assessment, Phase II, Version 4, July 2003.
[22] C. J. van den Branden Lambrecht and O. Verscheure, Perceptual quality measure using a spatio-temporal model of the human visual system, Proceedings of Digital Video Compression: Algorithms
and Technologies (San Jose, CA), 1996, pp. 450–461.
[23] W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson, Self-similarity through high-variability:
statistical analysis of Ethernet LAN traffic at source level, ACM SIGCOMM ’95 (Cambridge, MA),
August 1995.
[24] S. Winkler, A perceptual distortion metric for digital color video, Proceedings of SPIE Human Vision
and Electronic Imaging (San Jose, CA), vol. 3644, 1999, pp. 175–184.
[25] S. Wolf and M. Pinson, Spatial-temporal distortion metrics for in-service quality monitoring of any
digital video system, Proceedings of SPIE International Symposium on Voice, Video and Data Communications (Boston, USA), September 1999, pp. 175–184.
[26] L. H. Zadeh, Outline of a new approach to the analysis of complex systems and decision processes,
IEEE Transactions on Systems, Man & Cybernetics 3 (1973), no. 1, 28–44.
[27] X. M. Zhang, A. Vetro, Y. Q. Shi, and H. Sun, Constant quality constrained rate allocation for
FGS-coded video, IEEE Transactions on Circuits and Systems for Video Technology 13 (2003), no. 2,
121–130.
A Additional results

We present additional results showing the performance of the fuzzy controller on two further video sequences: an 8000-frame excerpt from the action movie Terminator and a 5400-frame sports clip with scenes from an English Premiership football match.
Figure 11: Distribution of |∆Qtcpf |, |∆Qtarget | and |∆Qactual | for different initial buffering times (sequence: Terminator).
Figure 12: Evolution of sender (left) and receiver (right) buffer sizes for different initial buffering times (sequence: Terminator).
Figure 13: Distribution of |∆Qtcpf |, |∆Qtarget | and |∆Qactual | for different initial buffering times (sequence: Football).
Figure 14: Evolution of sender (left) and receiver (right) buffer sizes for different initial buffering times (sequence: Football).