MPEG Encoding on the Raw Microprocessor
by
Douglas J. DeAngelis
B.S., University of Maine 1988
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2006
© Douglas J. DeAngelis. All rights reserved.
The author hereby grants to MIT permission to reproduce and
distribute publicly paper and electronic copies of this thesis document
in whole or in part.
Author ......
        Department of Electrical Engineering and Computer Science
        September 14, 2005

Certified by ......
        Anant Agarwal
        Professor of Electrical Engineering and Computer Science
        Thesis Supervisor

Accepted by ......
        Smith
        Chairman, Department Committee on Graduate Students
MPEG Encoding on the Raw Microprocessor
by
Douglas J. DeAngelis
Submitted to the Department of Electrical Engineering and Computer Science
on September 14, 2005, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
This work investigates the real-world application of MPEG-2 encoding on the Raw
microprocessor. Much of the work involves partitioning the problem so that linear
speedup can be achieved with respect to both number of Raw tiles and size of the
video image being encoded. These speedups are confirmed through the use of real
Raw hardware (up to 16 tiles) and the Raw simulator (up to 64 tiles). It will also
be shown that processing power can be allocated at runtime, allowing a Raw system
of the future to dynamically assign any number of tiles to the encoding task. In
the process both the advantages and the limitations of the Raw architecture will be
discussed.
Thesis Supervisor: Anant Agarwal
Title: Professor of Electrical Engineering and Computer Science
Acknowledgments
The author would like to thank Hank Hoffman for his oversight of this work and his
gentle encouragement to press on in the darkest hours. Hank was responsible for the
initial port of the MPEG encoder to Raw which was the initial code base for this
research, as well as the enhancements to motion estimation that greatly improved the
system performance.
Maria Sensale deserves mention for her ability to extract the most obscure papers
in electronic form, thereby saving untold visits to the reading room. Forza Italia!
Thanks to everyone in the Raw group for bringing to life a very cool computer.
Special thanks to Anant Agarwal for encouraging the author to leave in 1992 and
encouraging him to return in 2005.
Lastly the author wishes to acknowledge his lovely wife Kim who put up with the
whole process while carrying our first child.
Contents

1 Introduction
  1.1 Contributions
  1.2 Organization

2 The MPEG Standard
  2.1 MPEG Encoding
  2.2 MPEG Syntax

3 Related Work

4 MPEG on the Raw Microprocessor
  4.1 The Raw Microprocessor
  4.2 Porting MPEG to Raw

5 Parallelizing MPEG on Raw
  5.1 A Parallel Encoder
  5.2 Conversion to Arbitrary Slice Size

6 Partitioning MPEG on Raw

7 Making Communication Efficient
  7.1 Improvements to memcpy
  7.2 Improvements to sendrecv
      7.2.1 Full Routing and Elemental Routes in Raw
      7.2.2 Border Routing
      7.2.3 Border Routing on Larger Raw Machines
  7.3 Alternate Macroblock Allocation
  7.4 Suggested Improvements to Raw

8 Experimental Results
  8.1 Phase Speedups
  8.2 Absolute Performance and Speedup
  8.3 Performance Across Sample Files
  8.4 Effect of Arbitrary Slice Size on Image Quality
  8.5 Comparing Hardware and Simulated Results

9 Conclusion
  9.1 Further Work
List of Figures

2-1  An MPEG encoder.
2-2  MPEG syntax.
4-1  The Raw Microprocessor.
4-2  352x240 pixel video frame mapped to a 15 tile Raw machine.
5-1  A parallel MPEG encoder.
5-2  Individual tile activity for a baseline encoding.
5-3  Slice map for 352x240 video on 16 tiles.
5-4  Individual tile activity after parallelizing putpic.
7-1  Bordering macroblocks required for motion estimation.
7-2  Example routes for Tile 0 sending data to all other tiles.
7-3  Possible routes for sending on Raw.
7-4  Possible routes for receiving from the East on Raw.
7-5  Determination of the receiver opcode when Border Routing.
7-6  Example routes for Tile 3 sending to Tiles 2, 4 and 5.
7-7  720x480 pixel video frame mapped to a 64 tile Raw machine.
7-8  352x240 pixel video frame mapped to a 16 tile Raw machine using a more square
     macroblock allocation scheme.
8-1  The video sequences chips360 and chips720.
8-2  The video sequence dhl.
8-3  Speedup of individual phases of the MPEG algorithm for a 352x240 sequence on 16 tiles.
8-4  Speedup and absolute performance for a 352x240 sequence on 16 tiles using hardware.
8-5  Speedup and absolute performance for a 720x480 sequence on 16 tiles using hardware.
8-6  Speedup and absolute performance for a 352x240 sequence on 64 tiles using the simulator.
8-7  Speedup and absolute performance for a 720x480 sequence on 64 tiles using the simulator.
8-8  Comparison of speedups for the three sample sequences.
8-9  Comparison of a 720x480 encoding on the simulator and the hardware.
List of Tables

6.1  Macroblock partitioning for a variety of available tiles and problem sizes.
7.1  Number of macroblocks required on each tile in full frame memcpy compared to
     memcpy with border data.
7.2  Partial border routing map for 352x240 video on 16 tiles.
7.3  Comparison of the number of network hops required to full route and border route
     352x240 video on 16 tiles.
7.4  Partial border routing map for 720x480 video on 64 tiles.
7.5  Partial border routing map for 720x480 video on 64 tiles with biased opcode.
7.6  Comparison of the number of network hops required to full route and border route
     720x480 video on 64 tiles.
7.7  Comparison of the number of network hops required to route 352x240 video on 16
     tiles when using a more square macroblock allocation scheme.
8.1  Characteristics of the video files used for testing the MPEG encoder.
8.2  Frame rates for encoding various sequence sizes on various numbers of tiles.
8.3  Comparison of SNR and file size for a 352x240 encoding using the baseline code and
     the parallel putpic code.
8.4  Calculating simulation correction factors for each phase of the MPEG algorithm.
Chapter 1
Introduction
As multimedia applications continue to demand greater performance on smaller platforms, the need for a general purpose multiprocessor that is also well suited to high-bandwidth real-time stream processing is increasing. The Raw microprocessor [23] is
an example of just such a processor. The prototype Raw chip has 16 processors (called
tiles) arranged in a 4x4 grid and connected to each other via four 32-bit full-duplex
networks. With its ability to simultaneously move and operate on 128 bytes of data
every clock cycle, Raw is ideally suited to a variety of multimedia tasks.
The goal of this research is to explore the adaptation of a particular real-world
application to the Raw architecture. We chose MPEG-2 encoding for a number of
reasons:

* There was an existing public domain code base [18] for the single processor case,
  allowing us to spend time focusing on how best to adapt the problem to Raw.

* Unlike other more "traditional" benchmarks like specint and matmult, it is
  relatively easy to explain the application in layman's terms. By the end of this
  thesis we will, in fact, be able to give very real numbers on just how many tiles
  of a Raw chip would be required to encode DVD quality video in real-time.

* As the foundation technology behind DVDs and videoconferencing, it offers a
  wide variety of avenues for further exploration.
1.1 Contributions
In the most general sense this thesis is intended to demonstrate that a general purpose
tiled architecture microprocessor such as Raw can exhibit performance characteristics previously only associated with custom architectures.
It will also demonstrate
that this performance can be extracted using well understood programming
toolsets (such as Gnu C), as opposed to requiring the programmer to have an intimate
knowledge of the underlying architecture.
More specifically, through the course of this work we hope to achieve the following
goals:

* A parallel implementation of the MPEG-2 encoding algorithm that overcomes
  the major sequential dependency of MPEG while exploiting the Raw architecture
  to achieve the best baseline performance possible.

* A well partitioned implementation of the MPEG-2 encoding algorithm that
  allows an input data stream of arbitrary frame dimension to be mapped to an
  arbitrary number of tiles at runtime.

* A scalable implementation of the MPEG-2 encoding algorithm that exhibits
  linear speedups with the addition of tiles as well as linear speedup with the
  reduction in video frame dimensions.
1.2 Organization
The rest of this thesis is organized as follows. Chapter 2 gives an overview of the
MPEG algorithm and MPEG encoding. Chapter 3 discusses previous methodologies
for parallelizing the MPEG algorithm. Chapter 4 is a brief introduction to the Raw
microprocessor. Chapters 5, 6 and 7 outline the implementation of MPEG encoding
on Raw. These chapters form the core of the thesis and address in turn the three
fundamental goals of parallel, well partitioned and scalable. Chapter 8 goes on to
demonstrate that these goals were met through experimentation.
Chapter 2
The MPEG Standard
The MPEG standard [9] describes a lossy compression algorithm for moving pictures.
The MPEG-1 standard was intended to support applications with a continuous transfer bitrate of around 1.5 Mbps, matching the capability of early CD-ROMs. MPEG-2
uses the same basic compression methodology, but adds functionality to support interlaced video (such as NTSC), much higher bitrates and a number of other
features not particularly relevant to this research. MPEG-2 is backward compatible
with MPEG-1 [7].
2.1 MPEG Encoding
MPEG video encoding is similar to JPEG still picture encoding in that it uses the
Discrete Cosine Transform (DCT) to exploit spatial redundancy. A video encoder
which stopped there would generally be termed an intraframe coder. MPEG enhances
intraframe coding by employing an initial interframe coding phase known as motion
compensation which exploits the temporal redundancy between frames. As a result,
there are three types of picture in an MPEG data stream. The first type, known as the
intra (I) picture, is fully coded, much like a JPEG still image. The compression ratio
is low, but this picture is independent of any other pictures and can therefore act as
an entry point into the data stream for purposes of random access. The second type
of picture is the predicted (P) picture, which is formed using motion compensation
15
from a previous I or P picture. The final type is the bidirectional (B) picture, which is
formed using motion compensation from either previous or succeeding I or P pictures.
In a typical MPEG data stream, the compression ratio of a P picture might be 3x
that of an I picture, and that of a B picture might be 10x that of an I
picture.
However, just encoding more B pictures will not necessarily imply better
compression since temporal redundancy will decrease as the B picture gets further
from its reference I or P picture.
For this reason, we use a value of N=15 (the
number of frames between successive I pictures) and M=3 (the number of frames
between successive I/P pictures) in this research.
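As a small illustration of the N = 15, M = 3 structure just described, the C sketch below prints one common display-order picture-type pattern for a single GOP. It is an illustration only and is not drawn from the encoder's code base.

    #include <stdio.h>

    /* Print one GOP's picture types in display order for the parameters used
     * in this research (N = 15 pictures per GOP, M = 3 pictures between
     * successive I/P anchors).  Illustrative sketch only. */
    int main(void)
    {
        const int N = 15, M = 3;
        for (int i = 0; i < N; i++) {
            char type = (i == 0) ? 'I' : (i % M == 0) ? 'P' : 'B';
            printf("%c ", type);
        }
        printf("\n");   /* I B B P B B P B B P B B P B B */
        return 0;
    }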
A block diagram of an MPEG-2 encoder can be seen in Figure 2-1. The input
video is preprocessed into a luminance-chrominance color space (we use YUV where
Y is the luminance component and U and V are chrominance components).
If the
particular picture will not be an I-picture, it enters the Motion Estimation phase,
where motion vectors are calculated to determine locations in previous or succeeding
pictures that best match each area of the image. The Motion Compensation phase
then uses these vectors in combination with a frame store to create a predicted frame.
In the case of P or B pictures, it is the prediction error and not the actual picture
itself that enters the DCT phase. The output of the DCT stage is then quantized to
a variable level based on the bit budget for the output stream and fed into a variable
length coder (VLC) which acts as a final lossless compression stage. The quantized
output is also reversed to create a reconstructed picture just as the decoder will see
it for purposes of future motion compensation.
2.2 MPEG Syntax
The MPEG bitstream is broken down into hierarchical layers as shown in Figure 2-2 [11].
At the top is a sequence layer consisting of one or more groups of pictures. These
groups of pictures (GOP) are just that: collections of I, P and B pictures in display
order. Each GOP must have at least one I picture and as stated earlier, our reference
GOP size is 15 pictures.

Figure 2-1: An MPEG encoder. (Block diagram: input frame buffer, motion estimation, motion compensation, DCT, quantize, VLC, rate control, inverse quantize, inverse DCT, reconstructed frame buffer.)
Each picture is composed of slices which can be as small as a single macroblock
but must begin and end on the same row[9]. Slices are primarily intended to allow
smoother error recovery since the decoder can drop a slice without dropping the whole
picture.
Slices are composed of macroblocks. Macroblocks are considered the minimum
coded unit and in our case always consist of four 8x8 blocks of luminance (Y) and one 8x8
block each of the two chrominance values (U and V). The chroma values are subsampled
in both the horizontal and vertical dimensions during preprocessing to create what
is called a 4:2:0 macroblock.
Since a macroblock covers a 16x16 pixel area of the
image, the horizontal and vertical dimensions of the picture are padded to be an even
multiple of 16. As one example, a picture which is 352x240 pixels would contain
22x15 or 330 macroblocks.

Figure 2-2: MPEG syntax. (Hierarchy: sequence layer, group of pictures layer, picture layer, slice layer, macroblock layer, 8x8 block.)
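As a quick check of the padding arithmetic above, the following C sketch rounds a frame's dimensions up to the 16-pixel macroblock grid and counts macroblocks. It is illustrative only and not taken from the encoder.

    #include <stdio.h>

    /* Round a pixel dimension up to a whole number of 16-pixel macroblocks. */
    static int mb_count(int pixels) { return (pixels + 15) / 16; }

    int main(void)
    {
        int width = 352, height = 240;
        int mb_w = mb_count(width);   /* 22 */
        int mb_h = mb_count(height);  /* 15 */
        printf("%dx%d pixels -> %dx%d = %d macroblocks\n",
               width, height, mb_w, mb_h, mb_w * mb_h);   /* 330 */
        return 0;
    }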
The encoder implemented as part of this research was not intended to support all
features of the MPEG standard, although nothing in the methodology would preclude
a full implementation. It was simply more important to go deep into exploring the
parallelism that could be extracted than it was to have a code base that perfectly met
all the functionality of the standard. Specifically, the MPEG-2 encoder described in
this research:

* Only works with progressive (non-interlaced) video input sources

* Is only intended to create Main Profile (4:2:0) bitstreams (e.g., DVD standard)
The encoder is not limited in the horizontal and vertical resolution of the input
18
video sequence, however, as this is an important part of investigating the scalability
of the encoder.
Chapter 3
Related Work
Much work has been done in the area of creating parallel algorithms for MPEG
encoding on general purpose hardware. The vast majority of this work until recently
[25, 17, 21, 10, 1, 4, 8, 14, 19] has assumed that the entire video sequence is available
in advance of the encoding. This is often termed off-line encoding, and offers many
avenues for exploiting parallelism that focus around pre-processing of the sequence
and allocation of data at a coarse (typically GOP) level. The goal of these methods
is to reduce or remove the interdependencies inherent in the MPEG algorithm, and
as a result reduce or remove the corresponding high-latency communication between
processing elements. The bulk of these efforts targeted a network of workstations,
although some [21, 1] used a MIMD architecture such as the Intel Paragon and another
specifically targeted shared memory multiprocessors [14].
A variant of off-line encoding is what could be called on-line with constant delay
[22]. Encoding methods are similar to off-line, but there exists a constant delay in the
output sequence relative to the input sequence (on the order of 10 to 30 seconds in
this study). If this delay is acceptable, then the encoding can appear to be happening
in real-time.
Our focus, however, is on true real-time on-line encoding, where individual frames
arrive at the encoder at a constant rate (typically on the order of 30 frames/sec) and
each one is presented at the output before the next frame arrives. In this environment,
it is not possible to exploit any temporal parallelism [22]. Obvious applications are in
videoconferencing or live digital video broadcasting or even narrowcasting (e.g., using
a webcam or cell phone). An early effort at this problem [3, 15] used a data-parallel
approach for MIMD and network of workstation architectures. This method achieved
an impressive (for the day) 30 frames per second on a 330 processor Intel Paragon when
using a simplified motion estimation algorithm on 352x288 data, but did not scale
well. More recent work [5] has attempted to address the problem of on-line encoding
by modeling the fine-grain parallelism in the MPEG-4 algorithm, but does not yet
offer any real world performance data for their methods.
Kodaka [16] and Iwata [11] both describe MPEG encoding algorithms that appear
to have the potential to achieve real-time on-line encoding on general purpose single
chip multiprocessors.
Unfortunately, neither of these solutions addresses the need to
process the output bitstream sequentially, a problem which we will address in detail
in Chapter 5, nor do they offer much in the way of experimental results on real
hardware.
One recent project that takes a different approach is the Stanford Imagine stream
processor [13]. This processor contains 48 FPUs and is intended to be as power efficient as a special purpose processor while retaining some level of programmability for
streaming media applications. So while it is not specifically tuned to the MPEG algorithm, it is also not a general purpose processor. Early work predicted that 720x48
pixel frames would encode at 105 frames per second. Recent experimental work [2]
found that the prototype hardware was capable of 360x288 pixel frames at 138 frames
per second and made no mention of performance on 720x480 data. The reader should
be warned, however, that comparing absolute performance between different studies
can be misleading as there are many parameters in any MPEG encoder, particularly
the method and search window for motion estimation, that can dramatically improve
or degrade performance.
Analog Devices recently described different methodologies for using their Blackfin
dual-core processor for MPEG encoding [20]. The implementations are highly
architecturally dependent, however, and not particularly scalable or generalized.
With Raw we have the advantage of very small latencies between general purpose
processing elements.
If we can successfully combine this with a methodology that
minimizes shared data and breaks the dependencies in the MPEG algorithm, we can
expose the spatial parallelism in the encoding of each individual MPEG frame without
needing to exploit any temporal parallelism that would limit the application space.
In this thesis we will demonstrate that real-time on-line encoding is possible on a
single-chip general purpose multiprocessor such as Raw.
Chapter 4
MPEG on the Raw Microprocessor
This chapter gives an overview of the Raw microprocessor and discusses the existing
condition of the code base when work began on this thesis.
4.1 The Raw Microprocessor
The Raw microprocessor is a single-chip tiled-architecture computational fabric with
enormous on-chip and off-chip communications bandwidth [23]. The prototype Raw
chip divides 122 million transistors into 16 identical tiles, each tile consisting of:
* an eight-stage single-issue RISC processor;

* a four-stage pipelined FPU;

* a static network router (routes specified at compile time);

* two dynamic network routers (routes specified at runtime);

* a 32KB data cache; and

* a 96KB instruction cache.
The architectural overview in Figure 4-1 shows how the tiles are connected by
four 32-bit full-duplex networks [24]. The cleverness in Raw is that each tile is sized to
match the wire delay of a signal traveling across the tile. Since each tile only connects
to its (at most) four neighbors, this means that the clock speed for the chip can scale
alongside the individual tile without regard to the number of tiles in a future version
of the chip.
Figure 4-1: The Raw Microprocessor.
4.2 Porting MPEG to Raw
When work on this thesis began, Hank Hoffman had managed to successfully port the
MPEG code to the Raw hardware. This initial code base, which will be referred to
as the baseline code in this thesis, was able to take advantage of multiple tiles and
created a valid MPEG-2 bitstream. It still had a number of scalability shortcomings,
however, that would be addressed as part of this research.
First, while the baseline code was able to execute most of the algorithm in parallel
(most importantly motion estimation), when it reached the stage of creating the output
bitstream, it was forced to move sequentially through the tiles to create the bitstream
in a linear fashion. This was primarily because there are natural dependencies in the
MPEG bitstream - tile 1 could not start creating the bitstream from its data until it
had received some parameters from tile 0 regarding (for example) the current state
of the feedback loop for rate control. The primary goal of finding a way to break this
dependency and implementing it is described in Chapter 5.
The next barrier to scalability was that the baseline code was not able to partition
the problem to make use of all available tiles. More than one tile could be used per
26
j
0
0
1
1
0
0
0
1
1
2122
2 2
0
0
1
1
22
0
0
1
1
22
2
0
0
0
1
1
1
22
6 16 16
6
7
7
7
7
7
8
8
8
8
818
8
8
8
8
9
9
9
9
9
9
9
9
9
10 6 10 6 10 6 106
106
106
106
106
12
12
12
12
12 112
12
13
13
13
13
13
13
1 141
14
6
6 16
414
7
13
0
1
2
0
0
01010
0
1
1
1
2
222
1
1
22
0
0
1
0
1
212
2 2
6 16 16
6 6
6 16
6
6
6 6
7
7
7
7
7
7
7
7
7
7
7
8
8
8
8
8
8
8
8
818 _8
8
9
9
9
9
9
9
9
9
9
9
9
10
6
10
6
10
6
10
6
10
6
610
610
610 6 10 6 10 6 10 6 10 6 10 6 10 ]
12
12
12
12
12
12
12
12
12
13
13
13
13
13
13
13
13
13
141414
14
14
14
14
14
14
14
14
6 16
7
7
6
7
6
7
9
9
12
12
112
12
12
13
13
13
13
13
13
14
14
14
12
1
4
Figure 4-2: 352x240 pixel video frame mapped to a 15 tile Raw machine.
row, but if more than one row was on a tile then it had to be the whole row. For
example, if the input file was 352x240 pixels (or 15 rows of 22 Macroblocks each) then
the problem would be partitioned so that one tile handled each row of Macroblocks.
This is fairly good use of resources - it only leaves one tile idle. But if this same
example were run on 8 tiles, only 5 tiles would get used (3 rows on each tile) leaving
close to half the resources idle. Chapter 6 describes a method for partitioning that
applies as much of the available resources as possible to the problem.
Lastly, for each frame that was encoded, there was a data synchronization that
had to occur so that all tiles were seeing the necessary reconstructed frame data.
Because motion estimation looks in a 16 pixel region around each macroblock, each tile
needs reconstructed frame data that is resident on other tiles. A simple example is the
case of mapping a 352x240 pixel video frame onto 15 tiles shown in Figure 4-2. In
this case, each row of 22 macroblocks is resident on a different tile. But when tile 4
does its motion estimation, it will need data which is resident on tiles 3 and 5. This
border data must therefore be synchronized every frame.
Using the static network of Raw to synchronize large chunks of data involves
very careful choreography of the network routes between tiles. The simplest way
to accomplish this was to have every tile read in, store and synchronize the entire
video frame. This is quite wasteful, however, as can be seen from the earlier data
map. In order to make the implementation scale to large numbers of tiles, it would
be necessary to come up with a methodology for routing just a subset of the data in
each frame. This "border routing" is outlined in Chapter 7.
Chapter 5
Parallelizing MPEG on Raw
Extracting much of the spatial parallelism¹ from the MPEG algorithm is fairly straightforward. This is due to the fact that the algorithm primarily operates on one macroblock of the image at a time.
Since a single frame of video typically contains
hundreds of macroblocks, the obvious method is to divide up the macroblocks evenly
among however many tiles you have available. This simplistic approach to parallelism
works for almost all phases of the algorithm. To see where it breaks down we need
to take a closer look at the inner workings of MPEG.
5.1 A Parallel Encoder
The primary differences between the parallel MPEG encoder shown in Figure 5-1 and
the more traditional single processor version in Figure 2-1 are:

* The incoming image data is split up between the tiles, so that only a portion
  of each picture is actually being fed into the input frame buffer on each tile.

* Each tile is creating a serial bitstream which is only a portion of the output
  bitstream, and which must be concatenated in a post-processing step.

* There is an additional Tile Synchronization stage required before the reconstructed
  picture can be placed into the Reconstructed Frame Buffer. This is
  necessary to properly perform motion compensation since each tile needs to see
  not only its own data but also data in a 16 pixel region around all of its data,
  and some of this region may be resident on other tiles.

¹By spatial parallelism we are referring to parallelism that can be extracted in the compression of
a single frame of video data. This is in contrast to temporal parallelism, which can only be extracted
by allocating frames to different processing elements, and which cannot be used to perform truly
on-line real-time encoding [22].
Figure 5-1: A parallel MPEG encoder.
The phases of the algorithm which roughly correspond to modules in the MSSG
code have been shaded and numbered. Throughout the rest of this thesis, we will
often refer to the phases by their module name, and so they will be outlined here (a sketch of the per-tile frame loop built from these phases follows the list).
1. memcpy - Although a final implementation would be accepting data from a real-time data source such as a camera, we are approximating this data movement
phase using a standard memcpy of the image data to a data structure local to
each tile.
2. motion - Motion estimation and generation of the motion vectors.
3. predict - Motion compensation and creation of the prediction.
4. transform - Includes the subtraction of the predicted frame (if any) and the
forward DCT of the resulting data.
5. putpic - In this module, the transformed data is quantized and run through
a variable length coder (VLC). The quantization factor is continuously varied
based on the kind of picture (I, P or B), the bits that have been allocated for the
GOP and the current buffer fullness.
6. iquant - This is simply the inverse of the quantization step.
7. itransform - Inverse DCT and addition of the predicted frame (if any).
8. sendrecv - Reference frame data synchronization between tiles.
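The sketch below shows the per-tile frame loop implied by this list, using hypothetical stub functions named after the modules; the real phases operate on per-tile frame buffers and the Raw networks, which are omitted here.

    #include <stdio.h>

    /* Stub phase functions named after the modules above.  Placeholders for
     * illustration only; they stand in for the real per-tile phase code. */
    static void memcpy_phase(void)     { puts("  memcpy");     }
    static void motion_phase(void)     { puts("  motion");     }
    static void predict_phase(void)    { puts("  predict");    }
    static void transform_phase(void)  { puts("  transform");  }
    static void putpic_phase(void)     { puts("  putpic");     }
    static void iquant_phase(void)     { puts("  iquant");     }
    static void itransform_phase(void) { puts("  itransform"); }
    static void sendrecv_phase(void)   { puts("  sendrecv");   }

    int main(void)
    {
        /* Every tile runs the same per-frame sequence; sendrecv at the end of
         * each frame acts as the barrier discussed in Section 5.2. */
        for (int frame = 0; frame < 2; frame++) {
            printf("frame %d\n", frame);
            memcpy_phase();
            motion_phase();      /* skipped for I pictures in the real encoder */
            predict_phase();
            transform_phase();
            putpic_phase();
            iquant_phase();
            itransform_phase();
            sendrecv_phase();
        }
        return 0;
    }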
5.2 Conversion to Arbitrary Slice Size
As outlined earlier, in the case of the baseline code, the only options for partitioning
the incoming data are either an integer number of rows per tile or an integer number of
tiles per row. The MSSG code appeared to adhere to a convention that a single row of
macroblocks was equivalent to a slice. The combination of these two factors required
the rate control function for a single slice to potentially span multiple tiles. This in
turn required a number of parameters related to the state of the rate control function
to be passed from one tile to the next in sequential fashion, effectively sequentializing
the putpic phase of the algorithm. In addition to this, since MPEG is a bitstream,
each successive tile needs to know the bit location that the previous tile last wrote
before it can proceed. Both these issues must be addressed in order to break this
dependency.
The effects of this dependency on a real test run of the baseline code are shown
in Figure 5-2. Here it can be seen that the first four phases of the encoding proceed
in parallel, and that the sequential aspect of putpic is a significant portion of the
runtime. Clearly as we scale to more tiles, this will become the single limiting factor
in performance. It should be noted that a similar recognition of the sequential aspect
of this phase can be found in the literature [16, 11, 20].
Figure 5-2: Individual tile activity for a baseline encoding.
The seminal realization in breaking this dependency was that slices could be an
arbitrary number of macroblocks [7, 9]. This, in turn, assured that a slice never had
to span tiles, and obviated the need for passing rate control parameters. Because
slices are by definition word aligned, it also meant that the output of each tile no
longer had to be concerned with the bit-level alignment of any other tile.
It is interesting to note that the purpose of the slice in the MPEG standard is
32
primarily for error recovery [7].
A decoder can drop a slice, for example, without
dropping an entire frame. For Raw, however, the slice becomes the ultimate parallelization construct. In effect it is an arbitrarily sized (at least to the granularity of
a single macroblock) sub-frame which can be allocated its own bit-budget for rate
control purposes².
One subtlety that was passed over in the previous discussion is that although slices
can be of arbitrary size, they are actually not allowed to span rows of macroblocks³.
This is not a difficult exception to make; it simply requires that any tile handling a
slice that spans multiple rows must make sure that it terminates slices at the end of
the row and creates the proper headers at the beginning of rows. This, in turn, will
typically lead to a slightly larger output file size.
We can now create a new mapping for a 352x240 pixel example running on 16
tiles. This is shown in Figure 5-3. Each block represents a macroblock of the frame
and the number is the tile which is responsible for that macroblock. Separate slices
are denoted by heavy outlines.
Figure 5-4 shows the result of parallelizing the putpic phase. All phases are now
running in parallel, and for Tile 9 there is no idle time at all. For all other tiles,
the idle time is simply the difference between their motion estimation phase and
the same phase on Tile 9. Because the rest of the algorithm even after this phase
can still proceed in parallel, this idle time does not actually show up until before
sendrecv. All tiles must participate in data communication, so this phase acts as a
kind of barrier at the end of each frame.
²Although this usage of the slice is clearly allowable under the ISO standard [9], not all decoders
will properly interpret the resulting bitstream. In particular, the MSSG decoder does not handle it
correctly. Once we moved to this construct, we were forced to use a more robust commercial grade
decoder such as CyberLink PowerDVD for verification.

³This is actually a contradiction between the ISO standard [9] and the popular literature [7], the
latter of which states that a slice can be "as big as the whole picture". Interestingly, the MSSG
decoder does properly decode bitstreams which contain slices that span multiple rows.
Figure 5-3: Slice map for 352x240 video on 16 tiles.
Figure 5-4: Individual tile activity after parallelizing putpic.
Chapter 6
Partitioning MPEG on Raw
Simply solving the parallelization problem is not sufficient for a truly efficient implementation of MPEG on Raw. By default, the MSSG code operates primarily at the
macroblock level, which is to say that most of the loops proceed sequentially through
all the macroblocks in a frame.
In creating the baseline parallel implementation,
however, these loops were changed to be more row oriented. Now that we had a parallel methodology that would allow placement of an arbitrary number of macroblocks
on each tile, it was necessary to go back through the code and reorganize it on a
macroblock level.
The first step in doing this was to create a generic partitioning algorithm that
would work for any size image and any number of tiles. This involved:
* Calculating the number of macroblocks that each tile would be responsible for
  (mb_most_tiles);

* Given the above, then calculating the maximum number of tiles that could be
  applied (tiles_used);

* Given that, calculating the number of macroblocks (if any) that must be handled
  by the last or "remainder" tile (mb_last_tile).

* Finally, creating an "ownership map" data structure which simply held the
  identity of the owner for a given macroblock.
Since each tile must "own" an integer number of macroblocks, it will not always be
possible to perfectly partition the frame (and thus the task). The "best" partitioning
would be achieved when most of the tiles own the fewest number of macroblocks
possible such that the last tile owns fewer macroblocks than the others. If mb_total
is the total number of macroblocks in a frame and tiles is the total number of tiles
available, this proceeds as follows (a C sketch of these rules follows the list):

1. mb_most_tiles = ceil(mb_total / tiles)

2. Find the largest tiles_used such that tiles_used * mb_most_tiles <= mb_total

3. mb_last_tile = mb_total - tiles_used * mb_most_tiles
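A minimal C sketch of these three rules is shown below; the variable names follow the text, but the function itself is illustrative rather than the thesis implementation.

    #include <stdio.h>

    /* Partition mb_total macroblocks across the available tiles using the
     * rules above.  Illustrative sketch only. */
    static void partition(int mb_total, int tiles)
    {
        int mb_most_tiles = (mb_total + tiles - 1) / tiles;   /* ceil(mb_total / tiles) */
        int tiles_used    = mb_total / mb_most_tiles;         /* largest t with t * mb_most_tiles <= mb_total */
        int mb_last_tile  = mb_total - tiles_used * mb_most_tiles;
        printf("%4d MBs on %3d tiles: %3d tiles x %3d MBs, remainder tile %3d MBs\n",
               mb_total, tiles, tiles_used, mb_most_tiles, mb_last_tile);
    }

    int main(void)
    {
        partition(330, 16);   /* 352x240 on 16 tiles: 15 x 21 plus a 15-MB remainder tile */
        partition(1350, 64);  /* 720x480 on 64 tiles: 61 x 22 plus an 8-MB remainder tile */
        return 0;
    }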
Table 6.1 shows how the MPEG encoding is partitioned on four example frame
sizes for a wide range of available tiles. As the table shows, as long as the number of
macroblocks per tile is larger than the number of available tiles, the algorithm will
always use all available tiles. When the number of macroblocks per tile gets smaller
than the number of available tiles, it gets more difficult to use all available tiles.
Although the table only shows available tiles that are a power of two, the partitioning
algorithm will create the optimal partitioning for any number of available tiles.
           352x240             640x480             720x480             1920x1080
Tiles    used  MBs  last     used  MBs  last     used  MBs  last     used  MBs  last
1           1  330     0        1 1200     0        1 1350     0        1 8160     0
2           2  165     0        2  600     0        2  675     0        2 4080     0
4           4   83    81        4  300     0        4  338   336        4 2040     0
8           8   42    36        8  150     0        8  169   167        8 1020     0
16         16   21    15       16   75     0       16   85    75       16  510     0
32         30   11     0       32   38    22       32   43    17       32  255     0
64         55    6     0       64   19     3       62   22     8       64  128    96
128       110    3     0      120   10     0      123   11     8      128   64    32
256       165    2     0      240    5     0      225    6     0      255   32     0

Table 6.1: Macroblock partitioning for a variety of available tiles and problem sizes.
The discussion in Chapter 5 does expose the one weakness in the partitioning
methodology. Since most tiles have to wait for the slowest tile to finish before they
can proceed to sendrecv, the time it takes to do motion estimation on the slowest
tile becomes the "weakest link in the chain".
For our examples the effects are not
bad, but if one were to use a larger search space for motion estimation, this would
increase the percentage of time spent in this phase and the difference between slowest
and fastest tile would also increase. More serious could be a particular video signal
which included fast moving objects in just one area of the screen. This might bog
down motion estimation for a particular tile as it continually needs to search large
areas to find best matching blocks.
One solution which might be effective would be to continually monitor the time
it takes for each tile to do motion estimation and vary slightly the number of macroblocks assigned to each tile as a result. Tiles which are responsible for regions with
more activity would naturally be assigned fewer macroblocks, while those in low activity regions would get more. This dynamic "load balancing" is another advantage of
the ability to have arbitrary slice sizes. The disadvantage of this methodology is that
it takes the communication pattern from being static to being dynamic which, as we
will see in Chapter 7, would require new hardware support to accomplish efficiently.
Chapter 7
Making Communication Efficient
The bulk of the effort in this thesis revolved around improving the efficiency of the
communication between tiles. Although the greatest performance gains came from
parallelizing the putpic phase up to 16 tiles, communication becomes the limiting
factor as MPEG encoding on Raw is scaled to 64 or more tiles.
7.1 Improvements to memcpy
The most obvious inefficiency in the baseline code, and the easiest to address, was the
fact that the memcpy phase loaded the entire frame into every tile. The inefficiency
here is obvious, and is particularly acute as the number of tiles increases. The solution,
however, is unfortunately not as simple as only bringing in the macroblocks that a
particular tile "owns". In order to perform motion analysis, it is also necessary for
each tile to have access to data all around the border of its macroblocks¹.
Part
of the initialization phase was therefore a border discovery where each tile goes
through its macroblocks and determines the identity of the macroblocks on its border.
Figure 7-1 shows an example of the border (in gray) around tile 3 in the 352x240
example on 16 tiles.
The size of the border for a given tile can be up to 2n + 6, where n is the number of macroblocks on that tile.
¹Akramullah [15] describes this as "overlapped data distribution". However, because this study
was contemplating processing elements with much higher communication latencies than Raw, this
method was not sufficient to provide good scalability with a large number of processing elements.
Figure 7-1: Bordering macroblocks required for motion estimation.
This can still represent a significant decrease in the
amount of data that must be transferred. If we consider the 352x240 example on
16 tiles again, Table 7.1 compares the total number of macroblocks that need to be
transferred with and without border discovery. With border discovery, over 7 times
less data needs to be brought into the tiles each frame, and this translates directly to
the performance of this phase.
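The C sketch below illustrates border discovery for the 352x240, 16-tile case: it counts the macroblocks in the one-macroblock ring around a tile's consecutive allocation. It is an approximation for illustration (it assumes the straight run-length allocation of Figure 5-3 and is not the thesis code); for Tiles 0 and 1 it reproduces the first two Border Only entries of Table 7.1.

    #include <stdio.h>

    #define MB_COLS 22          /* 352x240 frame: 22x15 = 330 macroblocks   */
    #define MB_ROWS 15
    #define MB_PER_TILE 21      /* 16-tile partitioning from Table 6.1      */

    /* Owner of a macroblock under the consecutive allocation assumed here. */
    static int owner(int mb) { return mb / MB_PER_TILE; }

    /* Count the border of one tile: macroblocks owned by someone else that
     * touch (8-neighborhood) at least one macroblock this tile owns. */
    static int border_count(int tile)
    {
        int count = 0;
        for (int mb = 0; mb < MB_COLS * MB_ROWS; mb++) {
            if (owner(mb) == tile)
                continue;
            int row = mb / MB_COLS, col = mb % MB_COLS, touches = 0;
            for (int dr = -1; dr <= 1 && !touches; dr++)
                for (int dc = -1; dc <= 1 && !touches; dc++) {
                    int r = row + dr, c = col + dc;
                    if (r >= 0 && r < MB_ROWS && c >= 0 && c < MB_COLS &&
                        owner(r * MB_COLS + c) == tile)
                        touches = 1;
                }
            count += touches;
        }
        return count;
    }

    int main(void)
    {
        /* Expect 23 and 44, matching the first two rows of Table 7.1. */
        printf("tile 0 border: %d\n", border_count(0));
        printf("tile 1 border: %d\n", border_count(1));
        return 0;
    }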
7.2 Improvements to sendrecv
The baseline code accomplished the sendrecv phase by having each tile send the data
it owned to every other tile. We will call this full routing. The net result of this
was that by the end of the sendrecv phase, every tile had a complete version of the
reference frame in its reconstructed frame buffer. The problem here is analogous to
the problem with the original memcpy, which is simply that the only data that actually
needs to be in the reconstructed frame buffer for a given tile is the data corresponding
to the macroblocks that it owns as well as the border around those macroblocks.
However, since these border macroblocks are distributed around the tiles, the solution
of border routing is significantly more difficult to achieve even for a single problem
42
size, let alone for the generalized problem.

Tile         Full Frame    Border Only
0            330           23
1            330           44
2            330           46
3            330           46
4            330           46
5            330           46
6            330           46
7            330           46
8            330           46
9            330           46
10           330           46
11           330           46
12           330           46
13           330           46
14           330           38
15           330           17
Total MBs    5280          720

Table 7.1: Number of macroblocks required on each tile in full frame memcpy compared
to memcpy with border data.
7.2.1 Full Routing and Elemental Routes in Raw
It is useful to understand how full routing works on Raw in more detail so that one can
better understand the gains that may or may not be achievable with border routing.
First we must look again at the architecture of the Raw microprocessor.
A 16-tile
Raw microprocessor is arranged as a 4x4 grid, and each tile has both a compute
processor as well as a switch processor for routing data. Each processor has its own
instruction set and each runs separate code which must be carefully synchronized by
the programmer. In the figures, we will represent the compute processor as a small
box in the upper left of the tile with the switch processor in the lower right.
As Figure 7-2 shows, in order for Tile 0 to send its data to all the other tiles, each
tile must not only be ready to receive the data but also ready to route the data as
required depending on the identity of the sending tile². After Tile 0 has sent all of
its data, each tile will load a new route into its switch and Tile 1 will begin sending.
This process will proceed until all tiles have been senders.

It can be seen from Figure 7-2 that the sending of data from Tile 0 to the other
tiles requires a total of one sending route (labelled _txES) and four receiving routes
(labelled _rxN_txES_k, _rxN_txE_k, _rxW_txE_k, and _rxW)³. Raw has the ability
to send to up to four destinations at once or to route to three destinations while
receiving. This leads to 15 elemental sending routes as well as 15 elemental receiving
routes from each of four directions, for a total of 75 possible routes. The 15 sending
routes are shown in Figure 7-3, while the 15 receiving routes from the East are shown
in Figure 7-4. The 15 receiving routes from the North, South and West are similar
and can be easily inferred.

²If any tile is not ready, the other tiles will stall. If one or more tiles loads the wrong route into
the switch, then the entire Raw processor will deadlock. This is a source of much debugging fun!

³Our labelling convention for routes is as follows:

* If the tile is a receiver, start with _rx and the direction from which data is being received;

* If the tile is a sender, add _tx and the direction(s) to which data will be sent, starting with
  N and going clockwise;

* If the tile is a router (a receiver and a sender) and the data being routed will also be delivered
  to the processor, add a _k (for "keep").

Figure 7-2: Example routes for Tile 0 sending data to all other tiles.
Figure 7-3: Possible routes for sending on Raw.

Figure 7-4: Possible routes for receiving from the East on Raw.
The first step in actually implementing either full or border routing was to create
the switch code for all 75 of these cases. For our purposes, the number of bytes that
would be sent on a given route was not known in advance, and so it was necessary to
create switch code that allowed for routing of an arbitrary sized block of data. The
following code segment accomplishes this for the "send east" route:
    _txE:
            move   $0, $csto
            bnezd- $0, $0, .        route $csto->$cEo
            nop                     route $0->$csti

Similarly, the following code segment will cause the switch to receive data from
the East, route it to the North, South and West and deliver it to the processor:
    _rxE_txNSW_k:
            move   $0, $csto
            bnezd- $0, $0, .        route $cEi->$csti,$cEi->$cNo,$cEi->$cSo,$cEi->$cWo
            nop                     route $0->$csti

The previous code fragments must be called at the appropriate time by the main
processor code. In order to determine which of the 75 routes a given tile should run,
case statements are used to set the function pointer. For example, the case statement
for the send function in the case of full routing is:
    static void set_send_fn(void (**f)(void)) {
        switch (tile_id) {
        case 0:                  /* corner tiles send in two directions */
            *f = _txES;
            break;
        case 3:
            *f = _txSW;
            break;
        case 12:
            *f = _txNE;
            break;
        case 15:
            *f = _txNW;
            break;
        case 1:                  /* edge tiles send in three directions */
        case 2:
            *f = _txESW;
            break;
        case 4:
        case 8:
            *f = _txNES;
            break;
        case 7:
        case 11:
            *f = _txNSW;
            break;
        case 13:
        case 14:
            *f = _txNEW;
            break;
        case 5:                  /* center tiles send in all four directions */
        case 6:
        case 9:
        case 10:
            *f = _txNESW;
            break;
        default:
            rawtestfail(0x000);
            break;
        }
    }
The route is selected and the switch code is invoked as follows, where each of the
lw instructions is essentially loading one word of the macroblock into the switch:
    set_send_fn(&switch_fn);
    __asm__ volatile ("mtsr SWPC,%0" : : "r" (switch_fn));
    counter = mb_per_tile*64 - 1;   /* number of Y words to send */
    __asm__ volatile ("move $csto,%0" : : "r" (counter));
    __asm__ volatile ("lw $csto,%0" : : "m" (p[0]));
    __asm__ volatile ("lw $csto,%0" : : "m" (p[4]));
    __asm__ volatile ("lw $csto,%0" : : "m" (p[8]));
    __asm__ volatile ("lw $csto,%0" : : "m" (p[12]));
The case statement for receiving is somewhat more involved since the appropriate
function depends not only on the identity of the receiving tile, but also the sending
tile. The concept, however, is the same.
7.2.2 Border Routing
The first step in border routing is similar to the border discovery of the improved
memcpy.
We calculate a receiver opcode which is a bitfield containing as many
bits as there are potential receivers. As a result of the partitioning described earlier,
we already have an ownership map for each macroblock which also can be used to
determine the sender of a given macroblock. With these two pieces of information,
each macroblock will have attached to it a route specifying where it is coming from
and all the places it is going. This route can be specified as {scndcr, receiveropcodc}.
48
Figure 7-5: Determination of the receiver opcode when Border Routing.
Filling in the receiver opcode bitfield is a matter of an initialization function which
walks through each macroblock and sets bit n if Tile n is a receiver. In Figure 7-5,
Tile 3 needs to send its first macroblock to Tiles 1, 2 and 4. This translates into
a receiver opcode of 10110 binary, or 0x16. Tile 14, on the other hand, has 6 macroblocks
that only need to go to Tile 13, and therefore have an opcode of 10000000000000 binary, or
0x2000. Table 7.2 shows part of the route map for the example of 352x240 on 16 tiles.
The full route map has 45 entries, which translates into 45 unique routes that must
be programmed in order to border route the macroblocks. An example of the route
{3, 0x34} is shown in Figure 7-6.
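A small C sketch of this opcode initialization is shown below. It assumes the consecutive 21-macroblock allocation of Figure 5-3 and an 8-neighbor border, and is illustrative rather than the actual initialization function; for Tile 3's first macroblock it reproduces the 0x16 opcode described above.

    #include <stdio.h>

    #define MB_COLS 22              /* 352x240: 22x15 = 330 macroblocks       */
    #define MB_ROWS 15
    #define MB_PER_TILE 21          /* 16-tile partitioning from Table 6.1    */

    /* Owner of a macroblock under the consecutive allocation assumed here. */
    static int owner(int mb) { return mb / MB_PER_TILE; }

    /* Receiver opcode for one macroblock: set bit n if Tile n owns any of the
     * eight neighboring macroblocks (the motion estimation border). */
    static unsigned receiver_opcode(int mb)
    {
        int row = mb / MB_COLS, col = mb % MB_COLS, self = owner(mb);
        unsigned opcode = 0;
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                int r = row + dr, c = col + dc;
                if (r < 0 || r >= MB_ROWS || c < 0 || c >= MB_COLS)
                    continue;
                int n = owner(r * MB_COLS + c);
                if (n != self)
                    opcode |= 1u << n;
            }
        }
        return opcode;
    }

    int main(void)
    {
        /* The first macroblock owned by Tile 3 is MB 63; expect opcode 0x16
         * (receivers are Tiles 1, 2 and 4, as in Figure 7-5). */
        printf("MB 63 (Tile 3): opcode %#x\n", receiver_opcode(63));
        return 0;
    }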
On the surface, there would seem to be a lot to gain from border routing. After
all, since the majority of tiles only actually need to send data to a few neighboring
tiles, there should be a lot of wasted communication to those other tiles that can be
eliminated. However, the architecture of the Raw network is such that delivering an
operand to a processor is a zero cost operation if that tile is also involved in routing.
In the example route of Figure 7-6, Tiles 0 and 1 are not receivers, but if they were
the route would not run any slower. We should be able to quantify the advantage of
border routing by comparing the theoretical number of cycles it should take to do a
full routing of the 352x240 example with border routing of the same data.

Table 7.2: Partial border routing map for 352x240 video on 16 tiles.

Figure 7-6: Example routes for Tile 3 sending to Tiles 2, 4 and 5.
The Raw network requires three cycles to communicate an operand between neighboring
tiles: one to get from the processor to the switch in the first tile, one to get
from the first tile switch to the neighbor tile switch, and a third to get from the
second switch to the second processor. However, once a pipeline is set up, operands can
be delivered every cycle and so if the size of a data block being transferred is large
enough (as it is in this case) the startup time of the pipeline can be safely ignored.
Since a macroblock is the same size regardless of the kind of routing being performed,
and since all we are looking for is the relative speedup of border routing compared
to full routing, we can also factor out the number of cycles required to emit a single
macroblock from one of the tiles. This leaves us to focus on the product of the number
of network hops between the sender and the furthest destination for each route and
how many macroblocks follow that route.
Making this calculation for the full routing case is fairly simple, since there is only
one route for each tile. The four corner tiles are 6 hops from the furthest destination,
and the four center tiles are 4 hops. All other tiles are 5 hops. All the tiles have 21
macroblocks except the last tile which has 15. So the total number of macroblock-hops
is 21 · (3 · 6 + 4 · 4 + 8 · 5) + 15 · 6 = 1644.
The calculation for border routing is a bit more involved since there are 45 routes.
While full routing does not have any routes that are shorter than 4 hops, border
routing does not have any routes that are longer than 4 hops and has a large number
that are only one hop.
Table 7.3⁴ compares full routing with border routing by
showing the number of macroblocks that have to travel a particular distance in each
case.
The net result for border routing is a total of 752 macroblock-hops.
This
means that the most one could expect to improve the sendrecv phase using border
routing for this example is 1644/752 or about 2x if we take into account the additional
route creation overhead inherent in border routing. As we will see in Chapter 8, the
⁴It is interesting to note that although all the macroblocks in the frame (330) get routed in both
cases for this example, as the number of macroblocks resident on a single tile gets larger (either
because the image is larger or the number of tiles is smaller), a significant number of macroblocks
may not need to be routed at all since they do not have a border that is resident on another tile.
                 Full                         Border
Max Hops     MBs      MB-hops           MBs      MB-hops
1              -            -           172          172
2              -            -            20           40
3              -            -            12           36
4             84          336           126          504
5            168          840             -            -
6             78          468             -            -
Total        330         1644           330          752

Table 7.3: Comparison of the number of network hops required to full route and
border route 352x240 video on 16 tiles.
experimental results bear this out, although they also expose a feature of border
routing that was not anticipated. Unlike full routing, it is not necessary for all tiles
to have entered the sendrecv phase before routing can begin. The effect of this is that
some of the wait time before sendrecv is hidden. Unfortunately, because there is still
an explicit barrier at frame boundaries, this does not have a true performance benefit
in the current implementation. If a future implementation was to dynamically allocate
varying numbers of macroblocks across tiles as described at the end of Chapter 6,
however, this feature could in fact be useful.
7.2.3 Border Routing on Larger Raw Machines
One of the big problems with border routing is that it removes much of the generality
that was possible with full routing. This is because the required routes are a function
of the way a particular frame size maps to a particular number of tiles. Fortunately,
the route set for 352x240 on 16 tiles is a superset of what is needed to border route the
same size frame on less than 16 tiles. Still, the loss of generality with respect to frame
size is probably not worth
just a 2x improvement in this one phase of the algorithm.
The real advantages of border routing do riot become truly apparent until we start
using a larger number of tiles.
Unfortunately, at this point the number of unique
routes quickly becomes unmanageable, at least if the coding of these routes is being
done manually. In the end, we decided that it was important to implement border
routing on a 64 tile example in order to prove that Raw was capable of real-time
[Table 7.4: Partial border routing map for 720x480 video on 64 tiles. Each row lists a sending tile, the macroblock range it owns, the destination tiles of the route, and the resulting route opcode; only the routes for the first five sending tiles are shown.]
A 720x480 pixel video frame has 45 macroblocks in each of 30 rows for a total of 1350 macroblocks. The mapping of a 720x480 frame to a 64 tile Raw microprocessor is shown in Figure 7-7 (because the figure is so dense, the ownership blocks are shaded simply so that it is easier to pick out the borders visually). Recall from the partitioning in Table 6.1 that only 62 tiles are actually used; most tiles will have 22 macroblocks while the last will have 8. This tile mapping generates a routing map with 299 unique routes. The beginning of the route map, showing the routes for the first five sending tiles, is in Table 7.4. The routes for the higher numbered tiles would have very large (up to 62 bit) opcodes and would make it difficult to properly code the massive case statement needed to set up the routes.
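The even partitioning assumed here can be sketched as a small helper. The rule below is an illustration that is consistent with the two partitionings quoted in this chapter (21 macroblocks per tile with 15 on the last for 352x240 on 16 tiles, and 22 per tile on 62 tiles with 8 on the last for 720x480 on 64 tiles); the actual partitioning used by the encoder is the one described in Chapter 6.

/* Sketch of the even partitioning assumed above: each used tile gets
 * ceil(total/tiles) macroblocks, some trailing tiles may go unused,
 * and the last used tile takes the remainder.  Illustrative only;
 * see Chapter 6 and Table 6.1 for the encoder's actual partitioning. */
static void partition(int total_mbs, int tiles,
                      int *mbs_per_tile, int *tiles_used, int *mbs_on_last)
{
    *mbs_per_tile = (total_mbs + tiles - 1) / tiles;                  /* 22 for 1350 on 64 */
    *tiles_used = (total_mbs + *mbs_per_tile - 1) / *mbs_per_tile;    /* 62                */
    *mbs_on_last = total_mbs - (*tiles_used - 1) * *mbs_per_tile;     /* 8                 */
}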
Figure 7-7: 720x480 pixel video frame mapped to a 64 tile Raw machine.
[Table 7.5: Partial border routing map for 720x480 video on 64 tiles with biased opcode. Each row lists a sending tile, the macroblock range it owns, the relative destinations (-3 through +3), and the 7-bit biased opcode; only the routes for the first five sending tiles are shown.]
The resulting route map has two important characteristics that suggest some optimizations, however. First, it turns out that a given tile never needs to communicate with a tile further than three behind it or three in front of it in absolute tile numbering (this does not mean that the tile is physically within three hops). This allows us to create a relative or biased opcode instead of the absolute opcodes that we used in the 16 tile case. This biased opcode only needs to be 7 bits long. When this shorter opcode is calculated for all the routes, it exposes the second important characteristic of the route map, which is that, when expressed in this form, the opcode pattern repeats regularly. This fact would allow us to group many of the tiles together in the case statements. The repeating pattern can be seen in tiles 2, 3 and 4 in Table 7.5 and continues through most of the remaining routes until the last few tiles.
(There are a few boundary condition exceptions to this in the mapping. For example, the first macroblock of Tile 45 does not border Tile 44 as would be "normal" for all the other tiles. However, these exceptions are few and can be handled with special cases.)
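As an illustration of the idea, a biased opcode could be formed along the following lines. The bit layout shown here (bit 0 for the tile three behind the sender, bit 6 for the tile three ahead) is an assumption made for this sketch and is not necessarily the encoding used in the encoder.

/* Hypothetical helper: re-express a set of absolute destination tiles
 * as a 7-bit opcode relative to the sender.  Border routing guarantees
 * that every destination lies within three tiles of the sender, so
 * seven bits are enough.  Illustrative only. */
static unsigned char biased_opcode(int sender, const int *dests, int ndests)
{
    unsigned char op = 0;
    for (int i = 0; i < ndests; i++) {
        int rel = dests[i] - sender;            /* in [-3, +3] by construction */
        op |= (unsigned char)(1u << (rel + 3)); /* bit 3 would be the sender itself */
    }
    return op;
}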
The basic format for the case statement that sets the routes for the receiving tiles
can therefore take the form of a nested case statement with the first switch based on
the sender and the second switch based on the biased opcode, as seen here:
switch (mb_owner[k]) {
case 0:
    switch (border_tile_lo[k]) {
    case 32:
        *f = _rx_W;
        break;
    case 96:
        if (tile_id == 2) *f = _rx_W_tx_E_k;
        else *f = _rx_W;
        break;
    case 112:
        if ((tile_id == 1) || (tile_id == 2)) *f = _rx_W_tx_E_k;
        else *f = _rx_W;
        break;
    default:
        raw_test_fail(0x019);
        break;
    }
    break;
case 1:
    switch (border_tile_lo[k]) {
    case 36:
        if (tile_id == 0) *f = _rx_E;
        else *f = _rx_W;
        break;
    case 32:
        *f = _rx_W;
        break;
    case 96:
        if (tile_id == 3) *f = _rx_W_tx_E_k;
        else *f = _rx_W;
        break;
    case 112:
        if ((tile_id == 2) || (tile_id == 3)) *f = _rx_W_tx_E_k;
        else *f = _rx_W;
        break;
    default:
        raw_test_fail(0x020);
        break;
    }
    break;
We are now in a position to calculate the theoretical advantage of border routing compared to full routing for the example of 720x480 data on 64 tiles. The details of the calculation are unimportant, but the results are summarized in Table 7.6. As the table shows, when border routing this example one could expect to see a speedup of 14061/6001, or about 2.3x, compared to full routing.
Max Hops | Border MBs | Border MB-hops | Full MBs | Full MB-hops
   2     |    611     |      1222      |          |
   3     |     95     |       285      |          |
   6     |     28     |       168      |          |
   7     |    602     |      4214      |          |
   8     |     14     |       112      |    84    |      672
   9     |            |                |   168    |     1512
  10     |            |                |   252    |     2520
  11     |            |                |   336    |     3696
  12     |            |                |   239    |     2868
  13     |            |                |   147    |     1911
  14     |            |                |    63    |      882
 Total   |   1350     |      6001      |  1350    |    14061

Table 7.6: Comparison of the number of network hops required to full route and border route 720x480 video on 64 tiles.
7.3  Alternate Macroblock Allocation
In the process of mapping a variety of problem sizes to a variety of tiles, we have been consistent about simply mapping macroblocks to tiles in order. That is, if most tiles will be getting 22 macroblocks, then the first tile will get the first 22, the second tile will get the second 22, and so on. While this simplifies the programming in the generic case, it will seldom achieve the smallest possible total border size, and therefore the fastest routing of the border data. Achieving that would require an algorithm that attempted to allocate more square regions of macroblocks to each tile, thereby making the perimeters smaller and ensuring that, even for a small number of macroblocks per tile, some of the border data will be owned by that tile. This allocation would also more closely match the physical layout of the Raw chip, thereby reducing the number of hops.
An example of one such mapping for the 352x240 on 16 tiles example is shown in Figure 7-8. When this data is added to our earlier comparison, we get Table 7.7, which has an upper speedup bound for sendrecv of 1644/257, or about 6.4x, compared to full routing.
Max Hops | Border MBs | Border MB-hops | Full MBs | Full MB-hops | Square MBs | Square MB-hops
   0     |            |                |          |              |    110     |        0
   1     |    172     |      172       |          |              |    183     |      183
   2     |     20     |       40       |          |              |     37     |       74
   3     |     12     |       36       |          |              |            |
   4     |    126     |      504       |    84    |     336      |            |
   5     |            |                |   168    |     840      |            |
   6     |            |                |    78    |     468      |            |
 Total   |    330     |      752       |   330    |    1644      |    330     |      257

Table 7.7: Comparison of the number of network hops required to route 352x240 video on 16 tiles when using a more square macroblock allocation scheme.
While this is significant, it is also difficult to generalize, and because of the other advances we have already made in this phase, we face a situation of diminishing returns. Even so, this would be a promising area for improvement when moving to a larger number of tiles.
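As a sketch of what a more square allocation might look like, the helper below divides the frame's macroblock grid into a grid of roughly rectangular blocks, one per tile. This is a generic blocked partition given only for illustration; it is not the exact mapping of Figure 7-8 and it does not balance remainder macroblocks the way the partitioning of Chapter 6 does.

/* Sketch of a blocked ("more square") macroblock-to-tile assignment for
 * a frame of mb_cols x mb_rows macroblocks on a tile_rows x tile_cols
 * mesh.  Illustrative only; not the mapping used in Figure 7-8. */
static int square_owner(int mx, int my, int mb_cols, int mb_rows,
                        int tile_rows, int tile_cols)
{
    int tr = (my * tile_rows) / mb_rows;  /* tile row that owns this macroblock    */
    int tc = (mx * tile_cols) / mb_cols;  /* tile column that owns this macroblock */
    return tr * tile_cols + tc;           /* row-major tile number                 */
}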
The positive side of allocating macroblocks in stripes instead of more square regions, however, might be a natural load balancing of the motion estimation phase. If the regions were more square, it might be more likely that one tile would be stuck with a "busy area" and take significantly longer in the motion estimation phase. Doing any more than guessing at the relative advantages of square regions compared to striped regions would require significantly more research.
7.4  Suggested Improvements to Raw
As may be obvious from the previous discussion, there is one feature that could be added to the static network that would make algorithms requiring data synchronization at regular intervals much easier to code, particularly as the number of tiles grows.
The giant nested case statement that was used to route data on a Raw mesh containing 2^k tiles could have been replaced by a series of opcodes consisting of four elements:
* an n-bit address pointing to a data block
* an n-bit value specifying the length of that data block
* a k-bit value specifying the sender
* a 2^k element bitfield specifying all the receivers
Figure 7-8: 352x240 pixel video frame mapped to a 16 tile Raw machine using a more square macroblock allocation scheme.
As it is, the synchronization stage acts much like a barrier. A new "data synchronization barrier" would simply make this explicit and, in the simplest case, cause all
the tiles to execute these opcodes in order.
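A sketch of what such an opcode might look like as a C structure is given below; the field widths follow the four elements just listed and are sized here for a 64 tile (k = 6) mesh. The layout is purely illustrative and is not part of the Raw architecture or of the encoder.

#include <stdint.h>

/* Illustrative "data synchronization barrier" opcode for a mesh of 2^k
 * tiles (k = 6 here).  At the barrier, every tile would walk the same
 * ordered list of these opcodes; a smarter compiler or library could
 * overlap non-conflicting routes, as discussed below. */
typedef struct {
    uint32_t addr;       /* n-bit address of the data block         */
    uint32_t len;        /* n-bit length of the data block          */
    uint8_t  sender;     /* k-bit number of the sending tile        */
    uint64_t receivers;  /* 2^k-element bitfield of receiving tiles */
} sync_op;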
While hardware support for this would be ideal, it could also be done in the compiler or through library support. Either way, the best implementation would analyze the routes and try to overlap them to the extent possible. For example, in the earlier 64 tile routing, most of the traffic is along pairs of rows. Even though the destinations for a given tile are never more than +3 or -3 away, it is often necessary for tiles further away to participate in the routing (for example, when tile 7 communicates with tile 8, the data passes through tiles 0 through 6). But if one were to start routing on tiles 0, 16, 32 and 48 simultaneously, one could achieve a further 4x speedup in the routing phase.
A second improvement is related to the accessibility of the second static network. The sendrecv phase is simply a data synchronization between processing elements that do not have access to a shared memory, and as such can take advantage of however much bandwidth is available. One of the disappointments with Raw was to discover that even though there are two entirely independent static networks on the chip, it is not possible to make use of both of them for this purpose. Even though the switch can route two operands each cycle and can deliver two operands to the processor each cycle, it cannot accept two operands per cycle from the processor. The result in our application is an idle second static network and a sendrecv phase that takes twice as long as it could with this one architectural change.
Chapter 8
Experimental Results
Armed with the modifications to the standard MSSG code described in the earlier
chapters, we are in a position to show the absolute performance as well as the amount
of parallelism that can be achieved when encoding MPEG video on Raw.
For our testing we used three different sample video sequences. The characteristics of these are summarized in Table 8.1. The first two sequences, on which most of the tests were run, are of the same scene, which is a closeup of a hand performing a poker chip trick[12]. The smaller of the two is just a subsampling of the larger. This was done so that the contents of the scene could be factored out when comparing performance. One frame of the scene is shown in Figure 8-1.
The third sequence is a clip from a DHL advertisement which includes much finer details[6]. It was selected specifically for the purpose of investigating whether the contents of the scene had a significant effect on performance and also as a way of visually comparing the image quality of the encoder. One frame of the scene is shown in Figure 8-2.
The baseline data for all tests in this section are machine cycles, although they
are typically not presented in this fashion. In order to derive more useful units such
as frames/sec, a nominal clock frequency of 425MHz is assumed.
(It actually turned out to be quite difficult to find high resolution video sequences in non-interlaced format. Because of the generality of the encoder, it would be very straightforward to run further tests on a wider variety of sequences if and when they become easier to source.)
Figure 8-1: The video sequences chips360 and chips720.
Figure 8-2: The video sequence dhl.
Name     | Frame Size | # Frames | Total Macroblocks
chips360 |  352x240   |    30    |        330
chips720 |  720x480   |    30    |       1350
dhl      |  640x480   |    30    |       1200

Table 8.1: Characteristics of the video files used for testing the MPEG encoder.
8.1  Phase Speedups
An analysis of the performance of the various phases of the MPEG algorithm in Figure 8-3 makes it very clear that the parallelism that can be extracted from the bulk of the code has been extracted. The only remaining areas impacting the scalability of the code involve communication in the initial loading of the frame data into the tiles, namely memcpy, and in the synchronization of the reference frame in sendrecv. The memcpy phase does exhibit some speedup, and continues to do so up to 64 tiles. The sendrecv phase, on the other hand, does not make sense to even show a speedup curve for, since the phase does not exist in the baseline single tile case, and actually requires more cycles as we add tiles, even with the addition of border routing.
8.2  Absolute Performance and Speedup
If there is a single image that communicates that we achieved the goals set out in the Introduction, it is Figure 8-4. In this graph, the bulk of the improvement in performance over the baseline code, due to parallelization of the putpic stage, is in evidence even in the full routing case. The well partitioned nature of the code is shown by the fact that we have constantly improving performance for any number of tiles. And although the scalable nature of the code is more in evidence in the later 64 tile runs, the advantages of border routing on performance are already considerable even here.
(There is a noticeable trough in the curve from 12 to 14 processors. This can be explained by the fact that these partitionings have a remainder number of macroblocks on the last tile, and therefore are not being fully utilized like the samples on either side: 11 tiles is exactly 30 macroblocks per tile and 15 tiles is exactly 22 macroblocks per tile.)
A similar graph for the case of the 720x480 sequence is in Figure 8-5.
Figure 8-3: Speedup of individual phases of the MPEG Algorithm for a 352x240 sequence on 16 tiles.
Figure 8-4: Speedup and absolute performance for a 352x240 sequence on 16 tiles using hardware.
Overall, speedup is even better as the resolution increases, since the communication phases constitute a smaller percentage of the overall runtime and the "worker phases" such as motion estimation consume a larger percentage. As one example, motion estimation is 44% of the runtime in the 16 tile 352x240 case, and 48% of the runtime in the 16 tile 720x480 case.
As we have mentioned many times, the real barrier to scalability above 16 tiles is communication. This is demonstrated well by the full routing curve in Figure 8-6, which shows that performance at 55 tiles is only half the theoretical maximum. Border routing was earlier demonstrated as the means for dealing with this. However, actually writing the border routing code for 64 tiles was not only prone to error, but extremely difficult to debug, since the 64 tile simulator required 6 hours simply to boot and enter main() in the routing test program.
Fortunately, we can apply an analysis such as the one developed in Section 7.2.2 to determine the theoretical advantage of border routing for this case. Since we have cycle times for every phase from the full routing case, and since the only difference will be in the sendrecv phase, we can recalculate the total cycles by just adjusting the number of cycles in the sendrecv phase. This is how we obtained the estimated data for 30 and 55 tiles shown for the border routing case.
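One way this adjustment could be made is sketched below, under the assumption that sendrecv time scales with the total number of macroblock-hops for the configuration in question; this is an illustration of the approach, not the exact calculation used to produce the estimated points.

/* Sketch of the adjustment described above: estimate a border-routing
 * cycle count from a full-routing run by rescaling only the sendrecv
 * phase.  hop_ratio is the ratio of border to full macroblock-hops for
 * the configuration being estimated (for example, 6001/14061 for
 * 720x480 on 64 tiles, from Table 7.6).  Illustrative only. */
static double estimate_border_cycles(double total_full_cycles,
                                     double sendrecv_full_cycles,
                                     double hop_ratio)
{
    return total_full_cycles - sendrecv_full_cycles
         + sendrecv_full_cycles * hop_ratio;
}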
The advantages of adding tiles to the 352x240 case will diminish quickly over 64 tiles, largely because there just is not enough granularity in the partitioning at this point. The 720x480 case, on the other hand, still has the potential to scale well up to 64 tiles. Figure 8-7 shows that even with the generality of full routing, we can achieve greater than real-time performance on DVD quality video. In addition, because it was so central to the goals of this thesis, a decision was made to make the effort to write the border routing code for this case so that the results would be indisputable. The actual encoding of the 30 frame sequence on 64 tiles took the simulator approximately 3 weeks to finish. The results show that a frame rate of 52 frames/sec on 720x480 DVD quality video was achieved when using border routing.
(Note that the 64 tile graphs have "flat spots" corresponding to areas where the partitioning from Table 6.1 is such that there is no way to take advantage of additional tiles.)
Figure 8-5: Speedup and absolute performance for a 720x480 sequence on 16 tiles using hardware.
Figure 8-6: Speedup and absolute performance for a 352x240 sequence on 64 tiles using the simulator.
Figure 8-7: Speedup and absolute performance for a 720x480 sequence on 64 tiles using the simulator.
Encoding Rate (frames/sec)

# Tiles | 352x240   | 640x480 | 720x480
   1    |   4.30    |  1.14   |   1.00
   2    |   8.48    |  2.24   |   1.97
   4    |  16.18    |  4.45   |   3.84
   8    |  30.82    |  8.69   |   7.52
  16    |  58.65    | 16.74   |  14.57
  32    | 103 (est) |         |  30 (est)
  64    | 158 (est) |         |  51.90

Table 8.2: Frame rates for encoding various sequence sizes on various numbers of tiles.
8.3  Performance Across Sample Files
In order to determine whether the content of the scene would significantly affect performance, and to show the flexibility of the encoder in dealing with various frame sizes, we ran a number of test runs using a 640x480 pixel sequence. Figure 8-8 and Table 8.2 both demonstrate that although higher resolution sequences do exhibit better speedup, the speedup on the 640x480 sequence was very close to that of the 720x480 sequence. This conclusion is consistent with other findings in the literature[22].
8.4  Effect of Arbitrary Slice Size on Image Quality
The arbitrary slice size methodology used to break the data dependency in the putpic phase does result in a larger number of smaller slices in a given frame than would result from either the baseline code or a single processor implementation. There is valid reason for concern that this would negatively affect the image quality, since it can defeat the ability of the rate control function to stabilize and deliver the best quality per output bit.
In order to allay these concerns, we ran tests to compare the signal to noise ratio of files encoded with the baseline code against files encoded using our fully parallel version. Table 8.3 shows the results of this, as well as the resulting file sizes. While SNR is not a perfect indication of image quality, a simple visual inspection of the encoded files will also make it clear that the differences are not significant.
Figure 8-8: Comparison of speedups for the three sample sequences.
Code Base       | Median Luminance SNR (dB) | File Size (bytes)
baseline        |           21.0            |      142985
parallel putpic |           20.7            |      143150

Table 8.3: Comparison of SNR and file size for a 352x240 encoding using the baseline code and the parallel putpic code.
Phase      | Hardware Cycles | Simulator Cycles | Correction Factor
memcpy     |      22070673   |      15141855    |       1.46
motion     |     418383200   |     412392870    |       1.01
predict    |      21111079   |      17813947    |       1.19
transform  |      60454453   |      55223235    |       1.09
putpic     |     187120927   |     185265048    |       1.01
iquant     |      32626213   |      30283710    |       1.08
itransform |      65252426   |      59676722    |       1.09
sendrecv   |     152575525   |     114899957    |       1.33

Table 8.4: Calculating simulation correction factors for each phase of the MPEG algorithm.
8.5  Comparing Hardware and Simulated Results
There was a noticeable difference between timings taken on the actual Raw hardware
and timings taken using the simulator, even when all other factors were identical.
These differences were, unfortunately, most acute in the two phases that most account
for the non-linearity in our speedup curves, namely memcpy and sendrecv. This is
most likely due to an inaccuracy in the modeling of memory accesses by the simulator.
Since there is currently no hardware testbed for more than 16 tiles, we decided that when comparing data over 16 tiles, all data should be taken from the simulator.
In order to quantify the differences, it was necessary to compare the actual cycle
times for the various phases on identical 16 tile runs. This comparison is shown in
Table 8.4, with the correction factors for each phase.
Using this information, it is possible to apply the correction factors to the 64 tile test runs. Figure 8-9 adds a dotted line showing our projection of 48.3 frames/sec for the actual performance of the MPEG encoder on 64 tile Raw hardware. Even with this correction, we are well within the safety zone of real-time performance on 64 tiles, and it is almost certain that a 40 tile Raw machine running at 425MHz would be able to encode DVD quality video in real-time without any further improvements to the code.
(This assumes that the correction factor between the 16 tile simulator and 16 tile hardware is comparable to the factor between the 64 tile simulator and 64 tile hardware.)
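The correction itself is a straightforward calculation; the sketch below shows one way it could be applied, assuming (as just noted) that the 16 tile factors carry over to 64 tiles. The phase ordering and the cycle counts passed in would come from the 64 tile simulator runs; this is an illustration, not the exact procedure used to produce the 48.3 frames/sec projection.

/* Apply the per-phase correction factors of Table 8.4 to 64 tile
 * simulator cycle counts to estimate the hardware frame rate.
 * Sketch only; assumes the 16 tile factors carry over to 64 tiles. */
#define NPHASES 8

static double estimated_fps(const long long sim_cycles[NPHASES],
                            const double correction[NPHASES],
                            int frames, double clock_hz)
{
    double total_cycles = 0.0;
    for (int p = 0; p < NPHASES; p++)
        total_cycles += (double)sim_cycles[p] * correction[p];
    return (double)frames * clock_hz / total_cycles;  /* e.g. clock_hz = 425e6 */
}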
Figure 8-9: Comparison of a 720x480 encoding on the simulator and the hardware.
Chapter 9
Conclusion
This thesis showed that a general purpose microprocessor such as Raw can be effectively used to perform real-time on-line encoding of MPEG video. Importantly, it also showed that this can be achieved using a public domain code base written in C as the starting point, and with minimal consideration by the programmer of the underlying architecture of the hardware. The initial stated goals of a parallel, well-partitioned and scalable implementation were achieved and demonstrated with extensive testing on both the Raw simulator and the Raw prototype hardware.
9.1  Further Work
Most of the modifications to the MSSG code base that were made for this thesis were done for the purpose of extracting the most parallelism possible out of the algorithm on the Raw architecture. Apart from Hank's effort to speed up the motion estimation phase, little was done to improve the absolute speed of the code. In particular, large sections of unused code remain simply because it was easier and safer to leave them in. There are likely large absolute performance gains to be had by simply improving the baseline code, and these gains would likely have no negative impact on scalability.
It was also shown that a more thoughtful allocation of the regions of the image to the tiles could reap large performance gains in the communication phases. A more "square" mapping that explicitly takes into account the physical configuration of the allocated tiles would materially improve the performance of any implementation using 64 or more tiles. Coming up with a way of allocating data that was provably "best fit" would be an interesting goal.
Lastly, some thought needs to be given to how this code could truly be made to co-exist with other code running on a Raw microprocessor. In the interest of convenience, the implementation described in this thesis used the switch processors in idle tiles to route between active tiles. If those other tiles were running other tasks, this would not be generically possible. As such, a future implementation should modify the routes to ensure that only the switches in active tiles are used for routing.
Bibliography
[1] Ishfaq Ahmad, Shahriar M. Akramullah, Ming L. Liou, and Muhammad Kafil. A scalable off-line MPEG-2 video encoding scheme using a multiprocessor system. Parallel Computing, 27(6):823-846, 2001.

[2] Jung Ho Ahn, William J. Dally, Brucek Khailany, Ujval J. Kapasi, and Abhishek Das. Evaluating the Imagine Stream Architecture. SIGARCH Computer Architecture News, 32(2):14, 2004.

[3] Shahriar M. Akramullah, Ishfaq Ahmad, and Ming L. Liou. A data-parallel approach for real-time MPEG-2 video encoding. Journal of Parallel and Distributed Computing, 30(2):129-146, 1995.

[4] Shahriar M. Akramullah, Ishfaq Ahmad, and Ming L. Liou. Parallel MPEG-2 encoder on ATM and Ethernet-connected workstations. Lecture Notes in Computer Science, 1557:572-574, 1999.

[5] Ismail Assayad, Philippe Gerner, Sergio Yovine, and Valerie Bertin. Modelling, analysis and parallel implementation of an on-line video encoder. In DFMA '05: Proceedings of the First International Conference on Distributed Frameworks for Multimedia Applications (DFMA '05), pages 295-302, Washington, DC, USA, 2005. IEEE Computer Society.

[6] BernClare Multimedia Inc. DHL Commercial. http://www.stream-video.com/digitaldemo.htm, 2005.

[7] Vasudev Bhaskaran and Konstantinos Konstantinides. Image and Video Compression Standards. Kluwer Academic Publishers, Norwell, MA, 1997.

[8] S. Bozoki, S.J.P. Westen, R.L. Lagendijk, and J. Biemond. Parallel algorithms for MPEG video compression with PVM. In International Conference HPCN Challenges in Telecomp and Telecom, August 1996.

[9] International Organization for Standardization. Information technology - Generic coding of moving pictures and associated audio information: Video. ISO/IEC 13818-2:2000(E), Geneva, 2000.

[10] Kevin L. Gong and Lawrence A. Rowe. Parallel MPEG-1 video encoding. In Proceedings of the International Picture Coding Symposium (PCS'93), 1994.

[11] Eiji Iwata and Kunle Olukotun. Exploiting coarse-grain parallelism in the MPEG-2 algorithm. Technical Report CSL-TR-98-771, Stanford University Computer Systems Laboratory, Stanford, CA, September 1998.

[12] Josh Cates. Chips Tricks. http://www.cates-online.com/chiptricks.cfm, 2005.

[13] Brucek Khailany, William J. Dally, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles, Andrew Chang, and Scott Rixner. Imagine: Media Processing with Streams. IEEE Micro, 21(2):35-46, March/April 2001.

[14] Joao Paulo Kitajima, Denilson Barbosa, and Wagner Meira Jr. Parallelizing MPEG video encoding using multiprocessors. In SIBGRAPI '99: Proceedings of the XII Brazilian Symposium on Computer Graphics and Image Processing, pages 215-222, Washington, DC, USA, 1999. IEEE Computer Society.

[15] Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, and Hironori Kasahara. Parallelization of MPEG-2 Video Encoder for Parallel and Distributed Computing Systems. In Proceedings of the 38th Midwest Symposium on Circuits and Systems, August 1996.

[16] Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, and Hironori Kasahara. Parallel Processing using Data Localization for MPEG2 Encoding on OSCAR Chip Multiprocessor. In Proceedings of the International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'04), January 2004.

[17] Jeffrey Moore, William Lee, Scott Dawson, and Brian Smith. Optimal Parallel MPEG Encoding. Technical report, Ithaca, NY, USA, 1996.

[18] MPEG Software Simulation Group. mpeg2encode/mpeg2decode version 1.2. http://www.mpeg.org/MPEG/MSSG/, 1996.

[19] Jongho Nang and Junwha Kim. An effective parallelizing scheme of MPEG-1 video encoding on Ethernet-connected workstations. In APDC '97: Proceedings of the 1997 Advances in Parallel and Distributed Computing Conference (APDC '97), page 4, Washington, DC, USA, 1997. IEEE Computer Society.

[20] Ke Ning, Gabby Yi, and Rick Gentile. Single-Chip Dual-Core Embedded Programming Models for Multimedia Applications. ECN, February 2005.

[21] Ke Shen and Edward J. Delp. A parallel implementation of an MPEG encoder: Faster than real-time! In Proceedings of the SPIE Conference on Digital Video Compression: Algorithms and Technologies, pages 407-418, February 1995.

[22] Ke Shen and Edward J. Delp. A spatial-temporal parallel approach for real-time MPEG video compression. In ICPP, Vol. 2, pages 100-107, 1996.

[23] Michael B. Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro, 22(2):25-36, March/April 2002.

[24] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture, page 2, Washington, DC, USA, 2004. IEEE Computer Society.

[25] Y. Yu and D. Anastassiou. Software implementation of MPEG-2 video encoding using socket programming in LAN. In Proceedings of the SPIE Conference on Digital Video Compression: Algorithms and Technologies, pages 229-240, February 1994.