Intel’s MMX
Dr. Richard Enbody
CSE 820
Why MMX?
Make the Common Case Fast
• Multimedia and communication
applications consume significant
computing resources.
• Providing specific hardware support
for them makes sense.
Michigan State University
Computer Science and Engineering
Goals
• Accelerate multimedia and
communications applications.
• Maintain full compatibility with existing
operating systems and applications.
• Exploit inherent parallelism in multimedia
and communication algorithms.
• Include new instructions and data types
to improve performance.
First Step: examine code
• Examined a wide range of applications:
graphics, MPEG video, music
synthesis, speech compression, speech
recognition, image processing, games,
video conferencing.
• Identified and analyzed the most
compute-intensive routines
Common Characteristics
• Small integer data types:
e.g. 8-bit pixels, 16-bit audio samples
• Small, highly repetitive loops
• Frequent multiply-and-accumulate
• Compute-intensive algorithms
• Highly parallel operations
MMX Technology
A set of basic, general purpose integer
instructions:
• Single Instruction, Multiple Data (SIMD)
• 57 new instructions
• Eight 64-bit wide MMX registers
• Four new data types
Data Types
• Packed byte: eight 8-bit elements
• Packed word: four 16-bit elements
• Packed doubleword: two 32-bit elements
• Quadword: one 64-bit element
Example
• Pixels are generally 8-bit integers. Pack
eight pixels into a 64-bit MMX register.
• An MMX instruction takes all eight of the
pixels at once from the MMX register,
performs the arithmetic or logical
operation on all eight elements in
parallel, and writes the result into an
MMX register.
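The packing idea above can be sketched in Python as a scalar model (this is not real MMX code). The helper names pack_pixels and unpack_pixels are hypothetical, and placing element 0 in the low byte is an assumption of this sketch. A single bitwise AND on the 64-bit value touches all eight pixels at once, which is the effect an MMX logical instruction has on a packed register.

```python
def pack_pixels(pixels):
    """Pack eight 8-bit pixels into one 64-bit integer (element 0 in the low byte)."""
    if len(pixels) != 8:
        raise ValueError("expected exactly eight pixels")
    return int.from_bytes(bytes(pixels), "little")


def unpack_pixels(register):
    """Split a 64-bit integer back into eight 8-bit pixels."""
    return list(register.to_bytes(8, "little"))


pixels = [0x10, 0x21, 0x32, 0x43, 0x54, 0x65, 0x76, 0x87]
register = pack_pixels(pixels)

# One logical operation on the 64-bit value acts on all eight pixels:
# here the low four bits of every pixel are cleared.
high_nibbles = pack_pixels([0xF0] * 8)
result = unpack_pixels(register & high_nibbles)
print([hex(p) for p in result])
```

A logical operation works this way because no bit depends on its neighbours; packed arithmetic additionally has to stop carries at element boundaries.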
Compatibility
• No new exceptions or states are added.
• MMX registers alias the existing FP
registers: when an MMX instruction
writes a register, the exponent field of
the corresponding floating-point register
(bits 64-78) and the sign bit (bit 79) are
set to ones (1's), making the value in the
register a NaN (Not a Number) or infinity
when viewed as a floating-point value.
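The aliasing rule above can be modelled with plain integers. This is only a sketch of the bit layout the slide describes: the function names mmx_write and fp_fields are hypothetical, and the 80-bit register is represented as a Python integer rather than real hardware state.

```python
EXPONENT_ALL_ONES = 0x7FFF  # 15-bit exponent field, bits 64-78


def mmx_write(mmx_value):
    """Model an 80-bit FP register after an MMX write.

    Bits 0-63 hold the MMX data; bits 64-79 (exponent and sign) are all ones.
    """
    if not 0 <= mmx_value < (1 << 64):
        raise ValueError("MMX value must fit in 64 bits")
    return (0xFFFF << 64) | mmx_value


def fp_fields(register80):
    """Return (sign, exponent, significand) of the modelled 80-bit register."""
    sign = (register80 >> 79) & 1
    exponent = (register80 >> 64) & 0x7FFF
    significand = register80 & ((1 << 64) - 1)
    return sign, exponent, significand


sign, exponent, significand = fp_fields(mmx_write(0x0123456789ABCDEF))
# An all-ones exponent is the encoding reserved for NaN and infinity.
print(sign, hex(exponent), hex(significand))
```

Because the exponent is all ones, floating-point code that reads such a register by mistake sees a NaN or infinity rather than a plausible number.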
57 Instructions
• Basic arithmetic: add, subtract, multiply,
arithmetic shift and multiply-add
• Comparison
• Conversion: pack & unpack
• Logical
• Shift
• Move: register-to-register
• Load/Store: 64-bit and 32-bit
Packed Add Word
with wrap around
• Each addition is independent
• Rightmost element overflows and wraps around
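As a rough illustration of the wrap-around add, the following Python sketch models four 16-bit lanes inside a 64-bit integer. The names split_words, join_words and paddw are hypothetical helpers for this sketch, and the input values are made up for the example.

```python
def split_words(register):
    """Split a 64-bit integer into four 16-bit words (element 0 is the low word)."""
    return [(register >> (16 * i)) & 0xFFFF for i in range(4)]


def join_words(words):
    """Combine four 16-bit words into one 64-bit integer."""
    register = 0
    for i, word in enumerate(words):
        register |= (word & 0xFFFF) << (16 * i)
    return register


def paddw(a, b):
    """Packed add of four 16-bit words with wrap-around.

    Each lane is added on its own and truncated to 16 bits,
    so a carry never spills into the neighbouring lane.
    """
    sums = [(x + y) & 0xFFFF for x, y in zip(split_words(a), split_words(b))]
    return join_words(sums)


a = join_words([0xFFFF, 1, 2, 3])
b = join_words([0x8000, 10, 20, 30])
# The low lane overflows: 0xFFFF + 0x8000 = 0x17FFF, which wraps to 0x7FFF.
print([hex(w) for w in split_words(paddw(a, b))])
```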
Saturation
• Saturation: if addition results in overflow
or underflow, the result is clamped to the
largest or smallest value representable.
• This is important for pixel calculations
where this would prevent a wrap-around
add from causing a black pixel to
suddenly turn white
No Mode
There is no "saturation mode bit":
a new mode bit would require
a change to the operating system.
Separate instructions are used
to generate wrap-around
and saturating results.
Packed Add Word
with unsigned saturation
• Each addition is independent
• Rightmost element saturates
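A minimal Python model of the unsigned saturating add, for comparison with the wrap-around version. The function name paddusw is borrowed from the MMX mnemonic PADDUSW, but the code is only a scalar sketch; the operand values are invented for the example.

```python
def paddusw(a, b):
    """Packed add of four unsigned 16-bit words with saturation.

    Each lane is clamped to 0xFFFF instead of wrapping around.
    """
    result = 0
    for lane in range(4):
        shift = 16 * lane
        x = (a >> shift) & 0xFFFF
        y = (b >> shift) & 0xFFFF
        result |= min(x + y, 0xFFFF) << shift
    return result


a = 0x0003_0002_0001_FFFF
b = 0x001E_0014_000A_8000
# The low lane would overflow (0xFFFF + 0x8000), so it is clamped to 0xFFFF.
# The other three lanes add normally.
print(hex(paddusw(a, b)))
```

With the same inputs, a wrap-around add would leave 0x7FFF in the low lane; for pixel data that is the difference between staying at full brightness and dropping to a mid-range value.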
Multiply-Accumulate
Multiply-accumulate operations are
fundamental to many signal processing
algorithms, such as vector dot products,
matrix multiplies, FIR and IIR filters,
FFTs and DCTs.
Packed Multiply-Add
Multiply four pairs of signed 16-bit words, generating four 32-bit products.
Add the 2 products on the left for one 32-bit result and
the 2 products on the right for the other 32-bit result.
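The multiply-add step can be sketched in Python using lists of four signed 16-bit values instead of a packed register. The function name pmaddwd follows the MMX mnemonic PMADDWD, to_signed32 is a hypothetical helper, and the list representation is an assumption made to keep the example short.

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def pmaddwd(a, b):
    """Model of packed multiply-add on four signed 16-bit words.

    Returns two 32-bit results: [a0*b0 + a1*b1, a2*b2 + a3*b3].
    """
    for word in list(a) + list(b):
        if not -32768 <= word <= 32767:
            raise ValueError("inputs must be signed 16-bit values")
    products = [x * y for x, y in zip(a, b)]  # four 32-bit products
    return [to_signed32(products[0] + products[1]),
            to_signed32(products[2] + products[3])]


print(pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))
```

Only one input combination can overflow a 32-bit lane: all four words of a pair equal to -32768, where the sum 2^31 wraps to -2^31.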
Packed Parallel Compare
• No new condition code flags
• No existing IA condition code flags are
affected by this instruction.
• Result can be used as a mask to select
elements from different inputs using a
logical operation, eliminating branches.
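To illustrate how a compare result serves as a mask, here is a Python sketch that computes an element-wise maximum of four signed 16-bit lanes. The name pcmpgtw follows the MMX mnemonic for packed compare greater-than on words; lanes, signed16 and packed_max are hypothetical helpers. The emulation of the compare uses an ordinary `if`, but the selection step uses only AND, NOT and OR.

```python
MASK64 = (1 << 64) - 1


def lanes(register):
    """Split a 64-bit integer into four 16-bit lanes (lane 0 is the low word)."""
    return [(register >> (16 * i)) & 0xFFFF for i in range(4)]


def signed16(word):
    """Interpret a 16-bit pattern as a signed value."""
    return word - 0x10000 if word & 0x8000 else word


def pcmpgtw(a, b):
    """Per lane: all ones if a > b (signed), otherwise all zeros."""
    mask = 0
    for i, (x, y) in enumerate(zip(lanes(a), lanes(b))):
        if signed16(x) > signed16(y):
            mask |= 0xFFFF << (16 * i)
    return mask


def packed_max(a, b):
    """Select the larger element of each lane using only logical operations."""
    mask = pcmpgtw(a, b)
    return (a & mask) | (b & ~mask & MASK64)


a = 0x0005_FFFF_0003_0001  # lanes (low to high): 1, 3, -1, 5
b = 0x0004_0000_0003_0002  # lanes (low to high): 2, 3, 0, 4
print(hex(pcmpgtw(a, b)), hex(packed_max(a, b)))
```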
Pack/Unpack
• Important when an algorithm needs
higher precision in its intermediate
calculations, as in image filtering.
• For example, image filtering involves a
set of intermediate multiply operations
between filter coefficients and a set of
adjacent image pixels, accumulating all
the values together.
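The unpack, compute, pack pattern can be sketched as follows. The helper names mirror the MMX mnemonics PUNPCKLBW, PUNPCKHBW and PACKUSWB, but this is a scalar Python model; scale_words and scale_pixels are hypothetical, and multiplying every pixel by a constant stands in for a real filter.

```python
def punpcklbw_zero(register):
    """Zero-extend the low four bytes into four 16-bit words."""
    out = 0
    for i in range(4):
        out |= ((register >> (8 * i)) & 0xFF) << (16 * i)
    return out


def punpckhbw_zero(register):
    """Zero-extend the high four bytes into four 16-bit words."""
    return punpcklbw_zero(register >> 32)


def packuswb(low, high):
    """Pack eight signed 16-bit words into unsigned bytes with saturation."""
    out = 0
    for i in range(8):
        source = low if i < 4 else high
        word = (source >> (16 * (i % 4))) & 0xFFFF
        value = word - 0x10000 if word & 0x8000 else word
        out |= max(0, min(255, value)) << (8 * i)
    return out


def scale_words(register, factor):
    """Multiply each 16-bit lane by factor, keeping the low 16 bits."""
    out = 0
    for i in range(4):
        word = (register >> (16 * i)) & 0xFFFF
        out |= ((word * factor) & 0xFFFF) << (16 * i)
    return out


def scale_pixels(register, factor):
    """Widen to 16 bits, multiply, then narrow back to 8 bits with clamping."""
    low = scale_words(punpcklbw_zero(register), factor)
    high = scale_words(punpckhbw_zero(register), factor)
    return packuswb(low, high)


pixels = int.from_bytes(bytes([10, 100, 127, 128, 200, 255, 0, 1]), "little")
print(list(scale_pixels(pixels, 2).to_bytes(8, "little")))
```

The intermediate values 256, 400 and 510 do not fit in a byte, which is why the multiply is done on 16-bit words and the final pack clamps them to 255.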
Conditional Select
The Chroma Keying example
demonstrates how conditional selection
using the MMX instruction set removes
branch mispredictions, in addition to
performing multiple selection operations
in parallel. Text overlay on a picture or
video background and sprite overlays in
games are other operations that would
benefit from this technique.
Chroma Keying
• Take pixels from the picture with the
woman on a green background.
• A compare instruction builds a mask for
that data. That mask is a sequence of
bytes that are all ones or all zeros.
• The mask now identifies which pixels are
unwanted background and which to keep.
Create Mask
Assume pixels alternate green/not_green
Combine: AND NOT (PANDN), AND (PAND), OR (POR)
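The mask-and-combine sequence can be sketched in Python on eight 8-bit pixels packed into a 64-bit integer. The function names pcmpeqb and pandn follow the MMX mnemonics; chroma_key, to_register and the key colour value GREEN are hypothetical and chosen only for this example.

```python
MASK64 = (1 << 64) - 1
GREEN = 0x1C  # hypothetical 8-bit code for the key colour


def to_register(pixels):
    """Pack eight 8-bit pixels into a 64-bit integer."""
    return int.from_bytes(bytes(pixels), "little")


def pcmpeqb(a, b):
    """Per byte: 0xFF where the bytes are equal, 0x00 otherwise."""
    mask = 0
    for i in range(8):
        if ((a >> (8 * i)) & 0xFF) == ((b >> (8 * i)) & 0xFF):
            mask |= 0xFF << (8 * i)
    return mask


def pandn(dest, src):
    """AND NOT: (NOT dest) AND src."""
    return (~dest & MASK64) & src


def chroma_key(foreground, background):
    """Replace key-coloured foreground pixels with background pixels."""
    mask = pcmpeqb(foreground, to_register([GREEN] * 8))
    kept_foreground = pandn(mask, foreground)  # foreground where not green
    kept_background = mask & background        # background where green
    return kept_foreground | kept_background


# Pixels alternate green / not green, as assumed on the Create Mask slide.
foreground = to_register([GREEN, 5, GREEN, 7, GREEN, 9, GREEN, 11])
background = to_register([100, 101, 102, 103, 104, 105, 106, 107])
print(list(chroma_key(foreground, background).to_bytes(8, "little")))
```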
Branch Removal
Without MMX technology, each pixel is
processed separately and requires a
conditional branch. Using MMX
instructions, eight 8-bit pixels can be
processed in parallel and no conditional
branches are involved.
Vector Dot Product
• The vector dot product is one of the
most basic algorithms used in signal
processing of natural data such as
images, audio, video and sound.
• PMADD does 4 multiplies and 2 adds at
a time. Coupled with PADD, eight
multiply-accumulate operations can be
performed with 2 PMADDs and 2 PADDs.
Vector Dot Product
Assuming precision is sufficient, a dot
product on an 8-element vector can be
completed using 8 MMX instructions:
2 PMADDs, 2 PADDs, 2 shifts (if
needed to fix the precision after the
multiply), and 2 loads for one of the
vectors (the other vector is loaded by the
PMADD instruction, which can have one
of its operands come from memory).
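The instruction sequence described above can be mirrored in a short Python model. Vectors are plain lists of signed 16-bit values, the optional precision shifts are omitted, and the names pmaddwd, paddd, wrap32 and dot8 are hypothetical helpers (the first two follow the MMX mnemonics).

```python
def wrap32(value):
    """Wrap an integer to signed 32-bit range, as a 32-bit lane would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def pmaddwd(a, b):
    """Four 16-bit multiplies, adjacent products summed into two 32-bit lanes."""
    products = [x * y for x, y in zip(a, b)]
    return [wrap32(products[0] + products[1]), wrap32(products[2] + products[3])]


def paddd(a, b):
    """Packed add of two 32-bit lanes with wrap-around."""
    return [wrap32(x + y) for x, y in zip(a, b)]


def dot8(u, v):
    """Dot product of two 8-element vectors using 2 PMADDs and 2 PADDs."""
    first = pmaddwd(u[0:4], v[0:4])    # PMADD on elements 0-3
    second = pmaddwd(u[4:8], v[4:8])   # PMADD on elements 4-7
    partial = paddd(first, second)     # PADD: two partial sums remain
    # Second PADD: add the high lane to the low lane
    # (on hardware, after shifting the high lane down by 32 bits).
    return paddd([partial[0], 0], [partial[1], 0])[0]


u = [1, 2, 3, 4, 5, 6, 7, 8]
v = [8, 7, 6, 5, 4, 3, 2, 1]
print(dot8(u, v))
```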
Compare
            W/O MMX   WITH MMX
Load           16         4
Multiply        8         2
Shift           8         2
Add             7         1
Misc            -         3
Store           1         1
Total          40        13
Compare
• With MMX technology, about one third as
many instructions are needed (13 versus 40).
• Most MMX instructions can be executed
in one clock cycle, so the performance
improvement will be more dramatic than
the simple ratio of instruction counts.
Matrix Multiply
3D games: computations that manipulate
3D objects use 4-by-4 matrices that are
multiplied with 4-element vectors many
times. Each vector has the X, Y, Z and
perspective-corrective information for
each pixel. The 4-by-4 matrix is used to
rotate, scale, translate and update the
perspective-corrective information for
each pixel.
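One way to picture the 4-by-4 transform is as four dot products, each of which fits a single packed multiply-add followed by one add. The Python sketch below assumes signed 16-bit integer coordinates and matrix entries; the names pmaddwd, wrap32 and transform are hypothetical helpers, and the matrix is an invented translation example.

```python
def wrap32(value):
    """Wrap an integer to signed 32-bit range."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def pmaddwd(a, b):
    """Four 16-bit multiplies, adjacent products summed into two 32-bit lanes."""
    products = [x * y for x, y in zip(a, b)]
    return [wrap32(products[0] + products[1]), wrap32(products[2] + products[3])]


def transform(matrix, vector):
    """Multiply a 4-by-4 matrix by a 4-element vector, one row at a time."""
    result = []
    for row in matrix:
        low, high = pmaddwd(row, vector)   # row[0]*x + row[1]*y, row[2]*z + row[3]*w
        result.append(wrap32(low + high))  # add the two halves
    return result


# Translation by (5, 6, 7) applied to the point (1, 2, 3) with w = 1.
translate = [
    [1, 0, 0, 5],
    [0, 1, 0, 6],
    [0, 0, 1, 7],
    [0, 0, 0, 1],
]
print(transform(translate, [1, 2, 3, 1]))
```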
Compare
            W/O MMX   WITH MMX
Load           32         6
Multiply       16         4
Add            12         2
Misc            8        12
Store           4         4
Total          72        28
Matrix Multiply
• MMX required fewer than half the instructions (28 versus 72).
Image Dissolve
Using Alpha Blending
• Dissolve a Swan into a Flower
Result_pixel =
Flower_pixel * (alpha/255) +
Swan_pixel * [1 - (alpha/255)]
• Assume 640x480 resolution
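The blending formula above can be written with integer arithmetic, which is closer to how packed integer code would compute it. This Python sketch processes one row of 8-bit pixels; the function name dissolve is hypothetical, and truncating division by 255 is an assumption about rounding that the slide does not specify.

```python
def dissolve(flower, swan, alpha):
    """Blend two rows of 8-bit pixels.

    result = flower * (alpha / 255) + swan * (1 - alpha / 255),
    computed as (flower * alpha + swan * (255 - alpha)) // 255.
    """
    if not 0 <= alpha <= 255:
        raise ValueError("alpha must be in 0..255")
    if len(flower) != len(swan):
        raise ValueError("rows must have the same length")
    return bytes((f * alpha + s * (255 - alpha)) // 255
                 for f, s in zip(flower, swan))


flower_row = bytes([200, 0, 255, 64])
swan_row = bytes([100, 255, 255, 32])

# alpha = 0 shows only the swan, alpha = 255 only the flower.
for alpha in (0, 128, 255):
    print(alpha, list(dissolve(flower_row, swan_row, alpha)))
```

The products reach 255 * 255 = 65,025, which does not fit in 8 bits; this is why the MMX version of the dissolve needs unpack and pack steps around the multiply.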
Dissolve: Millions of Instructions
            W/O MMX   WITH MMX
Load          470       117
Unpack          -       117
Multiply      470       117
Add           235        58
Pack            -        58
Store         235        58
Total       1,400       525
Dissolve
Nearly 1 billion fewer instructions
for the 640x480 dissolve
(about 1,400 million without MMX
versus 525 million with it)
Conclusion
• MMX appeared in 1997 in Pentium
processors (with a bigger cache).
• According to Intel, an MMX
microprocessor runs a multimedia
application up to 60% faster.
In addition, it runs other applications
about 10% faster.