CS 61C: Great Ideas in Computer
Architecture (Machine Structures)
SIMD I
Instructors:
Randy H. Katz
David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11
Review
• To access cache, Memory Address divided into 3 fields:
Tag, Index, Block Offset
• Cache size is Data + Management (tags, valid, dirty bits)
• Write misses trickier to implement than reads
– Write back vs. Write through
– Write allocate vs. No write allocate
• Cache Performance Equations:
– CPU time = IC × CPI_stall × CC
= IC × (CPI_ideal + Memory-stall cycles) × CC
– AMAT = Time for a hit + Miss rate × Miss penalty
• If you understand caches, you can adapt software to improve
cache performance and thus program performance
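As a quick check of the AMAT equation (numbers chosen for illustration, not from the lecture): with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty,

AMAT = 1 + 0.05 × 20 = 2 cycles

so even a small miss rate can double the average memory access time.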
New-School Machine Structures
(It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search "Katz" (Warehouse Scale Computer)
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0 | A1+B1 | A2+B2 | A3+B3) (today's lecture)
• Hardware descriptions: all gates @ one time
The theme: harness parallelism at every level of the hardware/software stack to achieve high performance.
[Figure: the stack from Software to Hardware: Smart Phone and Warehouse Scale Computer at the top; within a Computer: Core (Cache), Memory, Input/Output; within a Core: Instruction Unit(s) and Functional Unit(s); below: Main Memory and Logic Gates]
Agenda
• Flynn Taxonomy
• Administrivia
• DLP and SIMD
• Technology Break
• Intel SSE
• (Amdahl's Law if time permits)
Alternative Kinds of Parallelism:
The Programming Viewpoint
• Job-level parallelism/process-level parallelism
– Running independent programs on multiple
processors simultaneously
– Example?
• Parallel processing program
– Single program that runs on multiple processors
simultaneously
– Example?
Alternative Kinds of Parallelism:
Hardware vs. Software
• Concurrent software can also run on serial hardware
• Sequential software can also run on parallel hardware
• Focus is on parallel processing software: sequential or
concurrent software running on parallel hardware
Alternative Kinds of Parallelism:
Single Instruction/Single Data Stream
• Single Instruction, Single Data stream (SISD)
– Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines.
[Figure: a single Processing Unit]
Alternative Kinds of Parallelism:
Multiple Instruction/Single Data Stream
• Multiple Instruction, Single Data streams (MISD)
– Computer that exploits multiple instruction streams against a single data stream for data operations that can be naturally parallelized, e.g., certain kinds of array processors.
– No longer commonly encountered; mainly of historical interest only.
Alternative Kinds of Parallelism:
Single Instruction/Multiple Data Stream
• Single Instruction, Multiple Data streams (SIMD)
– Computer that exploits multiple data streams against a single instruction stream for operations that can be naturally parallelized, e.g., SIMD instruction extensions or Graphics Processing Units (GPUs)
Alternative Kinds of Parallelism:
Multiple Instruction/Multiple Data Streams
• Multiple Instruction, Multiple Data streams (MIMD)
– Multiple autonomous processors simultaneously executing different instructions on different data
– MIMD architectures include multicore and Warehouse Scale Computers
– (Discussed after midterm)
Flynn Taxonomy
• In 2011, SIMD and MIMD most common parallel computers
• Most common parallel processing programming style:
Single Program Multiple Data (“SPMD”)
– Single program that runs on all processors of an MIMD
– Cross-processor execution coordination through conditional expressions (thread parallelism, covered after midterm)
• SIMD (aka hardware-level data parallelism): specialized function units for handling lock-step calculations involving arrays
– Scientific computing, signal processing, multimedia (audio/video processing)
Data-Level Parallelism (DLP)
(from 2nd lecture, January 20)
• 2 kinds of DLP
– Lots of data in memory that can be operated on
in parallel (e.g., adding together 2 arrays)
– Lots of data on many disks that can be operated
on in parallel (e.g., searching for documents)
• 2nd lecture (and 1st project) did DLP across
10s of servers and disks using MapReduce
• Today’s lecture (and 3rd project) does Data
Level Parallelism (DLP) in memory
SIMD Architectures
• Data parallelism: executing one operation on
multiple data streams
• Example to provide context:
– Multiplying a coefficient vector by a data vector (e.g., in filtering):
y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
– One instruction is fetched and decoded for the entire operation
– The multiplications are known to be independent
– Pipelining/concurrency in memory access as well
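As a concrete sketch of this example (our code, not from the slide; assumes n is a multiple of 4 and the arrays are 16-byte aligned):

#include <xmmintrin.h>  // SSE intrinsics, single precision

// y[i] = c[i] * x[i], four elements per iteration
void vmul(float *y, const float *c, const float *x, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 vc = _mm_load_ps(&c[i]);           // load c[i..i+3]
        __m128 vx = _mm_load_ps(&x[i]);           // load x[i..i+3]
        _mm_store_ps(&y[i], _mm_mul_ps(vc, vx));  // 4 products per instruction
    }
}

One multiply instruction replaces four, and each load serves four elements.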
“Advanced Digital Media Boost”
• To improve performance, Intel’s SIMD instructions
– Fetch one instruction, do the work of multiple instructions
– MMX (MultiMedia eXtension, Pentium II processor family)
– SSE (Streaming SIMD Extension, Pentium III and beyond)
Example: SIMD Array Processing

for each f in array
    f = sqrt(f)

for each f in array
{
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

for each 4 members in array
{
    load 4 members to the SSE register
    calculate 4 square roots in one operation
    write the result from the register to memory
}
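The 4-at-a-time loop above maps directly onto SSE intrinsics; a minimal sketch (our code; assumes the array length n is a multiple of 4 and 16-byte alignment):

#include <xmmintrin.h>

void sqrt4(float *a, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&a[i]);        // load 4 floats
        _mm_store_ps(&a[i], _mm_sqrt_ps(v));  // 4 square roots in one operation
    }
}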
Administrivia
• Lab #7 posted
• Midterm in 1 week:
– Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle
• Split: A-Lew in 145, Li-Z in 155
– Covers everything through lecture March 3
– Closed book; can bring one sheet of notes, both sides
– Copy of Green Card will be supplied
– No phones, calculators, ...; just bring pencils & eraser
– TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB
• Sent (anonymous) 61C midway survey before
Midterm
Scores on Project 2 Part 2
[Plot: score (max 85) vs. fraction of students]
• Top 25%: ≥ 79 / 85
• Next 50%: ≥ 60, < 79 / 85
Tapia Conference 2011, in beautiful San Francisco (tapiaconference.org/2011)
• Inclusive: all welcome, and it works!
– 82%: reaffirms CS major, will finish degree
– 30% ugrads, 40% grads, 30% ...
• CDC is a joint org. of ACM, IEEE/CS, CRA
• 8 great speakers on Grad School Success, Early Career Success, Resume Preparation
• Workshops, poster BOFs, Banquet and Dance
• San Francisco activities: Alcatraz Tour, Chinatown, bike over Golden Gate Bridge, ...
• Luminaries: Deborah Estrin UCLA, Blaise Aguera y Arcas Microsoft, Alan Eustace Google, Bill Wulf UVA, Irving Wladawsky-Berger IBM, John Kubiatowicz UC Berkeley
• Rising Stars: Hicks Rice, Howard Georgia Tech, Lopez Intel
• General Chair: Dave Patterson
• Encourage grad students to apply to the doctoral consortium by today (3/1)
• Volunteer opportunities for students (work remote or import student)
• If you care, come! If interested in diversity, email Sheila Humphrys (humphrys@EECS.Berkeley.EDU) with name, year, topic interest + 2 to 3 sentences on why you want to go to Tapia
http://tapiaconference.org/2011/participate.html
[Photos: Tapia speakers, organizers, and awardee: Hicks, Kubiatowicz, Eustace, Wladawsky, Howard, Wulf, Lopez, Estrin, Aguera y Arcas, Patterson, Taylor, Tapia, Lanius, Vargas, Perez-Quinones]
Agenda
• Flynn Taxonomy
• Administrivia
• DLP and SIMD
• Technology Break
• Intel SSE
• (Amdahl's Law if time permits)
SSE Instruction Categories
for Multimedia Support
• SSE-2+ supports wider data types to allow
16 x 8-bit and 8 x 16-bit operands
Intel Architecture SSE2+
128-Bit SIMD Data Types
[Figure: a 128-bit XMM register partitioned four ways: 16 8-bit bytes (16 / 128 bits), 8 16-bit words (8 / 128 bits), 4 32-bit doublewords (4 / 128 bits), or 2 64-bit quadwords (2 / 128 bits)]
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
– Single precision FP: Double word (32 bits)
– Double precision FP: Quad word (64 bits)
XMM Registers
• Architecture extended with eight 128-bit data registers: XMM registers
– IA 64-bit address architecture adds eight more (XMM8 – XMM15), for a total of sixteen 128-bit registers
– E.g., the 128-bit packed single-precision floating-point data type (four doublewords) allows four single-precision operations to be performed simultaneously
SSE/SSE2 Floating Point Instructions
xmm: one operand is a 128-bit SSE2 register
mem/xmm: other operand is in memory or an SSE2 register
{SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
{A} 128-bit operand is aligned in memory
{U} 128-bit operand is unaligned in memory
{H} move the high half of the 128-bit operand
{L} move the low half of the 128-bit operand
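Combining the pieces: ADDPS is add, packed single precision; MOVUPD is move, unaligned packed double; MOVHPS moves the high half, packed single.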
Example: Add Two Single Precision FP Vectors
Computation to be performed:

vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;

SSE Instruction Sequence:
movaps: move from mem to XMM register, memory aligned, packed single precision
addps: add from mem to XMM register, packed single precision
movaps: move from XMM register to mem, memory aligned, packed single precision

movaps address-of-v1, %xmm0
// v1.w | v1.z | v1.y | v1.x -> xmm0
addps address-of-v2, %xmm0
// v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
movaps %xmm0, address-of-vec_res
Example: Image Converter
• Converts BMP (bitmap) image to a YUV (color
space) image format:
– Read individual pixels from the BMP image,
convert pixels into YUV format
– Can pack the pixels and operate on a set of pixels with
a single instruction
• E.g., a bitmap image consists of 8-bit monochrome pixels
– Pack these pixel values in a 128-bit register (8 bits × 16 pixels) and operate on 16 values at a time
– Significant performance boost
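A hedged sketch of the packing idea (our example, not the converter itself): brighten 16 monochrome pixels at once using SSE2's saturating byte add; assumes n is a multiple of 16.

#include <emmintrin.h>  // SSE2 intrinsics

void brighten(unsigned char *pix, int n, unsigned char delta) {
    __m128i d = _mm_set1_epi8((char) delta);              // 16 copies of delta
    for (int i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((__m128i *) &pix[i]); // load 16 pixels
        v = _mm_adds_epu8(v, d);                          // 16 saturating adds at once
        _mm_storeu_si128((__m128i *) &pix[i], v);         // store 16 pixels
    }
}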
Example: Image Converter
• FMADDPS: multiply and add packed single precision floating point instruction
• One of the typical operations computed in transformations (e.g., DFT or FFT):
P = ∑_{n=1}^{N} f(n) × x(n)
Example: Image Converter
Floating point numbers f(n) and x(n) in src1 and src2; p in dest;
C implementation for N = 4 (128 bits):

for (int i = 0; i < 4; i++)
    p = p + src1[i] * src2[i];

Regular x86 instructions for the inner loop:

// src1 is on the top of the stack; src1 * src2 -> src1
fmul DWORD PTR _src2$[%esp+148]
// p = ST(1), src1 = ST(0); ST(0)+ST(1) -> ST(1); ST = Stack Top
faddp %ST(0), %ST(1)

(Note: destination on the right in AT&T-syntax x86 assembly)
Number of regular x86 FP instructions executed: 4 × 2 = 8
Example: Image Converter
Floating point numbers f(n) and x(n) in src1 and src2; p in dest;
C implementation for N = 4 (128 bits):

for (int i = 0; i < 4; i++)
    p = p + src1[i] * src2[i];

• SSE2 instructions for the inner loop:

// xmm0 = p, xmm1 = src1[i], xmm2 = src2[i]
mulps %xmm1, %xmm2   // xmm2 * xmm1 -> xmm2
addps %xmm2, %xmm0   // xmm0 + xmm2 -> xmm0

• Number of instructions executed: 2 SSE2 instructions vs. 8 x86
• SSE5 instruction accomplishes the same in one instruction:

fmaddps %xmm0, %xmm1, %xmm2, %xmm0
// xmm2 * xmm1 + xmm0 -> xmm0
// multiply xmm1 by xmm2 packed single,
// then add product packed single to sum in xmm0

• Number of instructions executed: 1 SSE5 instruction vs. 8 x86
Intel SSE Intrinsics
• Intrinsics are C functions and procedures that provide access to assembly language instructions, including SSE instructions
– With intrinsics, can program using these instructions indirectly
– One-to-one correspondence between SSE instructions and intrinsics
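A minimal sketch of the correspondence (our variable names):

#include <emmintrin.h>  // SSE2 intrinsics

double a[2] __attribute__ ((aligned (16))) = {1.0, 2.0};
double b[2] __attribute__ ((aligned (16))) = {3.0, 4.0};
double r[2] __attribute__ ((aligned (16)));

void demo(void) {
    __m128d va = _mm_load_pd(a);          // compiles to MOVAPD
    __m128d vb = _mm_load_pd(b);          // compiles to MOVAPD
    _mm_store_pd(r, _mm_add_pd(va, vb));  // ADDPD, then MOVAPD store
}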
Example SSE Intrinsics
Intrinsics and corresponding SSE instructions:
• Vector data type:
    __m128d
• Load and store operations:
    _mm_load_pd      MOVAPD / aligned, packed double
    _mm_store_pd     MOVAPD / aligned, packed double
    _mm_loadu_pd     MOVUPD / unaligned, packed double
    _mm_storeu_pd    MOVUPD / unaligned, packed double
• Load and broadcast across vector:
    _mm_load1_pd     MOVSD + shuffling/duplicating
• Arithmetic:
    _mm_add_pd       ADDPD / add, packed double
    _mm_mul_pd       MULPD / multiply, packed double
Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
C_{i,j} = (A×B)_{i,j} = ∑_{k=1}^{2} A_{i,k} × B_{k,j}

A1,1 A1,2     B1,1 B1,2     C1,1 = A1,1B1,1 + A1,2B2,1    C1,2 = A1,1B1,2 + A1,2B2,2
A2,1 A2,2  x  B2,1 B2,2  =  C2,1 = A2,1B1,1 + A2,2B2,1    C2,2 = A2,1B1,2 + A2,2B2,2
Example: 2 x 2 Matrix Multiply
• Using the XMM registers
– 64-bit/double precision/two doubles per XMM reg

C1 = [ C1,1 | C2,1 ]
C2 = [ C1,2 | C2,2 ]
A  = [ A1,i | A2,i ]
B1 = [ Bi,1 | Bi,1 ]
B2 = [ Bi,2 | Bi,2 ]

Stored in memory in column order
Example: 2 x 2 Matrix Multiply
• Initialization, i = 1

C1 = [ 0 | 0 ]
C2 = [ 0 | 0 ]
A  = [ A1,1 | A2,1 ]    _mm_load_pd: load 2 doubles into XMM reg; stored in memory in column order
B1 = [ B1,1 | B1,1 ]
B2 = [ B1,2 | B1,2 ]    _mm_load1_pd: loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves)
Example: 2 x 2 Matrix Multiply
• First iteration intermediate result, i = 1

C1 = [ 0 + A1,1B1,1 | 0 + A2,1B1,1 ]
C2 = [ 0 + A1,1B1,2 | 0 + A2,1B1,2 ]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [ A1,1 | A2,1 ]    _mm_load_pd: stored in memory in column order
B1 = [ B1,1 | B1,1 ]
B2 = [ B1,2 | B1,2 ]    _mm_load1_pd: duplicates value in both halves of XMM register
Example: 2 x 2 Matrix Multiply
• First iteration intermediate result, i = 2

C1 = [ 0 + A1,1B1,1 | 0 + A2,1B1,1 ]
C2 = [ 0 + A1,1B1,2 | 0 + A2,1B1,2 ]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [ A1,2 | A2,2 ]    _mm_load_pd: stored in memory in column order
B1 = [ B2,1 | B2,1 ]
B2 = [ B2,2 | B2,2 ]    _mm_load1_pd: duplicates value in both halves of XMM register
Example: 2 x 2 Matrix Multiply
• Second iteration intermediate result, i = 2

C1 = [ C1,1 = A1,1B1,1 + A1,2B2,1 | C2,1 = A2,1B1,1 + A2,2B2,1 ]
C2 = [ C1,2 = A1,1B1,2 + A1,2B2,2 | C2,2 = A2,1B1,2 + A2,2B2,2 ]

c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

A  = [ A1,2 | A2,2 ]    _mm_load_pd: stored in memory in column order
B1 = [ B2,1 | B2,1 ]
B2 = [ B2,2 | B2,2 ]    _mm_load1_pd: duplicates value in both halves of XMM register
Live Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
C_{i,j} = (A×B)_{i,j} = ∑_{k=1}^{2} A_{i,k} × B_{k,j}

A1,1 A1,2     B1,1 B1,2     C1,1 = A1,1B1,1 + A1,2B2,1    C1,2 = A1,1B1,2 + A1,2B2,2
A2,1 A2,2  x  B2,1 B2,2  =  C2,1 = A2,1B1,1 + A2,2B2,1    C2,2 = A2,1B1,2 + A2,2B2,2

1 0     1 3     C1,1 = 1*1 + 0*2 = 1    C1,2 = 1*3 + 0*4 = 3
0 1  x  2 4  =  C2,1 = 0*1 + 1*2 = 2    C2,2 = 0*3 + 1*4 = 4
Example: 2 x 2 Matrix Multiply
(Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [ a | b ]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
    // allocate A,B,C aligned on 16-byte boundaries
    double A[4] __attribute__ ((aligned (16)));
    double B[4] __attribute__ ((aligned (16)));
    double C[4] __attribute__ ((aligned (16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1, c2, a, b1, b2;

    // Initialize A, B, C for example
    /* A = (note column order!)
       1 0
       0 1
    */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
    /* B = (note column order!)
       1 3
       2 4
    */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
    /* C = (note column order!)
       0 0
       0 0
    */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Example: 2 x 2 Matrix Multiply
(Part 2 of 2)

    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C+0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C+1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22]
        */
        a = _mm_load_pd(A+i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21]
        */
        b1 = _mm_load1_pd(B+i+0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22]
        */
        b2 = _mm_load1_pd(B+i+1*lda);

        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
        */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
        */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }

    // store c1,c2 back into C for completion
    _mm_store_pd(C+0*lda, c1);
    _mm_store_pd(C+1*lda, c2);
    // print C
    printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
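To try the program, compile it as ordinary C; on an x86-64 machine something like gcc -O sseMM.c should work (file name ours; 32-bit targets may need -msse2). The next slide shows the inner loop gcc actually emits.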
Inner loop from gcc -O -S

L2: movapd  (%rax,%rsi), %xmm1  // Load aligned A[i,i+1] -> m1
    movddup (%rdx), %xmm0       // Load B[j], duplicate -> m0
    mulpd   %xmm1, %xmm0        // Multiply m0*m1 -> m0
    addpd   %xmm0, %xmm3        // Add m0+m3 -> m3
    movddup 16(%rdx), %xmm0     // Load B[j+1], duplicate -> m0
    mulpd   %xmm0, %xmm1        // Multiply m0*m1 -> m1
    addpd   %xmm1, %xmm2        // Add m1+m2 -> m2
    addq    $16, %rax           // rax+16 -> rax (i+=2)
    addq    $8, %rdx            // rdx+8 -> rdx (j+=1)
    cmpq    $32, %rax           // rax == 32?
    jne     L2                  // jump to L2 if not equal
    movapd  %xmm3, (%rcx)       // store aligned m3 into C[k,k+1]
    movapd  %xmm2, (%rdi)       // store aligned m2 into C[l,l+1]
Performance-Driven ISA Extensions
• Subword parallelism, used primarily for multimedia
applications
– Intel MMX: multimedia extension
• 64-bit registers can hold multiple integer operands
– Intel SSE: Streaming SIMD extension
• 128-bit registers can hold several floating-point operands
• Adding instructions that do more work per cycle
– Shift-add: replace two instructions with one (e.g., multiply by 5)
– Multiply-add: replace two instructions with one (x := c + a × b)
– Multiply-accumulate: reduce round-off error (s := s + a × b)
– Conditional copy: to avoid some branches (e.g., in if-then-else)
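A sketch of the shift-add case (our code): multiplying by 5 becomes one shift plus one add, and compilers routinely do this strength reduction.

int times5(int x) {
    return (x << 2) + x;  // x*5 == x*4 + x: shift-add instead of multiply
}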
Spring 2011 -- Lecture #13
Slide 45
Big Idea: Amdahl's (Heartbreaking) Law
• Speedup due to enhancement E is
Speedup w/ E = Exec time w/o E / Exec time w/ E
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected
Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S ]
Speedup w/ E = 1 / [ (1-F) + F/S ]
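The formula is easy to check mechanically; a tiny helper (our code):

double amdahl(double f, double s) {
    // speedup when fraction f of the work is accelerated by factor s
    return 1.0 / ((1.0 - f) + f / s);
}
// amdahl(0.5, 2.0) == 1.33..., matching the worked example below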
Big Idea: Amdahl's Law
Speedup = 1 / [ (1 - F) + F/S ]
where (1 - F) is the non-speed-up part and F/S is the speed-up part
Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?
Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33
Big Idea: Amdahl's Law
If the portion of the program that can be parallelized is small, then the speedup is limited: the non-parallel portion limits the performance.
Example #1: Amdahl’s Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider an enhancement which runs 20 times faster but
which is only usable 25% of the time
Speedup w/ E = 1/(.75 + .25/20) = 1.31
• What if it's usable only 15% of the time?
Speedup w/ E = 1/(.85 + .15/20) = 1.17
• Amdahl’s Law tells us that to achieve linear speedup with
100 processors, none of the original computation can be
scalar!
• To get a speedup of 90 from 100 processors, the
percentage of the original program that could be scalar
would have to be 0.1% or less
Speedup w/ E = 1/(.001 + .999/100) = 90.99
Example #2: Amdahl’s Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider summing 10 scalar variables and then two 10 by 10 matrices (matrix sum) on 10 processors; only the matrix sum parallelizes, so F = 100/110 ≈ 0.909
Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
• What if there are 100 processors ?
Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
• What if the matrices are 100 by 100 (or 10,010 adds in
total) on 10 processors?
Speedup w/ E = 1/(.001 + .999/10) = 1/0.1009 = 9.9
• What if there are 100 processors ?
Speedup w/ E = 1/(.001 + .999/100) = 1/0.01099 = 91
Strong and Weak Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem
– Strong scaling: speedup achieved on a parallel processor without increasing the size of the problem
– Weak scaling: speedup achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor should do the same amount of work
– Just 1 unit with twice the load of others cuts speedup almost in half (see the worked numbers below)
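To see the load-balancing claim with concrete numbers (ours): with 10 processors and 100 units of work, a balanced split finishes in 100/10 = 10 time units. If one processor carries twice the load of the others (9x + 2x = 100, so x ≈ 9.1), the loaded processor needs 2x ≈ 18.2 time units, cutting the speedup from 10 to about 5.5.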
Review
• Flynn Taxonomy of Parallel Architectures
– SIMD: Single Instruction Multiple Data
– MIMD: Multiple Instruction Multiple Data
– SISD: Single Instruction Single Data
– MISD: Multiple Instruction Single Data (unused)
• Intel SSE SIMD Instructions
– One instruction fetch that operates on multiple operands
simultaneously
– 128/64 bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C programs
through use of intrinsics
– Achieve efficiency beyond that of optimizing compiler