CS 61C: Great Ideas in Computer Architecture (Machine Structures)
SIMD I
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11
6/27/2016 Spring 2011 -- Lecture #13

Review
• To access cache, Memory Address divided into 3 fields: Tag, Index, Block Offset
• Cache size is Data + Management (tags, valid, dirty bits)
• Write misses trickier to implement than reads
– Write back vs. write through
– Write allocate vs. no write allocate
• Cache Performance Equations:
– CPU time = IC × CPIstall × CC = IC × (CPIideal + Memory-stall cycles) × CC
– AMAT = Time for a hit + Miss rate × Miss penalty
• If you understand caches, you can adapt software to improve cache performance and thus program performance

New-School Machine Structures (It’s a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., search “Katz” (Warehouse Scale Computer)
• Parallel Threads: assigned to core, e.g., lookup, ads
• Harness parallelism & achieve high performance
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3) (today’s lecture)
• Hardware descriptions: all gates @ one time
[Diagram: hardware/software hierarchy from Warehouse Scale Computer and Smart Phone down through Computer, Core (Instruction Unit(s), Functional Unit(s)), Cache, Main Memory, Input/Output, to Logic Gates]

Agenda
• Flynn Taxonomy
• Administrivia
• DLP and SIMD
• Technology Break
• Intel SSE
• (Amdahl’s Law if time permits)

Alternative Kinds of Parallelism: The Programming Viewpoint
• Job-level parallelism/process-level parallelism
– Running independent programs on multiple processors simultaneously
– Example?
• Parallel processing program
– Single program that runs on multiple processors simultaneously
– Example?
Alternative Kinds of Parallelism: Hardware vs. Software
• Concurrent software can also run on serial hardware
• Sequential software can also run on parallel hardware
• Focus is on parallel processing software: sequential or concurrent software running on parallel hardware

Alternative Kinds of Parallelism: Single Instruction/Single Data Stream
• Single Instruction, Single Data stream (SISD)
– Sequential computer (one Processing Unit) that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines

Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream
• Multiple Instruction, Single Data streams (MISD)
– Computer that exploits multiple instruction streams against a single data stream for data operations that can be naturally parallelized. For example, certain kinds of array processors.
– No longer commonly encountered, mainly of historical interest only

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream
• Single Instruction, Multiple Data streams (SIMD)
– Computer that exploits multiple data streams against a single instruction stream for operations that can be naturally parallelized, e.g., SIMD instruction extensions or Graphics Processing Unit (GPU)

Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams
• Multiple Instruction, Multiple Data streams (MIMD)
– Multiple autonomous processors simultaneously executing different instructions on different data.
– MIMD architectures include multicore and Warehouse Scale Computers
– (Discuss after midterm)

Flynn Taxonomy
• In 2011, SIMD and MIMD most common parallel computers
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
– Single program that runs on all processors of an MIMD
– Cross-processor execution coordination through conditional expressions (thread parallelism after midterm)
• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
– Scientific computing, signal processing, multimedia (audio/video processing)

Data-Level Parallelism (DLP) (from 2nd lecture, January 20)
• 2 kinds of DLP
– Lots of data in memory that can be operated on in parallel (e.g., adding together 2 arrays)
– Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
• 2nd lecture (and 1st project) did DLP across 10s of servers and disks using MapReduce
• Today’s lecture (and 3rd project) does Data-Level Parallelism (DLP) in memory

SIMD Architectures
• Data parallelism: executing one operation on multiple data streams
• Example to provide context:
– Multiplying a coefficient vector by a data vector (e.g., in filtering): y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
– One instruction is fetched & decoded for the entire operation
– Multiplications are known to be independent
– Pipelining/concurrency in memory access as well

“Advanced Digital Media Boost”
• To improve performance, Intel’s SIMD instructions
– Fetch one instruction, do the work of multiple instructions
– MMX (MultiMedia eXtension, Pentium II processor family)
– SSE (Streaming SIMD Extension, Pentium III and beyond)

Example: SIMD Array Processing
for each f in array
    f = sqrt(f)
for each f in array {
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}
for each 4 members in array {
    load 4 members to the SSE register
    calculate 4 square roots in one operation
    write the result from the register to memory
}

Administrivia
• Lab #7 posted
• Midterm in 1 week:
– Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle
• Split: A-Lew in 145, Li-Z in 155
– Covers everything through lecture March 3
– Closed book, can bring one sheet of notes, both sides
– Copy of Green card will be supplied
– No phones, calculators, …; just bring pencils & eraser
– TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB
• Sent (anonymous) 61C midway survey before Midterm

Scores on Project 2 Part 2
[Score-distribution plot: score (max 85) vs. fraction of students]
• Top 25%: ≥79 / 85
• Next 50%: ≥60, <79 / 85

Tapia Conference (CDC is a joint org. of ACM, IEEE/CS, CRA)
• Inclusive: all welcome, it works!
– 82%: reaffirms CS major, will finish degree
– 30% ugrads, 40% grads, 30% …
• In beautiful San Francisco: tapiaconference.org/2011
• If you care, come!
• Luminaries: Deborah Estrin UCLA, Blaise Aguera y Arcas Microsoft, Alan Eustace Google, Bill Wulf UVA, Irving Wladawsky-Berger IBM, John Kubiatowicz UC Berkeley
• Rising Stars: Hicks Rice, Howard Georgia Tech, Lopez Intel
• General Chair: Dave Patterson
• Workshops on Grad School Success, Early Career Success, Resume Preparation; BOFs; Banquet and Dance
• San Francisco Activity: Alcatraz Tour, Chinatown, Bike over Golden Gate Bridge, …
• Volunteer, poster, and student opportunities
• Encourage grad students to apply to doctoral consortium (3/1)
• If interested in diversity, email Sheila Humphrys by today with name, year, topic interest + 2 to 3 sentences on why you want to go to Tapia: humphrys@EECS.Berkeley.EDU
http://tapiaconference.org/2011/participate.html
[Photo collage: organizers and speakers, including Hicks, Kubiatowicz, Eustace, Wladawsky-Berger, Howard, Wulf, Lopez, Estrin, Aguera y Arcas, Patterson, Taylor, Tapia (Awardee), Lanius, Vargas, Perez-Quinones]

Agenda
• Flynn Taxonomy
• Administrivia
• DLP and SIMD
• Technology Break
• Intel SSE
• (Amdahl’s Law if time permits)

SSE Instruction Categories for Multimedia Support
• SSE-2+ supports wider data types to allow 16 x 8-bit and 8 x 16-bit operands

Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Diagram: a 128-bit register partitioned as 16 x 8-bit bytes, 8 x 16-bit words, 4 x 32-bit doublewords, or 2 x 64-bit quadwords]
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
– Single precision FP: double word (32 bits)
– Double precision FP: quad word (64 bits)

XMM Registers
• Architecture extended with eight 128-bit data registers: XMM registers
– IA 64-bit address architecture: 16 such registers available (XMM8 – XMM15 added)
– E.g., 128-bit packed single-precision floating-point data type (doublewords), allows four single-precision operations to be performed simultaneously

SSE/SSE2 Floating Point Instructions
xmm: one operand is a 128-bit SSE2 register
mem/xmm: other operand is in memory or an SSE2 register
{SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
{PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
{SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
{PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
{A} 128-bit operand is aligned in memory
{U} 128-bit operand is unaligned in memory
{H} move the high half of the 128-bit operand
{L} move the low half of the 128-bit operand

Example: Add Two Single Precision FP Vectors
Computation to be performed:
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;
SSE Instruction Sequence:
movaps: move from mem to XMM register, memory aligned, packed single precision
addps: add from mem to XMM register, packed single precision
movaps: move from XMM register to mem, memory aligned, packed single precision
movaps address-of-v1, %xmm0
// v1.w | v1.z | v1.y | v1.x -> xmm0
addps address-of-v2, %xmm0
// v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
movaps %xmm0, address-of-vec_res

Example: Image Converter
• Converts BMP (bitmap) image to a YUV (color space) image format:
– Read individual pixels from the BMP image, convert pixels into YUV format
– Can pack the pixels and operate on a set of pixels with a single instruction
• E.g., bitmap image consists of 8-bit monochrome pixels
– Pack these pixel values in a 128-bit register (8 bits x 16 pixels), can operate on 16 values at a time
– Significant performance boost

Example: Image Converter
• FMADDPS – Multiply and add packed single precision floating point instruction
• One of the typical operations computed in transformations (e.g., DFT or FFT):
P = Σ (n = 1 to N) f(n) × x(n)

Example: Image Converter
Floating point numbers f(n) and x(n) in src1 and src2; p in dest; C implementation for N = 4 (128 bits):
for (int i = 0; i < 4; i++)
    p = p + src1[i] * src2[i];
Regular x86 instructions for the inner loop:
// src1 is on the top of the stack; src1 * src2 -> src1
fmul DWORD PTR _src2$[%esp+148]
// p = ST(1), src1 = ST(0); ST(0)+ST(1) -> ST(1); ST = Stack Top
faddp %ST(0), %ST(1)
(Note: Destination on the right in x86 assembly)
Number of regular x86 Fl. Pt. instructions executed: 4 * 2 = 8

Example: Image Converter
Same C inner loop, using SSE2 instructions:
// xmm0 = p, xmm1 = src1[i], xmm2 = src2[i]
mulps %xmm1, %xmm2   // xmm2 * xmm1 -> xmm2
addps %xmm2, %xmm0   // xmm0 + xmm2 -> xmm0
• Number of instructions executed: 2 SSE2 instructions vs. 8 x86
• SSE5 instruction accomplishes the same in one instruction:
fmaddps %xmm0, %xmm1, %xmm2, %xmm0
// xmm2 * xmm1 + xmm0 -> xmm0
// multiply xmm1 x xmm2 paired single, then add product paired single to sum in xmm0
• Number of instructions executed: 1 SSE5 instruction vs.
8 x86

Intel SSE Intrinsics
• Intrinsics are C functions and procedures that map to assembly language instructions, including SSE instructions
– With intrinsics, can program using these instructions indirectly
– One-to-one correspondence between SSE instructions and intrinsics

Example SSE Intrinsics
Intrinsics and corresponding SSE instructions:
• Vector data type: __m128d
• Load and store operations:
_mm_load_pd    MOVAPD/aligned, packed double
_mm_store_pd   MOVAPD/aligned, packed double
_mm_loadu_pd   MOVUPD/unaligned, packed double
_mm_storeu_pd  MOVUPD/unaligned, packed double
• Load and broadcast across vector:
_mm_load1_pd   MOVSD + shuffling/duplicating
• Arithmetic:
_mm_add_pd     ADDPD/add, packed double
_mm_mul_pd     MULPD/multiply, packed double

Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
Ci,j = (A×B)i,j = Σ (k = 1 to 2) Ai,k × Bk,j

A1,1 A1,2     B1,1 B1,2     C1,1 = A1,1B1,1 + A1,2B2,1   C1,2 = A1,1B1,2 + A1,2B2,2
A2,1 A2,2  x  B2,1 B2,2  =  C2,1 = A2,1B1,1 + A2,2B2,1   C2,2 = A2,1B1,2 + A2,2B2,2

Example: 2 x 2 Matrix Multiply
• Using the XMM registers
– 64-bit/double precision/two doubles per XMM reg
C1 = [C1,1 | C2,1]    C2 = [C1,2 | C2,2]    (stored in memory in column order)
A  = [A1,i | A2,i]
B1 = [Bi,1 | Bi,1]    B2 = [Bi,2 | Bi,2]

Example: 2 x 2 Matrix Multiply
• Initialization (i = 1)
C1 = [0 | 0]    C2 = [0 | 0]
A  = [A1,1 | A2,1]    _mm_load_pd: load 2 doubles into XMM reg (stored in memory in column order)
B1 = [B1,1 | B1,1]    B2 = [B1,2 | B1,2]    _mm_load1_pd: SSE instruction that loads a double word and stores it in the high and low double words of the XMM register (duplicates value in both halves of XMM)

Example: 2 x 2 Matrix Multiply
• First iteration intermediate result (i = 1)
C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1]    C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]
c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
SSE instructions first do parallel multiplies and then parallel adds in XMM registers

Example: 2 x 2 Matrix Multiply
• Second iteration intermediate result (i = 2)
A  = [A1,2 | A2,2]    B1 = [B2,1 | B2,1]    B2 = [B2,2 | B2,2]
c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
C1 = [A1,1B1,1 + A1,2B2,1 | A2,1B1,1 + A2,2B2,1] = [C1,1 | C2,1]
C2 = [A1,1B1,2 + A1,2B2,2 | A2,1B1,2 + A2,2B2,2] = [C1,2 | C2,2]

Live Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
Ci,j = (A×B)i,j = Σ (k = 1 to 2) Ai,k × Bk,j

1 0     1 3     C1,1 = 1*1 + 0*2 = 1    C1,2 = 1*3 + 0*4 = 3
0 1  x  2 4  =  C2,1 = 0*1 + 1*2 = 2    C2,2 = 0*3 + 1*4 = 4

Example: 2 x 2 Matrix Multiply (Part 1 of 2)
#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [ a | b ]
// where v1 is a variable of type __m128d and a, b are doubles
int main(void) {
    // allocate A,B,C aligned on 16-byte boundaries
    double A[4] __attribute__ ((aligned (16)));
    double B[4] __attribute__ ((aligned (16)));
    double C[4] __attribute__ ((aligned (16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1,c2,a,b1,b2;

    // Initialize A, B, C for example
    /* A = (note column order!)
       1 0
       0 1 */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
    /* B = (note column order!)
       1 3
       2 4 */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
    /* C = (note column order!)
       0 0
       0 0 */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Example: 2 x 2 Matrix Multiply (Part 2 of 2)
    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C+0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C+1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22] */
        a = _mm_load_pd(A+i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21] */
        b1 = _mm_load1_pd(B+i+0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22] */
        b2 = _mm_load1_pd(B+i+1*lda);

        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21] */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a,b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22] */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a,b2));
    }

    // store c1,c2 back into C for completion
    _mm_store_pd(C+0*lda,c1);
    _mm_store_pd(C+1*lda,c2);

    // print C
    printf("%g,%g\n%g,%g\n",C[0],C[2],C[1],C[3]);
    return 0;
}

Inner loop from gcc –O -S
L2: movapd (%rax,%rsi), %xmm1   // Load aligned A[i,i+1] -> m1
    movddup (%rdx), %xmm0       // Load B[j], duplicate -> m0
    mulpd %xmm1, %xmm0          // Multiply m0*m1 -> m0
    addpd %xmm0, %xmm3          // Add m0+m3 -> m3
    movddup 16(%rdx), %xmm0     // Load B[j+1], duplicate -> m0
    mulpd %xmm0, %xmm1          // Multiply m0*m1 -> m1
    addpd %xmm1, %xmm2          // Add m1+m2 -> m2
    addq $16, %rax              // rax+16 -> rax (i+=2)
    addq $8, %rdx               // rdx+8 -> rdx (j+=1)
    cmpq $32, %rax              // rax == 32?
    jne L2                      // jump to L2 if not equal
    movapd %xmm3, (%rcx)        // store aligned m3 into C[k,k+1]
    movapd %xmm2, (%rdi)        // store aligned m2 into C[l,l+1]

Performance-Driven ISA Extensions
• Subword parallelism, used primarily for multimedia applications
– Intel MMX: multimedia extension
• 64-bit registers can hold multiple integer operands
– Intel SSE: Streaming SIMD extension
• 128-bit registers can hold several floating-point operands
• Adding instructions that do more work per cycle
– Shift-add: replace two instructions with one (e.g., multiply by 5)
– Multiply-add: replace two instructions with one (x := c + a × b)
– Multiply-accumulate: reduce round-off error (s := s + a × b)
– Conditional copy: to avoid some branches (e.g., in if-then-else)

Big Idea: Amdahl’s (Heartbreaking) Law
• Speedup due to enhancement E is
Speedup w/ E = Exec time w/o E / Exec time w/ E
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected
Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S ]
Speedup w/ E = 1 / [ (1-F) + F/S ]

Big Idea: Amdahl’s Law
Speedup = 1 / [ (1 - F) + F/S ]
(1 - F) is the non-speed-up part; F/S is the speed-up part
Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?
Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33

Big Idea: Amdahl’s Law
If the portion of the program that can be parallelized is small, then the speedup is limited. The non-parallel portion limits the performance.

Example #1: Amdahl’s Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
Speedup w/ E = 1/(.75 + .25/20) = 1.31
• What if it’s usable only 15% of the time?
Speedup w/ E = 1/(.85 + .15/20) = 1.17
• Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
Speedup w/ E = 1/(.001 + .999/100) = 90.99

Example #2: Amdahl’s Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors
Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
• What if there are 100 processors?
Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
• What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors?
Speedup w/ E = 1/(.001 + .999/10) = 1/0.1009 = 9.9
• What if there are 100 processors?
Speedup w/ E = 1/(.001 + .999/100) = 1/0.01099 = 91

Strong and Weak Scaling
• To get good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
– Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
– Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor doing same amount of work
– Just 1 unit with twice the load of others cuts speedup almost in half

Review
• Flynn Taxonomy of Parallel Architectures
– SIMD: Single Instruction Multiple Data
– MIMD: Multiple Instruction Multiple Data
– SISD: Single Instruction Single Data
– MISD: Multiple Instruction Single Data (unused)
• Intel SSE SIMD Instructions
– One instruction fetch that operates on multiple operands simultaneously
– 128/64 bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C programs through use of intrinsics
– Achieve efficiency beyond that of optimizing compiler