CS 61C: Great Ideas in Computer Architecture (Machine Structures)
SIMD II
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11

New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to a computer, e.g., search "Katz" (Warehouse Scale Computer)
• Parallel Threads: assigned to a core, e.g., lookup, ads (Computer)
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions (Core)
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (today's lecture)
• Hardware descriptions: all gates @ one time (Logic Gates)
Software and hardware harness parallelism to achieve high performance.
[Figure: hardware hierarchy from Smart Phone / Warehouse Scale Computer through Computer, Core, Memory (Cache), Input/Output, Instruction Unit(s), and Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3, down to Main Memory and Logic Gates]

Review
• Flynn Taxonomy of Parallel Architectures
  – SIMD: Single Instruction Multiple Data
  – MIMD: Multiple Instruction Multiple Data
  – SISD: Single Instruction Single Data
  – MISD: Multiple Instruction Single Data (unused)
• Intel SSE SIMD Instructions
  – One instruction fetch that operates on multiple operands simultaneously
  – 128/64-bit XMM registers
• SSE Instructions in C
  – Embed the SSE machine instructions directly into C programs through the use of intrinsics
  – Achieve efficiency beyond that of an optimizing compiler

Agenda
• Amdahl's Law
• Administrivia
• SIMD and Loop Unrolling
• Technology Break
• Memory Performance for Caches
• Review of 1st Half of 61C

Big Idea: Amdahl's (Heartbreaking) Law
• Speedup due to enhancement E is
    Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:
    Execution Time w/ E = Execution Time w/o E x [ (1-F) + F/S ]
    Speedup w/ E = 1 / [ (1-F) + F/S ]

Big Idea: Amdahl's Law
• Speedup = 1 / [ (1-F) + F/S ], where (1-F) is the non-sped-up part and F/S is the sped-up part
• Example: the execution time of half of the program can be accelerated by a factor of 2. What is the overall program speedup?
    Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1 / 0.75 = 1.33

Big Idea: Amdahl's Law
• If the portion of the program that can be parallelized is small, then the speedup is limited: the non-parallel portion limits the performance.

Example #1: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time:
    Speedup w/ E = 1/(.75 + .25/20) = 1.31
• What if it is usable only 15% of the time?
    Speedup w/ E = 1/(.85 + .15/20) = 1.17
• Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less:
    Speedup w/ E = 1/(.001 + .999/100) = 90.99
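A quick way to check these numbers is to code the formula directly. A minimal C sketch, not from the slides (the function name amdahl_speedup is illustrative):

    #include <stdio.h>

    /* Amdahl's Law: speedup when a fraction F of the task is
       accelerated by a factor S and the rest is unchanged. */
    double amdahl_speedup(double F, double S) {
        return 1.0 / ((1.0 - F) + F / S);
    }

    int main(void) {
        /* Example #1: enhancement runs 20x faster */
        printf("usable 25%% of the time: %.2f\n", amdahl_speedup(0.25, 20.0));
        printf("usable 15%% of the time: %.2f\n", amdahl_speedup(0.15, 20.0));

        /* 100 processors, only 0.1% of the work is scalar */
        printf("99.9%% parallel on 100 processors: %.2f\n",
               amdahl_speedup(0.999, 100.0));
        return 0;
    }

It prints roughly 1.31, 1.17, and 90.99, matching the three cases above.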
Parallel Speed-up Example
• Workload: the sum of ten scalars, Z0 + Z1 + … + Z10, plus the element-wise sum of two 10x10 matrices, X + Y
• The scalar sum is the non-parallel part; the matrix sum is the parallel part: partition it 10 ways and perform it on 10 parallel processing units
• 10 "scalar" operations (non-parallelizable)
• 100 parallelizable operations
• 110 operations total: 100/110 = 0.909 parallelizable, 10/110 = 0.091 scalar

Example #2: Amdahl's Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors:
    Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
• What if there are 100 processors?
    Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
• What if the matrices are 33 by 33 (33 x 33 = 1089 parallel adds, 1099 operations in total) on 10 processors? (increases the parallel data by about 10x)
    Speedup w/ E = 1/(.009 + .991/10) = 1/0.108 = 9.2
• What if there are 100 processors?
    Speedup w/ E = 1/(.009 + .991/100) = 1/0.019 = 52.6
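To make the 10/110 split concrete, here is a hedged C sketch of the workload in the example: the ten scalar additions are inherently serial, while the 100 matrix-element additions could each go to a different processor. The partitioning is only indicated in comments (no threading is shown), and names such as N are illustrative.

    #include <stdio.h>

    #define N 10   /* 10x10 matrices: N*N = 100 parallelizable adds */

    double Z[N + 1];          /* the scalar values Z0..Z10 */
    double X[N][N], Y[N][N];  /* two NxN matrices */

    int main(void) {
        /* Non-parallel part: 10 scalar additions, done by one processor. */
        double sum = Z[0];
        for (int i = 1; i <= N; i++)
            sum += Z[i];

        /* Parallel part: 100 independent element-wise additions.
           Partition the rows 10 ways; each of 10 processors could take one row. */
        for (int r = 0; r < N; r++)        /* row r could go to processor r */
            for (int c = 0; c < N; c++)
                X[r][c] = X[r][c] + Y[r][c];

        /* Operation counts behind Example #2 */
        int scalar_ops = 10, parallel_ops = N * N;
        double F = (double)parallel_ops / (scalar_ops + parallel_ops);  /* 100/110 */
        double speedup10 = 1.0 / ((1.0 - F) + F / 10.0);
        printf("scalar sum = %.1f, F = %.3f, predicted speedup on 10 processors = %.1f\n",
               sum, F, speedup10);
        return 0;
    }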
Strong and Weak Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  – Strong scaling: speedup achieved on a parallel processor without increasing the size of the problem (e.g., a 10x10 matrix going from 10 processors to 100)
  – Weak scaling: speedup achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors (e.g., a 10x10 matrix on 10 processors => a 33x33 matrix on 100)
• Load balancing is another important factor: every processor should do the same amount of work.
  – Just 1 unit with twice the load of the others cuts the speedup almost in half

Peer Review
• Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?
  Speedup w/ E = 1 / [ (1-F) + F/S ]
  A (red): 4
  B (orange): 5
  C (green): 10
  D: 20
  E (pink): None of the above

Administrivia
• Lab #7 posted
• No homework, no project this week!
• TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB
• Midterm Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle
  – Split: A-Lew in 145, Li-Z in 155
  – Small number of special consideration cases, due to class conflicts, etc.; contact Dave or Randy
• No discussion during exam week; no lecture that day
• We sent out an (anonymous) 61C midway survey before the Midterm: please fill it out! (Only 1/3 have responded so far; have your voice heard!)
• https://www.surveymonkey.com/s/QS3ZLW7

61C in the News
"Remapping Computer Circuitry to Avert Impending Bottlenecks," John Markoff, NY Times, Feb 28, 2011
Hewlett-Packard researchers have proposed a fundamental rethinking of the modern computer for the coming era of nanoelectronics — a marriage of memory and computing power that could drastically limit the energy used by computers.
Today the microprocessor is in the center of the computing universe, and information is moved, at heavy energy cost, first to be used in computation and then stored. The new approach would be to marry processing to memory to cut down transportation of data and reduce energy use.
The semiconductor industry has long warned about a set of impending bottlenecks described as "the wall," a point in time where more than five decades of progress in continuously shrinking the size of transistors used in computation will end. … systems will be based on memory chips he calls "nanostores" as distinct from today's microprocessors. They will be hybrids, three-dimensional systems in which lower-level circuits will be based on a nanoelectronic technology called the memristor, which Hewlett-Packard is developing to store data.
The nanostore chips will have a multistory design, and computing circuits made with conventional silicon will sit directly on top of the memory to process the data, with minimal energy costs.

Getting to Know Profs
• Ride with sons in the MS Charity Bike Ride every September since 2002
• "Waves to Wine": 150 miles over 2 days from SF to Sonoma
• Team: "Berkeley Anti-MS Crew"
  – If you want to join the team, let me know
• Always a Top 10 fundraising team despite small size
• I was top fundraiser in 2006, 2007, 2008, 2009, and 2010 due to computing
  – Can offer fundraising advice: order of sending, when to send during the week, who to send to, …

Agenda
• Amdahl's Law
• Administrivia
• SIMD and Loop Unrolling
• Technology Break
• Memory Performance for Caches
• Review of 1st Half of 61C

Data Level Parallelism and SIMD
• SIMD wants adjacent values in memory that can be operated on in parallel
• Usually specified in programs as loops:
    for(i=1000; i>0; i=i-1)
      x[i] = x[i] + s;
• How can we reveal more data-level parallelism than is available in a single iteration of a loop?
• Unroll the loop and adjust the iteration rate

Looping in MIPS
Assumptions:
- R1 is initially the address of the element in the array with the highest address
- F2 contains the scalar value s
- 8(R2) is the address of the last element to operate on
Code:
    Loop: l.d    F0,0(R1)     ; F0 = array element
          add.d  F4,F0,F2     ; add s to F0
          s.d    F4,0(R1)     ; store result
          addiu  R1,R1,-8     ; decrement pointer by 8 bytes
          bne    R1,R2,Loop   ; repeat loop if R1 != R2

Loop Unrolled
    Loop: l.d    F0,0(R1)
          add.d  F4,F0,F2
          s.d    F4,0(R1)
          l.d    F6,-8(R1)
          add.d  F8,F6,F2
          s.d    F8,-8(R1)
          l.d    F10,-16(R1)
          add.d  F12,F10,F2
          s.d    F12,-16(R1)
          l.d    F14,-24(R1)
          add.d  F16,F14,F2
          s.d    F16,-24(R1)
          addiu  R1,R1,-32
          bne    R1,R2,Loop
NOTE:
1. Different registers eliminate stalls
2. Only 1 loop overhead every 4 iterations
3. This unrolling works if loop_limit (mod 4) = 0

Loop Unrolled Scheduled
    Loop: l.d    F0,0(R1)      ; 4 loads side-by-side:
          l.d    F6,-8(R1)     ;   could replace with a
          l.d    F10,-16(R1)   ;   4-wide SIMD load
          l.d    F14,-24(R1)
          add.d  F4,F0,F2      ; 4 adds side-by-side:
          add.d  F8,F6,F2      ;   could replace with a
          add.d  F12,F10,F2    ;   4-wide SIMD add
          add.d  F16,F14,F2
          s.d    F4,0(R1)      ; 4 stores side-by-side:
          s.d    F8,-8(R1)     ;   could replace with a
          s.d    F12,-16(R1)   ;   4-wide SIMD store
          s.d    F16,-24(R1)
          addiu  R1,R1,-32
          bne    R1,R2,Loop
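Each group of four instructions above is exactly what one SIMD instruction can do. As a bridge back to the SSE intrinsics reviewed at the start of lecture, here is a minimal C sketch, not from the slides, assuming x is an array of floats (so one 128-bit XMM register holds four elements) and that n is a multiple of 4; the function name add_s_simd is illustrative.

    #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

    /* x[i] = x[i] + s for i = 0..n-1, assuming n is a multiple of 4. */
    void add_s_simd(float *x, int n, float s) {
        __m128 s4 = _mm_set1_ps(s);              /* broadcast s into all 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_loadu_ps(&x[i]);      /* one 4-wide SIMD load  */
            v = _mm_add_ps(v, s4);               /* one 4-wide SIMD add   */
            _mm_storeu_ps(&x[i], v);             /* one 4-wide SIMD store */
        }
    }

Each pass through this loop does the work of four scalar iterations: one 4-wide load, one 4-wide add, one 4-wide store, just as the scheduled MIPS version groups them.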
Loop Unrolling in C
• Instead of having the compiler do the loop unrolling, you could do it yourself in C:
    for(i=1000; i>0; i=i-1)
      x[i] = x[i] + s;
  could be rewritten as
    for(i=1000; i>0; i=i-4) {
      x[i]   = x[i]   + s;
      x[i-1] = x[i-1] + s;
      x[i-2] = x[i-2] + s;
      x[i-3] = x[i-3] + s;
    }
• What is the downside of doing it in C?

Generalizing Loop Unrolling
• Take a loop of n iterations and make k copies of the body. Then run the loop with 1 copy of the body (n mod k) times, and with k copies of the body floor(n/k) times (see the sketch below).
• (We will revisit loop unrolling when we get to pipelining later in the semester.)
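A hedged C sketch of that recipe with k = 4, counting up for clarity (the slides' loops count down; the function name add_s_unrolled is illustrative): the leftover n mod 4 iterations run with a single copy of the body, and the remaining floor(n/4) passes run the 4-times-unrolled body.

    /* x[i] = x[i] + s for i = 0..n-1, unrolled by k = 4, for any n. */
    void add_s_unrolled(double *x, int n, double s) {
        int i = 0;

        /* Run 1 copy of the body (n mod 4) times: the leftover iterations. */
        for (; i < n % 4; i++)
            x[i] = x[i] + s;

        /* Run 4 copies of the body floor(n/4) times. */
        for (; i < n; i += 4) {
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }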
Agenda
• Amdahl's Law
• Administrivia
• SIMD and Loop Unrolling
• Memory Performance for Caches
• Technology Break
• Review of 1st Half of 61C

Reading Miss Penalty: Memory Systems that Support Caches
• The off-chip interconnect and memory architecture affect overall system performance in dramatic ways
• One-word-wide organization: the CPU and cache are on-chip, connected by a one-word-wide bus to a one-word-wide DRAM memory; assume 32-bit data and 32-bit addresses per cycle
• Timing assumptions:
  – 1 memory bus clock cycle to send the address
  – 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time), 5 memory bus clock cycles for the 2nd, 3rd, and 4th words (subsequent column access time); note the effect of latency!
  – 1 memory bus clock cycle to return a word of data
• Memory-Bus to Cache bandwidth: the number of bytes accessed from memory and transferred to the cache/CPU per memory bus clock cycle

(DDR) SDRAM Operation
• After a row is read into the N x M SRAM row register:
  – Input CAS as the starting "burst" address along with a burst length
  – Transfers a burst of data (ideally a cache block) from a series of sequential addresses within that row
  – The memory bus clock controls the transfer of successive words in the burst
[Figure: DRAM array of N rows by N cols with an N x M SRAM row buffer and M-bit output; timing diagram showing RAS with the row address, CAS with the column address, the row cycle time, and the 1st through 4th M-bit transfers of the burst]

One Word Wide Bus, One Word Blocks
• If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
     1 memory bus clock cycle to send the address
    15 memory bus clock cycles to read DRAM
     1 memory bus clock cycle to return the data
    17 total clock cycles miss penalty
• Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/17 = 0.235 bytes per memory bus clock cycle

One Word Wide Bus, Four Word Blocks
• What if the block size is four words and each word is in a different DRAM row?
     1 cycle to send the 1st address
    4 x 15 = 60 cycles to read DRAM
     1 cycle to return the last data word
    62 total clock cycles miss penalty
• Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/62 = 0.258 bytes per clock cycle

One Word Wide Bus, Four Word Blocks
• What if the block size is four words and all words are in the same DRAM row?
     1 cycle to send the 1st address
    15 + 3*5 = 30 cycles to read DRAM
     1 cycle to return the last data word
    32 total clock cycles miss penalty
• Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/32 = 0.5 bytes per clock cycle

Interleaved Memory, One Word Wide Bus
• For a block size of four words spread across four DRAM banks (bank 0 through bank 3), the four 15-cycle reads overlap:
     1 cycle to send the 1st address
    15 cycles to read the DRAM banks (in parallel)
    4*1 = 4 cycles to return the data words
    20 total clock cycles miss penalty
• Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/20 = 0.8 bytes per clock cycle

DRAM Memory System Observations
• It's important to match the cache characteristics (caches access one block at a time, usually more than one word)
  1) with the DRAM characteristics: use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
  2) with the memory-bus characteristics: make sure the memory bus can support the DRAM access rates and patterns
• The goal is to increase the Memory-Bus to Cache bandwidth; the four cases are compared in the sketch below.
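The four organizations differ only in how many DRAM accesses can be overlapped. A minimal C sketch, not from the slides, that reproduces the four miss penalties and bandwidths under the stated timing assumptions (1 cycle to send an address, 15-cycle row access, 5-cycle column access, 1 cycle per bus transfer, 4-byte words, 4-word blocks); the variable names are illustrative.

    #include <stdio.h>

    int main(void) {
        /* Timing assumptions from the slides, in memory-bus clock cycles. */
        int send_addr = 1, row_access = 15, col_access = 5, xfer = 1;
        int block_words = 4, word_bytes = 4;

        /* One-word blocks: 1 + 15 + 1 */
        int p1 = send_addr + row_access + xfer;

        /* Four-word blocks, each word in a different DRAM row: 1 + 4*15 + 1 */
        int p2 = send_addr + block_words * row_access + xfer;

        /* Four-word blocks, all words in the same DRAM row: 1 + 15 + 3*5 + 1 */
        int p3 = send_addr + row_access + (block_words - 1) * col_access + xfer;

        /* Four interleaved banks: 1 + 15 + 4*1 */
        int p4 = send_addr + row_access + block_words * xfer;

        printf("one-word block:         %2d cycles, %.3f bytes/cycle\n", p1, (double)word_bytes / p1);
        printf("4-word, different rows: %2d cycles, %.3f bytes/cycle\n", p2, (double)(block_words * word_bytes) / p2);
        printf("4-word, same row:       %2d cycles, %.3f bytes/cycle\n", p3, (double)(block_words * word_bytes) / p3);
        printf("4-word, interleaved:    %2d cycles, %.3f bytes/cycle\n", p4, (double)(block_words * word_bytes) / p4);
        return 0;
    }

Running it prints 17, 62, 32, and 20 cycles, i.e., 0.235, 0.258, 0.5, and 0.8 bytes per memory bus clock cycle, matching the slides.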
Agenda
• Amdahl's Law
• Administrivia
• SIMD and Loop Unrolling
• Memory Performance for Caches
• Technology Break
• Review of 1st Half of 61C

New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to a computer, e.g., search "Katz" (Warehouse Scale Computer)
• Parallel Threads: assigned to a core, e.g., lookup, ads (Computer)
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions (Core)
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words (Instruction/Functional Units)
• Hardware descriptions: all gates functioning in parallel at the same time (Logic Gates)
Software and hardware harness parallelism to achieve high performance.
[Figure: the same hardware hierarchy as before, with Projects 1 through 4 labeled at the levels of the stack they exercise]

6 Great Ideas in Computer Architecture
1. Layers of Representation/Interpretation
2. Moore's Law
3. Principle of Locality/Memory Hierarchy
4. Parallelism
5. Performance Measurement & Improvement
6. Dependability via Redundancy

Great Idea #1: Levels of Representation/Interpretation (first half of 61C)
  High Level Language Program (e.g., C):    temp = v[k]; v[k] = v[k+1]; v[k+1] = temp;
    Compiler
  Assembly Language Program (e.g., MIPS):   lw $t0,0($2); lw $t1,4($2); sw $t1,0($2); sw $t0,4($2)
    Assembler
  Machine Language Program (MIPS):          the binary encodings of the instructions above; anything can be represented as a number, i.e., data or instructions
    Machine Interpretation
  Hardware Architecture Description (e.g., block diagrams)
    Architecture Implementation
  Logic Circuit Description (circuit schematic diagrams)

Great Idea #2: Moore's Law
• Predicts: 2X transistors per chip every 2 years
• Gordon Moore, Intel Cofounder, B.S. Cal 1950!
[Figure: number of transistors on an integrated circuit (IC) vs. year]

Great Idea #3: Principle of Locality/Memory Hierarchy (first half of 61C)
[Figure: the memory hierarchy]

Great Idea #4: Parallelism
• Data Level Parallelism in 1st half of 61C
  – Lots of data in memory that can be operated on in parallel (e.g., adding together 2 arrays)
  – Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
• 1st project: DLP across 10s of servers and disks using MapReduce
• Next week's lab, 3rd project: DLP in memory

Summary
• Amdahl's Cruel Law: the Law of Diminishing Returns
• Loop unrolling to expose parallelism
• Optimize miss penalty via the memory system
• As the field changes, CS 61C has to change too!
• Still about the software-hardware interface
  – Programming for performance via measurement!
  – Understanding the memory hierarchy and its impact on application performance
  – Unlocking the capabilities of the architecture for performance: SIMD