0907532 Special Topics in Computer Engineering
Multicore Architecture Basics

Basic concept of parallelism
• The idea is simple: improve performance by performing two or more operations at the same time.
• Parallelism has been an important computer design strategy since the beginning.

Parallelism in This Course (multicore machines)
• Attain parallelism by using several processing elements (cores) on the same chip, or on different chips sharing main memory.
• Parallel computing is necessary for continuing performance gains, given that "clock speeds are not going to increase dramatically."

[Figure: Clock rate (GHz) vs. year, 2001–2013 — 2005 ITRS (International Technology Roadmap for Semiconductors) projection plotted against Intel single-core clock rates.]

[Figure: Change in the ITRS roadmap in two years — clock rate (GHz) vs. year, 2001–2013: the 2007 roadmap lowered the 2005 projections, shown against Intel single-core and Intel multicore clock rates.]

Shared Address Space Architectures
• Any core can directly reference any memory location.
• Communication between cores occurs implicitly as a result of loads and stores.

Memory hierarchy and cache memories:
1. Review the concepts assuming a single core.
2. Introduce the problems and solutions that arise when caches are used in multicore machines.

Single-core memory hierarchy and cache memories
• Programs tend to exhibit temporal and spatial locality:
• Temporal locality: once a program accesses a data item or instruction, it tends to access it again in the near future.
• Spatial locality: once a program accesses a data item or instruction, it tends to access nearby data items or instructions in the near future.
• Because of this locality property of programs, memory is organized in a hierarchy.

Memory hierarchy – key observations
• Access to the L1 cache is on the order of 1 cycle.
• Access to the L2 cache is on the order of 1–10 cycles.
• Access to main memory is on the order of 100s of cycles.
• Access to disk is on the order of 1000s of cycles.
[Figure: Core → L1 Cache → L2 Cache → Main Memory → Magnetic Disk; the thickness of the connecting lines depicts bandwidth (bytes/second).]

Processor and memory are far apart: the processor reaches memory through an interconnect.
[Figure sequence adapted from The Art of Multiprocessor Programming:
• Reading from memory — the processor sends an address over the interconnect, waits, and eventually receives the value.
• Writing to memory — the processor sends an address and a value, waits, and eventually receives an acknowledgement.
• Cache: reading from memory — the address is first checked against the cache; on a cache hit the value is returned immediately, on a cache miss the request goes on to memory and the returned data is placed in the cache.]

Memory and cache performance metrics
• Cache hit and miss: when the data is found in the cache we have a cache hit; otherwise it is a miss.
• Hit Ratio, HR = fraction of memory references that hit
  – Depends on the locality of the application
  – A measure of the effectiveness of the caching mechanism
• Miss Ratio, MR = fraction of memory references that miss
• HR = 1 - MR

Average memory system access time
• If all the data fits in main memory (i.e., disk access can be ignored):
  Average access time = HR × (cache access time) + MR × (main memory access time)
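As a quick illustration, here is a minimal sketch of evaluating that formula. The hit ratio (0.95) and the latencies (2 cycles for the cache, 200 cycles for main memory) are assumed example values chosen to match the orders of magnitude above, not figures from the slides.

```c
#include <stdio.h>

/* Average memory system access time, ignoring disk access.
 * All numbers below are illustrative assumptions, in CPU cycles. */
int main(void)
{
    double hit_ratio        = 0.95;               /* HR: fraction of references that hit */
    double miss_ratio       = 1.0 - hit_ratio;    /* MR = 1 - HR */
    double cache_time       = 2.0;                /* ~cycles to access the cache */
    double main_memory_time = 200.0;              /* ~cycles to access main memory */

    double avg = hit_ratio * cache_time + miss_ratio * main_memory_time;
    printf("Average access time = %.1f cycles\n", avg);  /* 0.95*2 + 0.05*200 = 11.9 */
    return 0;
}
```

Note how strongly the result depends on HR: even with only 5% misses, the slow main-memory accesses dominate the average.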
Cache line
• When there is a cache miss, a fixed-size block of consecutive data elements, called a line, is copied from main memory to the cache.
• Typical cache line sizes are 4–128 bytes.
• Main memory can be seen as a sequence of lines, some of which have a copy in the cache.

Memory hierarchy and bandwidth on multicore
• Each core has its own private L1 cache to provide fast access, e.g., 1–2 cycles.
• L2 caches may be shared across multiple cores.
• On a cache miss in both L1 and L2, the memory controller must forward the load/store request to the off-chip main memory.

Intel® Core™ Microarchitecture – Memory Sub-system: high-level multicore architectural view
[Figure: Intel Core 2 Duo Processor vs. Intel Core 2 Quad Processor block diagrams, 64 B cache line. The dual core has a shared L2 cache; the quad core has both shared and separated caches (two L2 caches, each shared by a pair of cores). Legend: A = Architectural State, E = Execution Engine & Interrupt, C = 2nd-Level Cache, B = Bus Interface (connects to main memory & I/O).]

Cache line ping-ponging, or the "tennis effect"
• One processor writes to a cache line, and then another processor writes to the same cache line but to a different data element.
• The cache line is in a separate-socket / separate-L2-cache environment.
• Each core would take a HITM (HIT Modified) on the cache line, causing it to be shipped across the FSB (Front Side Bus) to memory.
• This increases FSB traffic, and even in good conditions it costs about half the cost of a memory access.

Intel® Core™ Microarchitecture – Memory Sub-system: with a separated cache
[Figure: CPU1 and CPU2 with separate L2 caches connected to memory over the Front Side Bus (FSB); shipping an L2 cache line between them costs roughly half a memory access.]

Intel® Core™ Microarchitecture – Memory Sub-system: advantages of a shared cache, using Advanced Smart Cache® technology
[Figure: CPU1 and CPU2 sharing one L2 cache connected to memory over the FSB; because L2 is shared, there is no need to ship the cache line.]

False Sharing
• A performance issue in programs where cores write to different memory addresses BUT in the same cache line.
• Known as ping-ponging: the cache line is shipped back and forth between the cores.
• Example timeline: Core 0 writes X[0] = 0, Core 1 writes X[1] = 0, Core 0 writes X[0] = 1, Core 1 writes X[1] = 1, Core 0 writes X[0] = 2, … — every write touches the same cache line.
• False sharing is not an issue with a shared cache; it is an issue with separated caches.

Avoiding False Sharing
Change either
• the algorithm
  – adjust the implementation of the algorithm (e.g., the loop stride) so that each thread accesses data in a different cache line,
or
• the data structure
  – add some "padding" to a data structure or array (just enough padding, generally less than the cache line size) so that threads access data from different cache lines.
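Below is a minimal sketch of the padding approach, assuming POSIX threads, gcc (compile with -O2 -pthread), and a 64-byte cache line as on the slides. The struct names, iteration count, and timing helper are illustrative, not taken from the slides.

```c
/* False-sharing sketch: two threads update *different* counters.
 * In the "shared" layout both counters sit in the same 64-byte cache line,
 * so the line ping-pongs between the cores; in the "padded" layout each
 * counter gets its own cache line. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define CACHE_LINE 64            /* assumed cache-line size */
#define ITERS 100000000L         /* illustrative iteration count */

struct shared_counters {         /* a and b share one cache line -> false sharing */
    long a;
    long b;
} shared_ctr;

struct padded_counters {         /* padding pushes b onto a different cache line */
    long a;
    char pad[CACHE_LINE - sizeof(long)];
    long b;
} padded_ctr;

static void *bump(void *p)
{
    volatile long *c = p;        /* volatile keeps the compiler from collapsing the loop */
    for (long i = 0; i < ITERS; i++)
        (*c)++;                  /* each thread writes only its own counter */
    return NULL;
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void run(long *x, long *y, const char *label)
{
    pthread_t t1, t2;
    double t0 = now_sec();
    pthread_create(&t1, NULL, bump, x);
    pthread_create(&t2, NULL, bump, y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%-35s %.2f s\n", label, now_sec() - t0);
}

int main(void)
{
    run(&shared_ctr.a, &shared_ctr.b, "same cache line (false sharing):");
    run(&padded_ctr.a, &padded_ctr.b, "padded (separate cache lines):");
    return 0;
}
```

On a multicore machine where the two threads run on cores with separated caches, the padded run typically finishes noticeably faster, because each counter's cache line stays in its owner's cache instead of being shipped back and forth.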