Lecture 17: Embedded Multiprocessor Memory
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf, High Performance Embedded Computing, © 2007 Elsevier
Slides © 2006 Elsevier

Topics
- Parallel memory systems
- Models for memory
- Heterogeneous memory systems
- Consistent parallel memory systems
- ARM MPCore multiprocessor

Parallel memory systems
- n memory banks can be accessed independently.
- Peak access rate is given by n parallel accesses.
- If p is the probability of a non-sequential access, then the probability of a run of k sequential accesses is P(k) = p(1 - p)^(k-1).
- The mean length of a sequential run is L_b = sum over k >= 1 of k*P(k) = 1/p. For example, with p = 0.25 the mean run length is 4 accesses.
[Figure: n independent memory banks (Bank 0 through Bank 3), each with its own address and data ports]

Memory system design
- Design parallel memory systems using the memory component models of Chapter 2.
- Parameters:
  - Area: size of the component.
  - Performance: access time of the component; may differ for reads vs. writes, page mode, etc.
  - Energy per access: may also differ.
- Delay is a nonlinear function of memory size.
- Bit line delays can dominate access time.
- Delay is a nonlinear function of the number of ports.

Heterogeneous memory systems
- Heterogeneous memory improves real-time performance:
  - Accesses to the same bank interfere, even if not to the same location.
  - Segregating real-time locations improves predictability and reduces access time variance.
- Heterogeneous memory improves power:
  - Smaller blocks with fewer ports consume less energy.
- What are the disadvantages of heterogeneous memory systems?

Memory system design methodology [Dut98]
[Figure: memory system design methodology, © 1998 IEEE]

Motion Estimation Architecture [Dut98]
[Figure: motion estimation architecture, © 1998 IEEE]

Memory Partitioning and Delay [Dut98]
[Figure: memory partitioning vs. delay, © 1998 IEEE]

Critical Sections and Locks [Akg02]
- Critical section: code section where shared data is accessed. A lock helps guarantee the consistency of shared data (e.g., global variables).
- Lock delay: time between release and acquisition of a lock.
- Lock latency: time to acquire a lock when there is no contention.
- Approach: provide an SoC lock cache.

SoC Lock Cache Mechanism
- Locks for shared code sections are stored in a dedicated lock cache.
- Locks appear in the processors' address space.
- Locks are accessed using ordinary load/store instructions (see the sketch after the results slide below).

SoC Lock Cache Features
- Simple hardware mechanism: SoCLC.
- No modifications or extensions to the processor core or to the caches.
- No special instructions or atomic primitives.
- Can be integrated as an intellectual property (IP) block into the SoC.
- Hardware interrupt-triggered notification.

SW Only vs. HW/SW Locks
[Two figure slides: software-only vs. hardware/software lock comparison]

SoC Lock Cache Hardware
[Figure: SoC lock cache hardware organization]

Short vs. Long Critical Sections
- A short critical section has a relatively short time between lock acquisition and release (for example, less than 1,000 cycles).
  - Don't switch to another task while waiting for the lock.
  - Locks are associated with PEs.
- A long critical section has a relatively long time between lock acquisition and release (for example, more than 1,000 cycles).
  - Locks are associated with tasks on PEs.
  - More hardware is required to track tasks.

SoC Lock Cache Interrupts
[Figure: interrupt-based lock notification]

SoC Lock Cache Results
- Area is less than 0.1% of the full SoC design.
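As referenced in the SoC Lock Cache Mechanism slide above, the following is a minimal C sketch of how a task might acquire and release a lock through memory-mapped lock registers using ordinary loads and stores. The base address SOCLC_BASE, the register layout, and the read-to-acquire semantics are illustrative assumptions, not the published SoCLC interface; in particular, the real SoCLC notifies waiting CPUs with a hardware interrupt rather than spinning.

#include <stdint.h>

/* Hypothetical memory-mapped SoC lock cache: one 32-bit register per lock.
 * SOCLC_BASE and the register semantics are assumptions for illustration. */
#define SOCLC_BASE    0x40000000u
#define SOCLC_LOCK(n) ((volatile uint32_t *)(SOCLC_BASE + 4u * (n)))

static int soclc_try_acquire(unsigned n)
{
    /* Assumed semantics: reading the lock register returns 0 if the lock
     * was free (and is now granted to this CPU), nonzero if already held;
     * the lock cache serializes these accesses in hardware, so no special
     * atomic instructions are needed on the processor side. */
    return *SOCLC_LOCK(n) == 0u;
}

static void soclc_acquire(unsigned n)
{
    /* Simple spin for illustration; the real SoCLC instead wakes a waiting
     * CPU with a hardware interrupt when the lock is released. */
    while (!soclc_try_acquire(n))
        ;
}

static void soclc_release(unsigned n)
{
    *SOCLC_LOCK(n) = 0u;  /* an ordinary store releases the lock */
}

The point of the sketch is that the processor core needs no new instructions: lock operations are plain memory accesses to a dedicated address range, and all serialization happens in the lock cache hardware.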
Coherent parallel memory systems
- Caches need to be coherent.
- Cache snooping is a common approach: when data is accessed from memory, look for it in the other caches.

Application-Aware Snoop Filtering [Zho08]
- In embedded systems, the designer may know which memory is shared between tasks.
- Identify the shared memory regions for each task.
- Provide this information to the operating system and the cache snoop controller for use at run time.
- Snooping is enabled only for accesses that refer to known shared regions.
- This reduces the power consumed by snooping.
- The focus is on write-back caches with a write-invalidate protocol.

Snoop Filtering Architecture
- The snoop filter determines whether the D-cache should actually be snooped.
[Figure: snoop filtering architecture]

Shared Memory Identification
- With no virtual memory: use the Shared Address Segments (SAS) mechanism.
- The programmer identifies shared structures.
- The compiler controls the placement of data and aligns it on a 2^m address boundary.
- A segment is identified by a SegID (the MSBs of its address).
- What if arrays are not of size 2^m?

SAS Snoop Filtering Hardware
- For each shared segment to be supported:
  - SegDim indicates the size of the segment (a bit mask).
  - SegID indicates the start of the segment.
- The SegID is compared with the address MSBs (the low bits selected by SegDim are masked off, and the remaining MSBs are compared with the stored SegID).
- What if you have more shared segments than hardware for identifying them?

Snoop Filtering Results
- Snoop activities are reported for direct-mapped and 4-way caches with write-invalidate and write-update mechanisms.
- Snoop activity is reduced by 51% to 98%.

Virtual memory and snoop filtering
- Recent embedded processors provide virtual memory support through MMUs.
- A virtual address (VPN + offset) is translated to a physical address (PPN + offset).
- Virtual memory provides transparent memory allocation, isolation, and protection for tasks.
- It requires a page table (PT) and a translation lookaside buffer (TLB) to translate the VPN to the PPN.
- The programmer and compiler no longer know the physical address, so a different technique is needed for snoop filtering.

Shared Memory Identification
- With virtual memory: use the Shared Page Set (SPS) mechanism.
- The programmer identifies shared structures, provides each array's starting address and size, and identifies which threads use which structures.
- The operating system assigns a RegID and stores this information in the page table (PT) and translation lookaside buffer (TLB).

SPS Snoop Filtering Hardware
- The PT and TLB are augmented with the RegID for each page.
- The shared-region information for each task is loaded by the operating system.
- It is implemented using a bit-mask register with one bit for each shared region; for example, 01010100 indicates a task that uses shared regions 2, 4, and 6.
- On a cache miss, the RegID is transmitted along the data bus.
- Filtering hardware at each node checks whether the current task has shared data in the RegID region (see the sketch after the energy results below).
[Figure: SPS snoop filtering hardware]

Snoop Filtering Energy Results
- Snoop energies are reported for direct-mapped and 4-way caches with write-invalidate (WI) and write-update (WU) mechanisms.
- WI requires much less energy than WU.
- Snoop energy is reduced by 47% to 93%.
- SPS is only used with WI.
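To make the SPS filtering check concrete, the sketch below tests whether the current task's shared-region bit mask covers the RegID carried with a snooped bus transaction. The mask width of 8 regions follows the 01010100 example on the slide; the names sps_mask_reg and sps_should_snoop are assumptions for illustration, not names from [Zho08].

#include <stdbool.h>
#include <stdint.h>

/* Per-task shared-region mask, loaded by the OS on a task switch.
 * Bit r is set if the current task uses shared region r:
 * 0x54 == 01010100b, i.e. regions 2, 4, and 6 as in the slide example. */
static uint8_t sps_mask_reg = 0x54;

/* regid: the shared-region ID transmitted on the bus with a cache miss.
 * Returns true if this node's D-cache must actually be snooped. */
static bool sps_should_snoop(unsigned regid)
{
    return (sps_mask_reg >> regid) & 1u;
}

With this check, a transaction tagged RegID 3 is filtered out (bit 3 of 0x54 is clear), while one tagged RegID 4 triggers a snoop, which is how the filter avoids tag lookups, and their energy cost, for regions the current task never shares.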
ARM11 MPCore
[Figure: ARM11 MPCore block diagram]

ARM11 MPCore Features
- Up to 4 CPUs implementing the ARM v6 architecture.
- Snoop Control Unit for cache coherency.
- Distributed Interrupt Controller.
- Private timer and private watchdog for each CPU.
- High-speed AXI (Advanced Microcontroller Bus Architecture, AMBA) L2 memory interfaces.
- Flexible configuration during synthesis.

ARM11 MPCore Pipeline Stages
- Stage 1: 1st Fetch Stage (Fe1)
- Stage 2: 2nd Fetch Stage (Fe2)
- Stage 3: Instruction Decode (De)
- Stage 4: Register read and issue (Iss)
- Stages 5-7 (one of three parallel execute pipelines):
  - Shifter Stage (Sh), ALU Operation (ALU), Saturation Stage (Sat)
  - 1st, 2nd, and 3rd Multiply-Accumulate Stages (MAC1, MAC2, MAC3)
  - Address Generation (ADD), Data Cache 1 (DC1), Data Cache 2 (DC2)
- Stage 8: Write back from Mul/ALU (WBex) or write back from the LSU (WBls)

ARM11 MPCore Caches
- Instruction and data caches, including a non-blocking data cache with Hit-Under-Miss (HUM).
- The data cache is physically indexed, physically tagged, write-back, write-allocate only.
- The instruction cache is virtually indexed, physically tagged.
- 32-bit interface to the instruction cache and 64-bit interface to the data cache.
- Hardware support for data cache coherency.
- The instruction and data caches can be independently configured during synthesis to sizes between 16 KB and 64 KB.

ARM11 MPCore Caches (continued)
- Both caches are 4-way set-associative.
- The cache line replacement policy is round-robin.
- The cache line length is eight 32-bit words.
- Both data cache read misses and write misses are non-blocking.
- Up to three outstanding data cache read misses and up to four outstanding data cache write misses are supported.
- Support is provided for streaming of sequential data with LDM operations, and for sequential instruction fetches.
- On a cache miss, critical-word-first filling of the cache is performed.

Coherency protocol - MESI
- MESI is a write-invalidate protocol: writing to a shared location invalidates the corresponding lines in other L1 caches.
- A cache line can be in one of four states:
  - Modified: the line is present only in the current cache and is dirty; it has been modified from the value in main memory.
  - Exclusive: the line is present only in the current cache and is clean; it matches the main memory value.
  - Shared: the line is present in more than one CPU cache and is clean; it matches the main memory value.
  - Invalid: the coherent cache line is not present in the cache.
[Figure: MPCore memory hierarchy - L1 data memory, L1 instruction memory, Level 2 memory (AXI)]

MPCore Level 2: Supported AXI transfers
- The ARM11 MPCore processor Level 2 interface consists, by default, of two 64-bit wide AXI bus masters.
- Supported transfers: coherent and non-coherent write-back write-allocate, and coherent non-cacheable.
- AXI transaction IDs: arbitration for transaction ordering on the AXI masters is round-robin among the requesting MP11 CPUs.
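Returning to the MESI protocol described above, the following minimal C sketch models the four line states and the two transitions most relevant to write-invalidate coherence. It is an illustrative model under simplified assumptions, not the MPCore Snoop Control Unit implementation, and the function and type names are made up for the example.

#include <stdbool.h>

/* The four MESI states of a coherent L1 data cache line. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

typedef struct {
    mesi_state_t state;
} cache_line_t;

/* Snooped write from another CPU that hits this line: under a
 * write-invalidate protocol the local copy must be invalidated
 * (a MODIFIED line would first supply or write back its dirty data). */
static void on_snooped_write(cache_line_t *line)
{
    if (line->state == MODIFIED) {
        /* write back / supply the dirty data before giving up the line */
    }
    line->state = INVALID;
}

/* Local write: other copies are invalidated on the bus (not shown),
 * and the local line becomes MODIFIED. */
static void on_local_write(cache_line_t *line)
{
    line->state = MODIFIED;
}

The key property captured here is that at most one cache can hold a line in the MODIFIED or EXCLUSIVE state at a time, which is what allows a clean, exclusively held line to be written without any bus traffic.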