System-Level Memory Bus Power and Performance Optimization for Embedded Systems
Ke Ning (kning@ece.neu.edu), David Kaeli (kaeli@ece.neu.edu)

Why Is Power More Important?
- "Power: A First Class Design Constraint for Future Architecture" – Trevor Mudge, 2001
- Increasing complexity for higher performance (MIPS): parallelism, deeper pipelines, larger memory/cache sizes, higher clock frequencies, and larger die sizes all raise dynamic power consumption
- The CMOS process continues to shrink: smaller logic gates reduce Vthreshold; lower Vthreshold means higher leakage; leakage power will eventually exceed dynamic power
- Things get worse in embedded systems: low-power and low-cost targets, fixed or limited applications/functionality, and real-time timing constraints

Power Breakdown of an Embedded System
[Figure: power breakdown of a Blackfin processor system (1.2 V internal at 400 MHz CCLK; 3.3 V external at 133 MHz SDRAM; 27 MHz PPI): SDRAM, internal dynamic, internal leakage (25°C), UART, SPORT0/1, PPI, and RTC. The SDRAM/external bus portion is the research target. Source: Analog Devices Inc.]

Introduction
- Related work on microprocessor power: low-power design trends, power metrics, power/performance tradeoffs, power optimization techniques
- Power estimation framework: an experimental framework built from a cycle-accurate Blackfin simulator, validated against a Blackfin EZ-Kit board
- Power-aware bus arbitration
- Memory page remapping

Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I – Power-Aware Bus Arbitration
- Optimization II – Memory Page Remapping
- Summary

Power Modeling
- Dynamic power estimation: instruction-level models ([Tiwari94], JouleTrack [Sinha01]); function-level models [Qu00]; architecture-level models (Cai-Lim model / TEMPEST [CaiLim99], Wattch [Brooks00], SimplePower [Ye00])
- Static power estimation: Butts-Sohi activity model [Butts00]
- Memory system power estimation: CACTI [Wilton96]; trace-driven models: Dinero IV [Edler98]

Power Equation

P = A·C·V_DD²·f + V_DD·N·k_design·I_leakage

The first term is dynamic power, the second leakage power, where:
- A – activity factor
- C – total capacitance
- V_DD – supply voltage
- f – clock frequency
- N – transistor count
- k_design – design/technology factor
- I_leakage – leakage current

Common Power Optimization Techniques
- Gating (turn off unused components): clock gating; voltage gating, e.g. cache decay [Hu01]
- Scaling (scale a component's operating point): voltage scaling, e.g. drowsy caches [Flautner02]; frequency scaling [Pering98]; resource scaling, e.g. DRAM power modes [Delaluz01]
- Banking (break a single component into smaller sub-units): vertical sub-banking, e.g. the filter cache [Kin97]; horizontal sub-banking, e.g. scratchpad memory [Kandemir01]
- Clustering (partition components into clusters)
- Switching reduction (redesign for lower activity): bus encoding, e.g. permutation codes [Mehta96], redundant codes [Stan95, Benini98], WZE [Musoll97]

Power-Aware Figures of Merit
- Power P, delay D; performance in MIPS. Battery life (mobile) and packaging (high performance) make a combined power/performance figure the obvious choice.
- Energy = P·D: Joules per instruction, inversely MIPS/W; suited to mobile / low-power applications
- Energy-delay = P·D²: MIPS²/W [Gonzalez96]
- Energy-delay² = P·D³: MIPS³/W; largely voltage- and frequency-independent
- More generically: P·D^m, i.e. MIPS^m/W

Power Optimization Effects on These Figures
- Most optimization schemes sacrifice performance for lower power consumption; switching reduction is the exception.
- All of the schemes improve power efficiency.
- All of the schemes increase hardware complexity.
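The power equation and the P·D^m family of metrics can be captured in a short sketch; the function names and the demo parameter values below are illustrative, not measured Blackfin numbers.

```python
# The power equation and P*D^m figures of merit as a sketch; the demo
# parameter values are illustrative, not measured numbers.
def total_power(a, c, vdd, f, n, k_design, i_leak):
    """P = A*C*Vdd^2*f (dynamic) + Vdd*N*k_design*I_leak (leakage)."""
    dynamic = a * c * vdd ** 2 * f
    leakage = vdd * n * k_design * i_leak
    return dynamic + leakage

def figure_of_merit(power, delay, m):
    """P*D^m: m=1 energy, m=2 energy-delay, m=3 energy-delay-square."""
    return power * delay ** m

# Halving Vdd cuts the dynamic term 4x, as the Vdd^2 factor predicts.
p_full = total_power(0.5, 1e-9, 1.2, 400e6, 0, 0.0, 0.0)
p_half = total_power(0.5, 1e-9, 0.6, 400e6, 0, 0.0, 0.0)
assert abs(p_full / p_half - 4.0) < 1e-9
```

The quadratic dependence on V_DD is why voltage scaling dominates the optimization techniques listed above, while the P·D^m exponents let a designer weight power against delay.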
Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I – Power-Aware Bus Arbitration
- Optimization II – Memory Page Remapping
- Summary

External Bus
- External bus components: typically an off-chip bus; includes the control bus, address bus, and data bus
- External bus power consumption: dynamic power factors are activity, capacitance, frequency, and voltage; leakage power depends on supply voltage, threshold voltage, and CMOS technology
- Differences from the internal bus: longer physical distance, higher bus capacitance, and lower speed; cross-line interference and higher leakage current; different communication protocols (memory/peripheral dependent), e.g. a multiplexed row/column address bus and a narrower data bus

Embedded SOC System Architecture
[Figure: media processor core with data/instruction caches on an internal bus; system DMA controller with memory DMA 0/1, PPI DMA, and SPORT DMA; NTSC/PAL encoder, streaming interface, S-Video/CVBS, NIC. The External Bus Interface Unit (EBIU) connects over the external bus to SDRAM, FLASH memory, and asynchronous devices; the external bus is the power modeling area.]

ADSP-BF533 EZ-Kit Lite Board
[Figure: BF533 Blackfin processor with audio in/out through an AD audio codec, video in/out through an ADV video codec, SPORT data I/O, SDRAM, and FLASH memory.]

External Bus Power Estimator
Previous approaches:
- Used Hamming distance only [Benini98]
- Control signal power was not considered
- The shared row/column address bus and memory state transitions were not modeled
Our estimator:
- Integrates memory control signal power into the model
- Handles the case where row and column addresses share one bus
- Accounts for memory state transitions and stalls, including the page miss penalty and the bus turnaround (traffic reversal) penalty

P(bus) = P(page miss) + P(bus turnaround) + P(control signal) + P(address generation) + P(data transmission) + P(leakage)

Two SDRAM Timing Models
[Figure: SDRAM access timing in system clock (SCLK) cycles. (a) Sequential command mode: bank 0 issues P N N A N N R R R R while bank 1 issues P N R R N A N N R R, serializing the tRP, tRCD, and tCAS delays. (b) Pipelined command mode: PRECHARGE/ACTIVATE commands to one bank overlap READs to the other. P = PRECHARGE, A = ACTIVATE, N = NOP, R = READ.]

Bus Power Simulation Framework
Program → Compiler → Target Binary → Instruction-Level Simulator → Memory Trace Generator → Memory Hierarchy Model → External Bus Power Estimator, driven by a memory power model and a memory technology timing model. The trace generator, hierarchy model, and power estimator are software modules developed for this work.

Multimedia Benchmark Configurations

Name       | Description                                                            | I-Cache | D-Cache
MPEG2-ENC  | MPEG-2 video encoder, 720x480 4:2:0 input frames                       | 16k     | 16k
MPEG2-DEC  | MPEG-2 video decoder, 720x480 sequence, 4:2:2 CCIR frame output        | 16k     | 16k
H264-ENC   | H.264/MPEG-4 Part 10 (AVC) video encoder (very high data compression)  | 16k     | 16k
H264-DEC   | H.264/MPEG-4 Part 10 (AVC) video decoder                               | 16k     | 16k
JPEG-ENC   | JPEG image encoder, 512x512 image                                      | 8k      | 8k
JPEG-DEC   | JPEG image decoder, 512x512 image                                      | 8k      | 8k
PGP-ENC    | Pretty Good Privacy encryption and digital signature of a text message | 8k      | 4k
PGP-DEC    | Pretty Good Privacy decryption of an encrypted message                 | 8k      | 4k
G721-ENC   | G.721 voice encoder, 16-bit input audio samples                        | 4k      | 2k
G721-DEC   | G.721 voice decoder of encoded bits                                    | 4k      | 2k
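The additive decomposition of P(bus) lends itself to a per-trace accumulator; the sketch below uses made-up per-event energy costs as placeholders, not the model's calibrated values.

```python
# Additive P(bus) sketch: accumulate per-event energies over a trace and
# divide by elapsed time. The nanojoule costs below are illustrative
# placeholders, not the model's calibrated per-component values.
COST_NJ = {
    "page_miss": 5.0,       # P(page miss)
    "bus_turnaround": 1.5,  # P(bus turnaround)
    "control": 0.3,         # P(control signal)
    "address": 0.4,         # P(address generation)
    "data": 0.8,            # P(data transmission)
}

def bus_power_mw(events, leakage_mw, elapsed_us):
    """Average bus power: sum of event energies over time, plus leakage."""
    energy_nj = sum(COST_NJ[e] for e in events)
    return energy_nj / elapsed_us + leakage_mw  # nJ/us == mW
```

In the real framework the event stream comes from the memory trace generator, and the per-event costs from the memory technology timing/power model.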
Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I – Power-Aware Bus Arbitration
- Optimization II – Memory Page Remapping
- Summary

Optimization I – Bus Arbitration
- Multiple bus-access masters in an SOC system: processor cores, data/instruction caches, DMA, ASIC modules
- Multimedia applications demand high bus bandwidth and have large memory footprints
- An efficient arbitration algorithm can increase power awareness, increase bus throughput, and reduce bus power

[Figure: the same SOC architecture as before, with the EBIU (arbitration enabled) as the bus arbitration target region.]

Bus Arbitration Schemes
- The EBIU, with arbitration enabled, handles core-to-memory and core-to-peripheral communication, resolves bus access contention, and schedules bus access requests
- Traditional algorithms: First Come First Serve (FCFS); Fixed Priority
- Power-aware algorithms, categorized by power metric / cost function: minimum power (P1D0) or (1, 0); minimum delay (P0D1) or (0, 1); minimum power-delay product (P1D1) or (1, 1); minimum power-delay-square product (P1D2) or (1, 2); more generically, (PnDm) or (n, m)

Bus Arbitration Schemes (Continued)
Power-aware arbitration: from the pending requests in the waiting queue, find a permutation of the external bus requests that minimizes total power and/or performance cost. This reduces to the minimum Hamiltonian path problem in a graph G(V,E):
- Vertex = request R(t, s, b, l): t – arrival time, s – starting address, b – block size, l – read/write
- Edge = transition from request i to request j; the edge weight w(i,j) is the cost of that transition

Minimum Hamiltonian Path Problem
[Figure: graph with R0 – the last request on the bus, which must be the starting point of any path – and queued requests R1, R2, R3. w(i,j) = P(i,j)^n · D(i,j)^m, where P(i,j) and D(i,j) are the power and delay of serving Rj after Ri. The example path R0→R3→R1→R2 has weight w(0,3) + w(3,1) + w(1,2).]
Finding the minimum path is an NP-complete problem.

Greedy Solution
- Greedy algorithm (local minimum): only the next request in the path is needed: pick min{ w(0,j) } over the edges of G(V,E)
- In each arbitration iteration: (1) a new graph G(V,E) is constructed; (2) the greedy-choice request is granted the bus

Experimental Setup
- Used the embedded power modeling framework
- Implemented eleven arbitration schemes inside the EBIU: FCFS, Fixed Priority, minimum power (1,0), minimum delay (0,1), and (1,1), (1,2), (2,1), (1,3), (3,1), (3,2), (2,3)
- Ten multimedia benchmarks ported to the Blackfin architecture and simulated: MPEG-2, H.264, JPEG, PGP, and G.721

Power Improvement
[Figure: average external bus power (mW) for the MPEG-2 encoder and decoder under each arbitration algorithm (FP, FCFS, (0,1), (1,0), (1,1), (1,2), (2,1), (1,3), (2,3), (3,2), (3,1)), in sequential and pipelined command modes.]
- Power-aware arbitration schemes consume less power than Fixed Priority and FCFS.
- The difference across the power-aware strategies is small.
- The pipelined command model saves 6-7% power over the sequential command model for the MPEG-2 encoder and decoder.
- The results are consistent across all other benchmarks.
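The greedy (PnDm) selection can be sketched as follows; the request tuple mirrors the slides' R(t, s, b, l) with bank/row decoded from the start address, while the power and delay numbers are toy page-miss-dominated stand-ins for the real estimator.

```python
from collections import namedtuple

# Greedy (PnDm) arbitration sketch. The power/delay values are toy
# stand-ins for the real per-request estimates.
Request = namedtuple("Request", "t bank row b l")

def cost(last, req, n, m, miss_p=5.0, hit_p=1.0, miss_d=10.0, hit_d=2.0):
    """Edge weight w(i,j) = P(i,j)**n * D(i,j)**m."""
    miss = last.bank == req.bank and last.row != req.row  # page miss?
    p = miss_p if miss else hit_p
    d = miss_d if miss else hit_d
    return p ** n * d ** m

def arbitrate(last, pending, n=1, m=0):
    """Grant the pending request minimizing w(last, j); (1,0) = min power."""
    return min(pending, key=lambda r: cost(last, r, n, m))
```

With `last` on bank 0 row 3, a pending request to another row of bank 0 costs a page miss, so the arbiter prefers a pending request to a different bank; setting (n, m) selects any of the (PnDm) schemes above.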
Speed Improvement
[Figure: average external bus delay (SCLK cycles) for the MPEG-2 encoder and decoder under each arbitration algorithm, in sequential and pipelined command modes.]
- Power-aware schemes have smaller bus delay than traditional Fixed Priority and FCFS.
- The difference across the power-aware strategies is small.
- The pipelined command model is 3-9% faster than the sequential command model for the MPEG-2 encoder and decoder.
- The results are consistent across all other benchmarks.

Comparison with the Exhaustive Algorithm
- The greedy algorithm can fail to find the optimum in certain cases, e.g. when a new request arrives mid-path.
- Its complexity is O(n) versus O(n!) for exhaustive search, and in practice the performance difference is negligible.
[Figure: example graph (R0, R1, R2, R3) where exhaustive and greedy search yield nearly equal path costs.]

Comments on Experimental Results
- Power-aware arbiters significantly reduce external bus power for all eight benchmarks measured: 14% power savings on average.
- They also reduce bus access delay: by 21% on average.
- The pipelined SDRAM model has a clear advantage over the sequential SDRAM model: 6% power savings and a 12% speedup.
- Power and delay on the external bus are highly correlated: minimizing power also nearly minimizes delay, and minimum-power schemes lead to simpler design options. Scheme (1,0) is preferred for its simplicity.
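A minimum-power arbiter needs a cheap per-request power estimate; the sketch below mirrors the Power Estimation Unit design that follows, comparing the request against open-row registers and charging a Hamming-distance switching cost, with made-up energy constants as placeholders.

```python
# PEU-style per-request power estimate (a sketch): compare the request's
# bank and row against the open-row registers, then charge a Hamming-
# distance switching cost on the column address lines. The energy
# constants are illustrative placeholders, not calibrated values.
PAGE_MISS_NJ = 5.0    # assumed precharge + activate energy per page miss
BIT_TOGGLE_NJ = 0.05  # assumed energy per toggled column address line

def estimate_power(open_rows, last_col, bank, row, col):
    """Return the estimated energy (nJ) of one access; updates open_rows."""
    energy = 0.0
    if open_rows.get(bank) != row:            # wrong open row: page miss
        energy += PAGE_MISS_NJ
        open_rows[bank] = row                 # update open-row register
    toggles = bin(last_col ^ col).count("1")  # Hamming distance
    return energy + toggles * BIT_TOGGLE_NJ
```

Because the estimate only needs a few registers and an XOR/popcount, it is cheap enough to evaluate per queued request, which is what makes the hardware arbiter below practical.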
Design of a Power Estimation Unit (PEU)
- State: the last bank address, the last column address, and one open-row address register per bank (banks 0-3)
- The next request address is split into bank, row, and column fields:
  - If the bank's open-row register does not match, output the page miss penalty power and update the register
  - If the bank address does not match the last bank, output the bank miss power
  - Use the Hamming distance against the last column address to compute the column address/data switching power, then update the last column address register
- The PEU outputs the estimated power for the candidate request

Two Arbitrator Implementation Structures
- Shared PEU structure: the request queue buffer entries (t, s, b, l) share one PEU; a comparator selects the minimum-power request, which drives the access command generator onto the external bus, and the memory/bus state is updated after each grant
- Dedicated PEU structure: one PEU per queue entry feeding the comparator, trading area for estimation latency

Performance of the Two Structures
[Figure: average delay (cycles) versus PEU logic delay (0-10 cycles) for the MPEG-2 encoder and decoder under the (1,0) arbiter, comparing shared and dedicated estimator-unit implementations.]
- A higher PEU delay lowers external bus performance for both the MPEG-2 encoder and decoder.
- When the PEU delay is 5 cycles or more, the dedicated structure is preferred; otherwise the shared structure suffices.

Summary of Bus Arbitration Schemes
- Efficient bus arbitration provides both power and performance benefits over traditional arbitration schemes.
- Minimum power and minimum delay are highly correlated on the external bus.
- The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.
- Arbitration scheme (1, 0) is recommended: the minimum-power approach offers more design options and leads to simpler implementations.
- The trade-off between design complexity and performance was presented.

Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I – Power-Aware Bus Arbitration
- Optimization II – Memory Page Remapping
- Summary

Data Access Patterns in Multimedia Apps
[Figure: address-versus-time plots of three common data access patterns in multimedia applications: fixed stride, 2-way stream, and 2-D stride.]
- The majority of cycles are spent in loop bodies and array accesses
- High data access bandwidth
- Poor locality and frequent cross-page references

Previous Work on Access Patterns
- Prior work was performance-driven and OS/compiler-based: data prefetching [Chen94] [Zhang00]; memory customization [Adve00] [Grun01]; data layout optimization [Catthoor98] [DeLaLuz04]
- Shortcomings of OS/compiler-based strategies: a multimedia benchmark's dominant activity is within large monolithic data buffers; such buffers span many memory pages and cannot be optimized further; the approaches are constrained by OS and compiler capability and offer poor flexibility

Optimization II – Page Remapping
- A technique currently used for large-memory-space peripheral memory access
- External memories in embedded multimedia systems have high bus access overhead and page miss penalties
- Efficient page remapping can reduce page misses, improve external bus throughput, and reduce power/energy consumption

[Figure: the same SOC architecture, with the EBIU and SDRAM as the page remapping target region.]

SDRAM Memory Pages
- High memory access latency: a minimum latency of one SCLK cycle, the page miss penalty, additional latency from refresh cycles, and no guaranteed access time under the arbitration logic
- Non-sequential reads/writes suffer most
[Figure: SDRAM organized as M banks of N pages, with the accessed pages marked across banks.]

SDRAM Page Miss Penalty
[Figure: command/data timing of back-to-back reads with page misses: each miss inserts a PRECHARGE (tRP) and ACTIVATE (tRCD) before the CAS latency (tCAS) and data burst. P = PRECHARGE, A = ACTIVATE, R = READ, N = NOP, D = DATA, in SCLK cycles.]

SDRAM Timing Parameters

Parameter | SCLK cycles        Access type   | Number of cycles
trcd      | 1-15               Read cycle    | trp + n·(tcas)
trp       | 1-7                Write cycle   | twp
tras      | 1-15               Page miss     | trp + trcd
tcas      | 2-3                Refresh cycle | 2·(trcd)·nrows

where twp = write-to-precharge, trp = read-to-precharge, tras = activate-to-precharge, tcas = read latency. A page miss carries a penalty of roughly 8-10 SCLK cycles.

SDRAM Page Access Sequence (I)
[Figure: 12 reads across pages 0-3 that all map to bank 0; every read needs its own PRECHARGE-ACTIVATE-READ (PAR) sequence.]
- The typical access pattern of 2-D stride / 2-way streams
- A poor data layout causes significant access overhead

SDRAM Page Access Sequence (II)
[Figure: the same 12 reads with the pages distributed across banks 0-3; after four PAR openings, the remaining reads hit open rows.]
- A distributed data layout has far less access overhead

Why We Use Page Remapping
[Figure: page 2's contents remapped across banks 0-3; the remapping entry for page 2 is {2, 0, 1, 3}.]

Page Remapping Module in an SOC System
- The remapping module is an address translation unit that translates only the bank address
- A non-MMU system inserts a page remapping module before the EBIU
- An MMU system can take advantage of the existing address translation unit (no extra hardware is needed)
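The timing parameters above determine the per-access cycle cost; a sketch of the page-hit/page-miss accounting behind sequences (I) and (II), with example timing values chosen from within the listed ranges:

```python
# Cycle-cost sketch of the open-row policy behind access sequences (I)
# and (II). Timing values are examples within the ranges listed above.
TRP, TRCD, TCAS = 3, 3, 3  # precharge, activate, CAS latency (SCLK)

def access_cycles(open_rows, bank, row):
    """SCLK cost of one read; a page miss adds the trp + trcd penalty."""
    if open_rows.get(bank) == row:
        return TCAS                   # page hit: CAS latency only
    open_rows[bank] = row             # precharge + activate the new row
    return TRP + TRCD + TCAS          # page miss

# Sequence (I): 12 reads over 4 rows that all live in one bank.
state1 = {}
seq1 = sum(access_cycles(state1, 0, r % 4) for r in range(12))     # 108: every read misses

# Sequence (II): the same rows distributed across 4 banks.
state2 = {}
seq2 = sum(access_cycles(state2, r % 4, r % 4) for r in range(12))  # 60: only 4 misses
```

Spreading the pages across banks turns 12 page misses into 4, which is exactly the effect the page remapping module is after.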
Sequence (I) After Remapping
[Figure: the same 12 reads as sequence (I), now remapped across banks 0-3; four PAR openings followed by row hits.]
- Same performance as sequence (II)
- Applicable to monolithic data buffers (e.g. frame buffers)

Page Remapping Algorithm
- An NP-complete problem, reducible to graph coloring on a page transition graph G(V,E):
- Vertex = page I(m,n): m – bank number, n – row number
- Edge = transition from page I(m,n) to page I(p,q); the weighted edges capture page traversal during program execution, and the edge weight is the number of transitions between the two pages
- Color = bank: each bank has one distinct color, and every page is assigned one color

Page Remapping Algorithm (Continued)
- Goal: from the page transition graph, find a color (bank) assignment for each page that minimizes the transition cost between same-color pages
- Algorithm steps: sort the edges by transition weight; process the edges in decreasing weight order; color the pages on each edge
- A weight parameter array for each page records the cost of mapping that page into each bank, e.g. {500, 200, 0, 0}
- Five different situations arise when processing an edge
- The page remapping table (PMT) is generated as the result of the mapping

Example Case
[Figure: the original page allocation (pages I0,0; I0,1; I1,1; I2,1; I3,1; I1,2; I1,3 over 4 banks x 4 pages) and the page transition graph with edge weights: (I0,0–I0,1): 500, (I1,1–I1,2): 200, (I0,0–I3,1): 100, (I1,2–I2,1): 80, (I3,1–I1,3): 60, (I1,1–I3,1): 50, (I2,1–I1,3): 40, (I0,0–I1,1): 30.]

Initial Step
- No page is mapped; all slots are available.

Step (1) – two unmapped pages
- Selected edge: (I0,0, I0,1), weight 500
- Actions: allocate the unmapped pages I0,0 (bank 0) and I0,1 (bank 1)
- Weight parameter updates: I0,0[0]: {0, 500, 0, 0}; I0,1[1]: {500, 0, 0, 0}

Step (2) – two unmapped pages
- Selected edge: (I1,1, I1,2), weight 200
- Actions: allocate the unmapped pages I1,1 (bank 0) and I1,2 (bank 1)
- Updates: I1,1[0]: {0, 200, 0, 0}; I1,2[1]: {200, 0, 0, 0}

Step (3) – one unmapped page
- Selected edge: (I0,0, I3,1), weight 100
- Actions: map page I3,1 (bank 2); no change for I0,0
- Updates: I3,1[2]: {100, 0, 0, 0}; I0,0[0]: {0, 500, 100, 0}

Step (4) – one unmapped page
- Selected edge: (I1,2, I2,1), weight 80
- Actions: map page I2,1 (bank 3); no change for I1,2
- Updates: I2,1[3]: {0, 80, 0, 0}; I1,2[1]: {200, 0, 0, 80}

Step (5) – one unmapped page
- Selected edge: (I3,1, I1,3), weight 60
- Actions: map page I1,3 (bank 0); no change for I3,1
- Updates: I1,3[0]: {0, 0, 60, 0}; I3,1[2]: {160, 0, 0, 0}

Step (6) – same-row pages
- Selected edge: (I1,1, I3,1), weight 50
- Actions: I1,1 and I3,1 are on the same row; no action.

Step (7) – two mapped pages
- Selected edge: (I2,1, I1,3), weight 40
- Actions: both I2,1 and I1,3 are already mapped, with no conflict.
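The edge-processing loop behind these steps can be sketched as a greedy coloring pass. The sketch folds the slides' five edge situations (same-row pages, conflict resolution) into a single least-cost placement rule, so it is a simplified approximation of the dissertation's algorithm, not a transcription.

```python
from collections import defaultdict

# Greedy page-to-bank coloring sketch: process edges in decreasing
# weight order; each page keeps a per-bank cost array, and an unmapped
# page goes to its current least-cost bank, avoiding its neighbor's
# bank for this edge. Same-row and conflict handling are folded into
# this one placement rule (a simplification).
def remap(edges, n_banks):
    """edges: iterable of (page_a, page_b, weight). Returns {page: bank}."""
    bank_of = {}
    cost = defaultdict(lambda: [0] * n_banks)   # per-page, per-bank cost
    for a, b, w in sorted(edges, key=lambda e: -e[2]):
        for p, q in ((a, b), (b, a)):
            if p not in bank_of:
                scores = list(cost[p])
                if q in bank_of:
                    scores[bank_of[q]] += w     # sharing q's bank costs w
                bank_of[p] = scores.index(min(scores))
        # Record the transition cost in both endpoints' weight arrays.
        cost[a][bank_of[b]] += w
        cost[b][bank_of[a]] += w
    return bank_of
```

On the example's first two edges this places I0,0/I0,1 and I1,1/I1,2 in different banks with the same weight arrays as steps (1) and (2); later tie-breaks may differ from the slides' choices.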
Step (7) weight parameter updates: I1,3[0]: {0, 0, 60, 40}; I2,1[3]: {40, 80, 0, 0}.

Step (8) – conflict resolution
- Selected edge: (I0,0, I1,1), weight 30
- Both I0,0 and I1,1 are mapped, and to the same bank (bank 0)
- Current weight parameters: I0,1[1]: {500, 0, 0, 0}; I1,1[0]: {30, 200, 0, 0}; I2,1[3]: {40, 80, 0, 0}; I3,1[2]: {160, 0, 0, 0}
- A candidate relocation is examined and produces no conflict; I0,0's updated weight parameters: I0,0[0]: {0, 500, 100, 30}

Generated PMT and Address Path
[Figure: the generated page remapping table (PMT, 4 kB) covering pages I0,0 through I3,1. The external memory address is split into a 14-bit memory page address, a 2-bit bank address, and a 22-bit row/column address; the bank bits are translated through the PMT before the EBIU and the 16 MB external SDRAM.]

Experimental Setup
- Used the embedded power modeling framework, with the address translation unit extended for page remapping
- A page coloring program generates the PMT
- Same 10 multimedia benchmarks: the MPEG-2, H.264, JPEG, PGP, and G.721 encoders and decoders

Page Miss Reduction
[Figure: page misses per 100 requests for each benchmark, original versus remapped, under 2-, 4-, and 8-bank configurations.]

External Bus Power
[Figure: external bus power (mW) for each benchmark, original versus remapped, under 2-, 4-, and 8-bank configurations.]

Average Access Delay
[Figure: average request delay (cycles) for each benchmark, original versus remapped, under 2-, 4-, and 8-bank configurations.]

Comments on Page Remapping
The page remapping
algorithm was presented by example. It significantly reduces the memory page miss rate: by 70-80% on average. For a 4-bank SDRAM memory system, it reduced external memory access time by 12.6%, and it reduces power consumption in the majority of the benchmarks, by 13.2% on average. Combining the power and delay effects, the algorithm substantially reduces total energy cost. A stability study in the dissertation shows that a PMT generated from one test input performs well on different inputs.

Outline
- Research Motivation and Introduction
- Related Work
- Power Estimation Framework
- Optimization I – Power-Aware Bus Arbitration
- Optimization II – Memory Page Remapping
- Summary

Summary
- Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.
- Built an external bus power estimation framework and experimental methodology. (PACS'04)
- Proposed a series of power-aware bus arbitration schemes and showed their performance improvement over traditional schemes. (HiPEAC'05; also appeared in the LNCS Transactions on High-Performance Embedded Architectures and Compilers)
- Proposed a page remapping algorithm to reduce page misses, with power and delay improvements. (LCTES'07)

Future Work
- Integrate the power estimation framework into a complete tool chain
- Extend the arbitration schemes to multiple memory interfaces and other peripheral interfaces
- Compare the performance of page remapping with corresponding OS/compiler schemes

Thank You!