Lab 2: Design-Space Exploration with MPARM ‹#› of 14 Outline The MPARM simulation framework Hardware Software Design-space exploration Communication in MPSoC ‹#› ‹#› of 14 MPSoC Architecture ARM ARM ARM CACHE CACHE CACHE Interrupt Device Bus Private Private Private Semaphore Shared Memory Memory Memory Device Memory ‹#› ‹#› of 14 System Design Flow Informal specification, constraints Modeling Functional simulation Architecture selection System model System architecture Mapping Estimation Scheduling not ok Mapped and scheduled model not ok Hardware and Software Implementation Testing ok not ok Prototype Fabrication ‹#› ‹#› of 14 System Design Flow Hardware platform Software Application(s) Extract Task Graph Extract Task Parameters Optimize (Mapping & Sched) Formal Simulation Implement ‹#› ‹#› of 14 MPARM: Hardware ARM7 processors (up to eight) Variable frequency (dynamic and static) Instruction and data caches Private memory Scratchpad Shared memory Communication bus Read more in: /home/TDTS07/sw/mparm/MPARM/doc/simulator_statistics.txt ‹#› ‹#› of 14 MPARM: Software Cross-compiler toolchain for building software No operating system Small set of functions (such as WAIT and SIGNAL) (look in the application code) ‹#› ‹#› of 14 MPARM: Why? Cycle-accurate simulation of the system Various statistics: number of clock cycles executed, bus utilization, cache efficiency, and energy/power consumption of the components (CPUs, buses, and memories) ‹#› ‹#› of 14 MPARM: How? mpsim.x -c2 — run on two processors, collecting default statistics mpsim.x -c2 -w — run on two processors, collecting power/energy statistics mpsim.x -c1 --is=9 --ds=10 — run on one processor with instruction cache of 512 bytes and data cache of 1024 bytes mpsim.x -c2 -F0,2 -F1,1 -F3,3 — run on two processors operating at 100 MHz and 200 MHz and the bus operating at 66 MHz 200 MHz is the ”default” frequency mpsim.x -h — show other options Simulation results are in the file stats.txt ‹#› ‹#› of 14 Design-Space Exploration Platform optimization Select Select Select Select the the the the number of processors speed of each processor type, associativity, and size of the cache bus type Application optimization Select the interprocessor communication style (shared memory or distributed message passing) Select the best mapping and schedule ‹#› ‹#› of 14 Assignment 1 Given a GSM codec Running on one ARM7 processor Variables Cache parameters Processor frequency Using MPARM, find a hardware configuration that minimizes the energy of the system ‹#› ‹#› of 14 Energy/Speed Tradeoff CPU model 0.75V, 60mW 150MHz RUN 1.3V, 450mW RUN 600MHz RUN 1.6V, 900mW RUN 160ms 800MHz RUN 10ms 1.5ms 10ms IDLE 40mW 140ms 90ms SLEEP 160mW ‹#› ‹#› of 14 Energy [mJ] Frequency Selection: ARM Core Energy 2.8 2.6 2.4 2.2 2.0 1.8 1.6 1.4 1.2 1.0 0.8 0.6 1 1.5 2 2.5 3 3.5 4 Freq. divider ‹#› ‹#› of 14 Frequency Selection: Total Energy 11 Energy [mJ] 10.5 10 9.5 9.0 8.5 1 1.5 2 2.5 3 3.5 4 Freq. divider ‹#› ‹#› of 14 Instruction Cache Size: Total Energy 12.5 Energy [mJ] 12.0 11.5 11.0 10.5 10.0 9.50 9.00 8.50 9 29=512 bytes 10 11 12 log2(CacheSize) 13 14 214=16 kbytes ‹#› ‹#› of 14 Instruction Cache Size: Execution Time 1e+08 9.5e+07 t [cycles] 9e+07 8.5e+07 8e+07 7.5e+07 7e+07 6.5e+07 6e+07 5.5e+07 5e+07 9 10 11 12 13 14 log2(CacheSize) ‹#› ‹#› of 14 Interprocessor Data Communication CPU1 ... a=1 ... How? CPU2 ... print a; ... BUS ‹#› ‹#› of 14 Shared Memory CPU1 ... a=1 ... Shared Mem a CPU2 a=2 print a; BUS a=? Synchronization ‹#› ‹#› of 14 Synchronization With semaphores CPU1 Semaphore a=1 sem_a signal(sem_a) Shared Mem CPU2 a=2 wait(sem_a) a=2 print a; a BUS ‹#› ‹#› of 14 Synchronization Internals (1) CPU1 Semaphore a=1 sem_a=1 sem_a signal(sem_a) Shared Mem CPU2 while (sem_a==0) wait(sem_a) a=2 print a; a BUS ‹#› ‹#› of 14 Synchronization Internals (2) Disadvantages of polling Results in higher power consumption (energy from the battery) Larger execution time of the application Blocking important communication on the bus ‹#› ‹#› of 14 Distributed Message Passing Instead Direct CPU-CPU communication with distributed semaphores Each CPU has its own scratchpad Smaller and faster than a RAM Smaller energy consumption than a cache Put frequently used variables on the scratchpad Cache controlled by hardware (cache lines, hits/misses, …) Scratchpad controlled by software (e.g., compiler) Semaphores allocated on scratchpads No polling ‹#› ‹#› of 14 Distributed Message Passing (1) CPU1 a=1 signal(sem_a) Shared Mem a CPU2 wait(sem_a) a=2 print a; sem_a BUS ‹#› ‹#› of 14 Distributed Message Passing (2) a=1 CPU1(prod) a=1 signal(sem_a) CPU2 (cons) wait(sem_a) print a; sem_a a=1 BUS ‹#› ‹#› of 14 Assignment 2 Given two implementations of the GSM codec Shared memory Distributed message passing Simulate and compare these two approaches Energy Runtime ‹#› ‹#› of 14 Thank you! Questions? ‹#› ‹#› of 14