Lab 2: Formal Verification with UPPAAL

The Gossiping Persons
- There are n people, and each has a secret.
- They are desperate to tell their secrets to each other, and they communicate over the phone.
- When two people are on the phone, they exchange the secrets they currently know.
- What is the minimum number of calls needed so that each person knows all secrets?
- How to model phone calls? Channels.
- How to represent secrets? Bit flags.
- How to exchange secrets? Global variables.

Lab 3: Design-Space Exploration with MPARM

Outline
- The MPARM simulation framework
  - Hardware
  - Software
- Design-space exploration
- Communication in MPSoC
- Mapping and scheduling

MPSoC Architecture
[Diagram: three ARM cores, each with a cache and a private memory, connected over a bus to an interrupt device, a semaphore device, and a shared memory.]

System Design Flow
[Flow diagram: informal specification and constraints -> modeling -> system model (checked by functional simulation) -> architecture selection -> system architecture -> mapping and scheduling (guided by estimation) -> mapped and scheduled model -> hardware and software implementation -> testing -> prototype -> fabrication; "not ok" outcomes loop back to earlier steps.]

System Design Flow
[Flow diagram: hardware platform and software application(s) -> extract task graph -> extract task parameters -> optimize (mapping & scheduling) -> formal simulation -> implement.]

MPARM: Hardware
- ARM7 processors (up to eight)
- Variable frequency (dynamic and static)
- Instruction and data caches
- Private memory
- Scratchpad
- Shared memory
- Communication bus
- Read more in: /home/TDTS07/sw/mparm/MPARM/doc/simulator_statistics.txt

MPARM: Software
- Cross-compiler toolchain for building software
- No operating system
- Small set of functions (such as WAIT and SIGNAL); look in the application code

MPARM: Why?
- Cycle-accurate simulation of the system
- Various statistics: number of clock cycles executed, bus utilization, cache efficiency, and energy/power consumption of the components (CPUs, buses, and memories)

MPARM: How?
- mpsim.x -c2 — run on two processors, collecting default statistics
- mpsim.x -c2 -w — run on two processors, collecting power/energy statistics
- mpsim.x -c1 --is=9 --ds=10 — run on one processor with an instruction cache of 512 bytes and a data cache of 1024 bytes
- mpsim.x -c2 -F0,2 -F1,1 -F3,3 — run on two processors operating at 100 MHz and 200 MHz, with the bus operating at 66 MHz (200 MHz is the "default" frequency)
- mpsim.x -h — show other options
- Simulation results are written to the file stats.txt

Design-Space Exploration
- Platform optimization
  - Select the number of processors
  - Select the speed of each processor
  - Select the type, associativity, and size of the cache
  - Select the bus type
- Application optimization
  - Select the interprocessor communication style (shared memory or distributed message passing)
  - Select the best mapping and schedule

Assignment 1
- Given a GSM codec running on one ARM7 processor
- Variables: cache parameters, processor frequency
- Using MPARM, find a hardware configuration that minimizes the energy of the system

Energy/Speed Tradeoff
[Figure: CPU power-state model: RUN at 150 MHz (0.75 V, 60 mW), 600 MHz (1.3 V, 450 mW), or 800 MHz (1.6 V, 900 mW); IDLE at 40 mW; SLEEP at 160 mW; annotated with state-transition delays (1.5 ms to 160 ms).]

Frequency Selection: ARM Core Energy
[Plot: ARM core energy [mJ] (0.6 to 2.8) versus frequency divider (1 to 4).]

Frequency Selection: Total Energy
[Plot: total energy [mJ] (8.5 to 11) versus frequency divider (1 to 4).]

Instruction Cache Size: Total Energy
[Plot: total energy [mJ] (8.5 to 12.5) versus log2(cache size), from 2^9 = 512 bytes to 2^14 = 16 kbytes.]

Instruction Cache Size: Execution Time
[Plot: execution time (5e+07 to 1e+08 cycles) versus log2(cache size), from 9 to 14.]

Interprocessor Data Communication
[Diagram: CPU1 executes "a=1" and CPU2 executes "print a", connected by the bus. How does the value of a get from CPU1 to CPU2?]

Shared Memory
[Diagram: CPU1 executes "a=1" and writes the variable a to shared memory over the bus.]
[Diagram, continued: the variable a resides in shared memory; CPU2 executes "a=2" and later "print a". Which value of a (a=?) does it read over the bus? Synchronization is needed.]

Synchronization
- With semaphores
[Diagram: CPU1 executes "a=1" then "signal(sem_a)"; sem_a resides in the semaphore device and a in shared memory; CPU2 executes "a=2", then "wait(sem_a)", then "print a", now reading a=2's replacement a=1.]

Synchronization Internals (1)
[Diagram: signal(sem_a) sets sem_a=1; wait(sem_a) is implemented by polling, "while (sem_a==0)", so CPU2 repeatedly reads sem_a over the bus until CPU1 signals.]

Synchronization Internals (2)
- Disadvantages of polling:
  - Higher power consumption (energy drawn from the battery)
  - Longer execution time of the application
  - Blocks important communication on the bus

Distributed Message Passing Instead
- Direct CPU-CPU communication with distributed semaphores
- Each CPU has its own scratchpad
  - Smaller and faster than a RAM
  - Smaller energy consumption than a cache
  - Put frequently used variables on the scratchpad
- Cache is controlled by hardware (cache lines, hits/misses, ...); scratchpad is controlled by software (e.g., the compiler)
- Semaphores allocated on scratchpads
- No polling

Distributed Message Passing (1)
[Diagram: CPU1 executes "a=1" then "signal(sem_a)"; a resides in shared memory, while sem_a resides on CPU2's scratchpad, so no polling traffic crosses the bus; CPU2 executes "a=2", "wait(sem_a)", "print a".]

Distributed Message Passing (2)
[Diagram: CPU1 (producer) executes "a=1" then "signal(sem_a)"; CPU2 (consumer) executes "wait(sem_a)" then "print a"; the message a=1 and the semaphore sem_a are held on the consumer's side.]

Assignment 2
- Given two implementations of the GSM codec:
  - Shared memory
  - Distributed message passing
- Simulate and compare these two approaches with respect to:
  - Energy
  - Runtime

System Design Flow
[Flow diagram (repeated): hardware platform and software application(s) -> extract task graph -> extract task parameters -> optimize (mapping & scheduling) -> formal simulation -> implement.]

Task Graph Extraction
- Example:

    for (i=0;i<100;i++) a[i]=1;         // T1
    for (i=0;i<100;i++) b[i]=1;         // T2
    for (i=0;i<100;i++) c[i]=a[i]+b[i]; // T3

- Tasks 1 and 2 can be executed in parallel
- Task 3 has a data dependency on tasks 1 and 2
[Task graph: t1 and t2 each have an edge to t3.]

Execution Time Extraction
- Using the simulator
- This is an "average" execution time
- Can be extracted using dump_light_metric() in MPARM

Execution Time Extraction
- Example (1):

    start_metric();
    for (i=0;i<100;i++) a[i]=1;  // T1
    dump_light_metric();
    for (i=0;i<100;i++) b[i]=1;  // T2
    dump_light_metric();
    stop_metric();
    stop_simulation();

Execution Time Extraction
- Example (2):

    Task 1
    Interconnect statistics
    -----------------------
    Overall exec time       = 287 system cycles (1435 ns)
    Task NC                 = 287
    1-CPU average exec time = 0 system cycles (0 ns)
    Concurrent exec time    = 287 system cycles (1435 ns)
    Bus busy                = 144 system cycles (50.17% of 287)
    Bus transferring data   = 64 system cycles (22.30% of 287, 44.44% of 144)

    Task 2
    Interconnect statistics
    -----------------------
    Overall exec time       = 5554 system cycles (27770 ns)
    Task NC                 = 5267
    1-CPU average exec time = 0 system cycles (0 ns)
    Concurrent exec time    = 5554 system cycles (27770 ns)
    Bus busy                = 813 system cycles (14.64% of 5554)
    Bus transferring data   = 323 system cycles (5.82% of 5554, 39.73% of 813)

Application Mapping and Scheduling
[Diagram: a task graph with tasks t0-t5 and inter-task messages g0-2, g1-3, g2-4, g3-5, g4-5, with deadlines dl=3 (t1), dl=6 (t3), and dl=9 (t4), mapped onto CPU0, CPU1, and CPU2; alongside it, the resulting processor and bus schedule.]

Mapping in MPARM
- Example using the lightweight API of MPARM: get_proc_id()

    if (get_proc_id() == 1) {
      // T1 executed on CPU1
      for (i=1; i<100; i++) a[i]=1;
    }
    if (get_proc_id() == 2) {
      // T2 executed on CPU2
      for (i=1; i<100; i++) b[i]=1;
    }

Scheduling in MPARM
- The schedule is given by the code sequence executed on one processor

    Schedule 1:
    // T1
    for(i=1;i<100;i++) a[i]=1;
    // T2
    for(i=1;i<100;i++) b[i]=3;

    Schedule 2:
    // T2
    for(i=1;i<100;i++) b[i]=3;
    // T1
    for(i=1;i<100;i++) a[i]=1;

Assignment 3: Mapping and Scheduling
- Theoretical exercise (you will not use MPARM)
- Task graph and task execution times are given
- Construct two schedules with the same minimal length:
  - keeping the same task mapping
  - changing the mapping

Thank you! Questions?