TDTS30: Design space exploration Outline § § § § § Introduction Design Flow Design Space Exploration Communication in MPSoCs Mapping & Scheduling in MPSoCs 2 1 Administrative Info § Labs in the SU rooms § ssh to a remote linux computer from the pool (sy00-1.ida.liu.se – sy007.ida.liu.se) § Ex: ssh su00-3.ida.liu.se § source /home/TDTS30/sw/mparm/go_mparm.sh 3 GSM GSMPhone: Phone: §Search §Search §Radio §RadioLink LinkControl Control §Talking §Talking MP3 MP3player player Digital DigitalCamera: Camera: §Take §TakePhoto Photo §Restore §RestorePhoto Photo ... ... §High performance §Time to market §Low power 4 2 MPSoC Architecture ARM ARM ARM CACHE CACHE CACHE Interrupt Device Bus (AMBA, ST) Private Private Private Semaphore Shared Memory Memory Memory Device Memory 5 System Design Flow Informal specification, constraints Modelling Functional simulation Architecture selection System model System architecture Mapping Estimation Scheduling not ok Mapped and scheduled model not ok Hardware and Software Implementation Testing ok not ok Prototype Fabrication 6 3 System Design Flow Informal specification, constraints Modelling Functional simulation Architecture selection System model System architecture Mapping Estimation Scheduling not ok Mapped and scheduled model MPARM not ok Hardware and Software Implementation Testing ok not ok Prototype Fabrication 7 System Design Flow Hardware platform Software Application(s) Extract Task Graph Extract Task Parameters Optimize (Mapping & Sched) Formal Simulation Implement 8 4 MPARM: Hardware § § § § § § § § Simulation platform Up to 8 ARM7 processors Variable frequency (dynamic and static) Split/Unified instruction and data caches Private memory for each processor Scratchpad Shared Memory Bus: AMBA, ST § This platform can be fine tuned for a given application 9 MPARM: Hardware This platform can be fine tuned for a given application. Read more in: /home/TDTS30/sw/mparm/MPARM/doc/simulator_statistics.txt 10 5 MPARM: Software § C crosscompiler chain for building the software applications § No operating system § Small set of functions (pr, shared_alloc, WAIT/SIGNAL) § RTEMS operating system 11 MPARM: Why? § Cycle accurate simulation of the system § Various interesting statistics: number of clock cycles executed, bus utilization, cache efficiency, energy/power consumption of the componenets (CPU cores, bus, memories) 12 6 MPARM: How? § mpsim.x –c2 runs a simulation on 2 processors, collecting the default statistics § mpsim.x –c2 –w runs a simulation with power/energy statistics § mpsim.x –c1 –is 9 –ds 10 :one processor with instruction cache of size 512 and data cache of 1024 bytes § mpsim.x –c2 –F0,2 –F1,1 –F3,3 :two processors running at 100 MHz, 200 MHz and the bus running at 66 MHz § mpsim.x –h for the rest § Simulation results are in the file stats.txt 13 Design Space Exploration § Generic term for system optimization § Platform optimization: § § § § Select the number of processors Select the speed of each processor Select the type, associativity and size of the cache Select the bus type § Application optimization § Select the interprocessor communication style (shared memory or distributed message passing) § Select the best mapping and schedule 14 7 Design Space Exploration for the GSM codec Assignment 1 § Given the GSM code § Running on 1 ARM7 processor § The variables are: § cache parameters § processor frequency § Using the MPARM simulator, find a hardware configuration that minimizes the energy of the system 15 Energy/Speed Tradeoff 0.75V, 60mW 150MHz RUN 1.3V, 450mW RUN 600MHz RUN 1.6V, 900mW RUN 160µs 800MHz RUN 10µs 1.5ms 10µs IDLE 40mW 140ms 90µs SLEEP 160µW 16 8 Frequency Selection: Total Energy 11 Energy [mJ] 10.5 10.0 9.50 9.00 8.50 1 1.5 2 2.5 3 3.5 Freq. devider 4 50 MHz 200 MHz 17 Energy [mJ] Frequency Selection: ARM Core Energy 2.8 2.6 2.4 2.2 2.0 1.8 1.6 1.4 1.2 1.0 0.8 0.6 1 1.5 2 2.5 3 3.5 4 Freq. devider 18 9 Instruction Cache Size: Total Energy 12.5 Energy [mJ] 12.0 11.5 11.0 10.5 10.0 9.50 9.00 8.50 9 10 11 12 13 log2(CacheSize) 29=512 bytes 14 214=16 kbytes 19 Instruction Cache Size: Execution Time 1e+08 9.5e+07 t [cycles] 9e+07 8.5e+07 8e+07 7.5e+07 7e+07 6.5e+07 6e+07 5.5e+07 5e+07 9 10 11 12 13 14 log2(CacheSize) 20 10 Interprocessor Data Communication CPU1 ... a=1 ... CPU2 ... print a; ... BUS 21 Shared Memory CPU1 ... a=1 ... Shared Mem a CPU2 a=2 print a; BUS a=? Synchronization 22 11 Synchronization CPU1 Semaphore a=1 sem_a signal(sem_a) Shared Mem CPU2 a=2 wait(sem_a) a=2 print a; a BUS 23 Synchronization In[f/t]ernals CPU1 Semaphore a=1 po llin g sem_a=1 sem_a signal(sem_a) Shared Mem CPU2 while (sem_a==0) wait(sem_a) a=2 print a; a BUS 24 12 Distributed Message Passing CPU1 a=1 signal(sem_a) Shared Mem a CPU2 wait(sem_a) a=2 print a; sem_a BUS 25 Distributed Message Passing (2) a=1 CPU1(prod) a=1 signal(sem_a) CPU2 (cons) wait(sem_a) print a; sem_a a=1 BUS 26 13 Assignment 2 § You are given 2 implementations of the GSM codec § Shared memory § Distributed message passing § Simulate and compare these 2 approaches 27 System Design Flow Hardware platform Software Application(s) Extract Task Graph Extract Task Parameters Optimize (Mapping & Sched) Formal Simulation Implement 28 14 Task Graph Extraction Example for (i=0;i<100;i++)a[i]=1;//TASK 1 for (i=0;i<100;i++) b[i]=1;//TASK 2 for (i=0;i<100;i++) c[i]=a[i]+b[i];//TASK 3 τ1 τ2 τ3 Task 1 and 2 can be executed in parallel Task 3 has data dependency on 1 and 2 29 Execution Time Extraction § Using the simulator § This an ”average” execution time § In real-time systems, another execution time is used: worst-case (WCET) § Can be extracted using the dump_light_metric(int x) in MPARM 30 15 Execution Time Extraction Example start_metric(); for (i=0;i<100;i++) a[i]=1;//TASK 1 dump_light_metric(1); for (i=0;i<100;i++) a[i]=1;//TASK 2 dump_light_metric(2); stop_metric(); stop_simulation(); 31 Execution Time Extraction Example (2) Task 1 Interconnect statistics ----------------------Overall exec time Task NC 1-CPU average exec time Concurrent exec time Bus busy Bus transferring data of 144) ----------------------Task 2 Interconnect statistics ----------------------Overall exec time Task NC 1-CPU average exec time Concurrent exec time Bus busy Bus transferring data 39.73% of 813) = = = = = = 287 system cycles (1435 ns) 287 0 system cycles (0 ns) 287 system cycles (1435 ns) 144 system cycles (50.17% of 287) 64 system cycles (22.30% of 287, 44.44% = = = = = = 5554 system cycles (27770 ns) 5267 0 system cycles (0 ns) 5554 system cycles (27770 ns) 813 system cycles (14.64% of 5554) 323 system cycles (5.82% of 5554, 32 16 Application Mapping and Scheduling CPU1 τ2 τ3 dl=6 τ1 dl=3 CPU0 τ4 dl=9 Bus τ0 + = CPU2 τ5 γ4-5 τ1 τ2 γ1-3 = τ0 τ4 γ3-5 CPU0 γ2-4 γ0-2 CPU1 τ3 CPU2 33 Mapping in MPARM § Using the RTEMS OS -> not this time § Using the lightweight API of MPARM § get_proc_id() 34 17 Mapping in MPARM Example if (get_proc_id()==1) { //Task 1 executed on processor 1 for (i=1;i<100;i++) a[i]=1; } if (get_proc_id()==2) { //Task 1 executed on processor 2 for (i=1;i<100;i++) b[i]=1; } 35 Scheduling in MPARM § Using the RTEMS OS -> not this time § The schedule is given by the code sequence executed on one processor //task 1 for(i=1;i<100;i++)a[i]=1; //task 2 for(i=1;i<100;i++)b[i]=3; Schedule 2 Schedule 1 //task 2 for(i=1;i<100;i++)b[i]=3; //task 1 for(i=1;i<100;i++)a[i]=1; 36 18 Assignment 3: Mapping and Scheduling § § § § § GSM codec Task graph is given Extract the task execution times Given a mapping, improve the scheduling Find a better mapping 37 19