Lab 2: Design-Space Exploration with MPARM ‹#› of 14

advertisement
Lab 2: Design-Space Exploration with MPARM
‹#› of 14
Outline
 The MPARM simulation framework




Hardware
Software
Design-space exploration
Communication in MPSoC
‹#›
‹#› of 14
MPSoC Architecture
ARM
ARM
ARM
CACHE
CACHE
CACHE
Interrupt
Device
Bus
Private Private Private Semaphore Shared
Memory Memory Memory Device
Memory
‹#›
‹#› of 14
System Design Flow
Informal specification,
constraints
Modeling
Functional
simulation
Architecture
selection
System model
System
architecture
Mapping
Estimation
Scheduling
not ok
Mapped and
scheduled model
not ok
Hardware and
Software
Implementation
Testing
ok
not ok
Prototype
Fabrication
‹#›
‹#› of 14
System Design Flow
Hardware
platform
Software
Application(s)
Extract
Task Graph
Extract Task
Parameters
Optimize
(Mapping & Sched)
Formal
Simulation
Implement
‹#›
‹#› of 14
MPARM: Hardware








ARM7 processors (up to eight)
Variable frequency (dynamic and static)
Instruction and data caches
Private memory
Scratchpad
Shared memory
Communication bus
Read more in:

/home/TDTS07/sw/mparm/MPARM/doc/simulator_statistics.txt
‹#›
‹#› of 14
MPARM: Software
 Cross-compiler toolchain for building software
 No operating system
 Small set of functions (such as WAIT and SIGNAL)
 (look in the application code)
‹#›
‹#› of 14
MPARM: Why?
 Cycle-accurate simulation of the system
 Various statistics: number of clock cycles
executed, bus utilization, cache efficiency,
and energy/power consumption of the
components (CPUs, buses, and memories)
‹#›
‹#› of 14
MPARM: How?
 mpsim.x -c2 — run on two processors, collecting
default statistics
 mpsim.x -c2 -w — run on two processors, collecting
power/energy statistics
 mpsim.x -c1 --is=9 --ds=10 — run on one processor
with instruction cache of 512 bytes and data cache of
1024 bytes
 mpsim.x -c2 -F0,2 -F1,1 -F3,3 — run on two
processors operating at 100 MHz and 200 MHz and
the bus operating at 66 MHz
 200 MHz is the ”default” frequency
 mpsim.x -h — show other options
 Simulation results are in the file stats.txt
‹#›
‹#› of 14
Design-Space Exploration
 Platform optimization




Select
Select
Select
Select
the
the
the
the
number of processors
speed of each processor
type, associativity, and size of the cache
bus type
 Application optimization
 Select the interprocessor communication style (shared
memory or distributed message passing)
 Select the best mapping and schedule
‹#›
‹#› of 14
Assignment 1
 Given a GSM codec
 Running on one ARM7 processor
 Variables
 Cache parameters
 Processor frequency
 Using MPARM, find a hardware configuration that
minimizes the energy of the system
‹#›
‹#› of 14
Energy/Speed Tradeoff
CPU model
0.75V, 60mW
150MHz
RUN
1.3V, 450mW
RUN
600MHz
RUN
1.6V, 900mW
RUN
160ms
800MHz
RUN
10ms
1.5ms
10ms
IDLE
40mW
140ms
90ms
SLEEP
160mW
‹#›
‹#› of 14
Energy [mJ]
Frequency Selection: ARM Core Energy
2.8
2.6
2.4
2.2
2.0
1.8
1.6
1.4
1.2
1.0
0.8
0.6
1
1.5
2
2.5
3
3.5
4
Freq. divider
‹#›
‹#› of 14
Frequency Selection: Total Energy
11
Energy [mJ]
10.5
10
9.5
9.0
8.5
1
1.5
2
2.5
3
3.5
4
Freq. divider
‹#›
‹#› of 14
Instruction Cache Size: Total Energy
12.5
Energy [mJ]
12.0
11.5
11.0
10.5
10.0
9.50
9.00
8.50
9
29=512 bytes
10
11
12
log2(CacheSize)
13
14
214=16 kbytes
‹#›
‹#› of 14
Instruction Cache Size: Execution Time
1e+08
9.5e+07
t [cycles]
9e+07
8.5e+07
8e+07
7.5e+07
7e+07
6.5e+07
6e+07
5.5e+07
5e+07
9
10
11
12
13
14
log2(CacheSize)
‹#›
‹#› of 14
Interprocessor Data Communication
CPU1
...
a=1
...
How?
CPU2
...
print a;
...
BUS
‹#›
‹#› of 14
Shared Memory
CPU1
...
a=1
...
Shared Mem
a
CPU2
a=2
print a;
BUS
a=?
Synchronization
‹#›
‹#› of 14
Synchronization
With semaphores
CPU1
Semaphore
a=1
sem_a
signal(sem_a)
Shared Mem
CPU2
a=2
wait(sem_a)
a=2
print a;
a
BUS
‹#›
‹#› of 14
Synchronization Internals (1)
CPU1
Semaphore
a=1
sem_a=1
sem_a
signal(sem_a)
Shared Mem
CPU2
while
(sem_a==0)
wait(sem_a)
a=2
print a;
a
BUS
‹#›
‹#› of 14
Synchronization Internals (2)
 Disadvantages of polling
 Results in higher power consumption (energy
from the battery)
 Larger execution time of the application
 Blocking important communication on the bus
‹#›
‹#› of 14
Distributed Message Passing
 Instead
 Direct CPU-CPU communication with distributed
semaphores
 Each CPU has its own scratchpad
 Smaller and faster than a RAM
 Smaller energy consumption than a cache
 Put frequently used variables on the scratchpad
 Cache controlled by hardware (cache lines,
hits/misses, …)
 Scratchpad controlled by software (e.g., compiler)
 Semaphores allocated on scratchpads
 No polling
‹#›
‹#› of 14
Distributed Message Passing (1)
CPU1
a=1
signal(sem_a)
Shared Mem
a
CPU2
wait(sem_a)
a=2
print a;
sem_a
BUS
‹#›
‹#› of 14
Distributed Message Passing (2)
a=1
CPU1(prod)
a=1
signal(sem_a)
CPU2 (cons)
wait(sem_a)
print a;
sem_a
a=1
BUS
‹#›
‹#› of 14
Assignment 2
 Given two implementations of the GSM codec
 Shared memory
 Distributed message passing
 Simulate and compare these two approaches
 Energy
 Runtime
‹#›
‹#› of 14
Thank you!
Questions?
‹#›
‹#› of 14
Download