Lab 2: Design-Space Exploration with MPARM
‹#› of 14
Outline
The MPARM simulation framework
Hardware
Software
Design-space exploration
Communication in MPSoC
‹#›
‹#› of 14
MPSoC Architecture
ARM
ARM
ARM
CACHE
CACHE
CACHE
Interrupt
Device
Bus
Private Private Private Semaphore Shared
Memory Memory Memory Device
Memory
‹#›
‹#› of 14
System Design Flow
Informal specification,
constraints
Modeling
Functional
simulation
Architecture
selection
System model
System
architecture
Mapping
Estimation
Scheduling
not ok
Mapped and
scheduled model
not ok
Hardware and
Software
Implementation
Testing
ok
not ok
Prototype
Fabrication
‹#›
‹#› of 14
System Design Flow
Hardware
platform
Software
Application(s)
Extract
Task Graph
Extract Task
Parameters
Optimize
(Mapping & Sched)
Formal
Simulation
Implement
‹#›
‹#› of 14
MPARM: Hardware
ARM7 processors (up to eight)
Variable frequency (dynamic and static)
Instruction and data caches
Private memory
Scratchpad
Shared memory
Communication bus
Read more in:
/home/TDTS07/sw/mparm/MPARM/doc/simulator_statistics.txt
‹#›
‹#› of 14
MPARM: Software
Cross-compiler toolchain for building software
No operating system
Small set of functions (such as WAIT and SIGNAL)
(look in the application code)
‹#›
‹#› of 14
MPARM: Why?
Cycle-accurate simulation of the system
Various statistics: number of clock cycles
executed, bus utilization, cache efficiency,
and energy/power consumption of the
components (CPUs, buses, and memories)
‹#›
‹#› of 14
MPARM: How?
mpsim.x -c2 — run on two processors, collecting
default statistics
mpsim.x -c2 -w — run on two processors, collecting
power/energy statistics
mpsim.x -c1 --is=9 --ds=10 — run on one processor
with instruction cache of 512 bytes and data cache of
1024 bytes
mpsim.x -c2 -F0,2 -F1,1 -F3,3 — run on two
processors operating at 100 MHz and 200 MHz and
the bus operating at 66 MHz
200 MHz is the ”default” frequency
mpsim.x -h — show other options
Simulation results are in the file stats.txt
‹#›
‹#› of 14
Design-Space Exploration
Platform optimization
Select
Select
Select
Select
the
the
the
the
number of processors
speed of each processor
type, associativity, and size of the cache
bus type
Application optimization
Select the interprocessor communication style (shared
memory or distributed message passing)
Select the best mapping and schedule
‹#›
‹#› of 14
Assignment 1
Given a GSM codec
Running on one ARM7 processor
Variables
Cache parameters
Processor frequency
Using MPARM, find a hardware configuration that
minimizes the energy of the system
‹#›
‹#› of 14
Energy/Speed Tradeoff
CPU model
0.75V, 60mW
150MHz
RUN
1.3V, 450mW
RUN
600MHz
RUN
1.6V, 900mW
RUN
160ms
800MHz
RUN
10ms
1.5ms
10ms
IDLE
40mW
140ms
90ms
SLEEP
160mW
‹#›
‹#› of 14
Energy [mJ]
Frequency Selection: ARM Core Energy
2.8
2.6
2.4
2.2
2.0
1.8
1.6
1.4
1.2
1.0
0.8
0.6
1
1.5
2
2.5
3
3.5
4
Freq. divider
‹#›
‹#› of 14
Frequency Selection: Total Energy
11
Energy [mJ]
10.5
10
9.5
9.0
8.5
1
1.5
2
2.5
3
3.5
4
Freq. divider
‹#›
‹#› of 14
Instruction Cache Size: Total Energy
12.5
Energy [mJ]
12.0
11.5
11.0
10.5
10.0
9.50
9.00
8.50
9
29=512 bytes
10
11
12
log2(CacheSize)
13
14
214=16 kbytes
‹#›
‹#› of 14
Instruction Cache Size: Execution Time
1e+08
9.5e+07
t [cycles]
9e+07
8.5e+07
8e+07
7.5e+07
7e+07
6.5e+07
6e+07
5.5e+07
5e+07
9
10
11
12
13
14
log2(CacheSize)
‹#›
‹#› of 14
Interprocessor Data Communication
CPU1
...
a=1
...
How?
CPU2
...
print a;
...
BUS
‹#›
‹#› of 14
Shared Memory
CPU1
...
a=1
...
Shared Mem
a
CPU2
a=2
print a;
BUS
a=?
Synchronization
‹#›
‹#› of 14
Synchronization
With semaphores
CPU1
Semaphore
a=1
sem_a
signal(sem_a)
Shared Mem
CPU2
a=2
wait(sem_a)
a=2
print a;
a
BUS
‹#›
‹#› of 14
Synchronization Internals (1)
CPU1
Semaphore
a=1
sem_a=1
sem_a
signal(sem_a)
Shared Mem
CPU2
while
(sem_a==0)
wait(sem_a)
a=2
print a;
a
BUS
‹#›
‹#› of 14
Synchronization Internals (2)
Disadvantages of polling
Results in higher power consumption (energy
from the battery)
Larger execution time of the application
Blocking important communication on the bus
‹#›
‹#› of 14
Distributed Message Passing
Instead
Direct CPU-CPU communication with distributed
semaphores
Each CPU has its own scratchpad
Smaller and faster than a RAM
Smaller energy consumption than a cache
Put frequently used variables on the scratchpad
Cache controlled by hardware (cache lines,
hits/misses, …)
Scratchpad controlled by software (e.g., compiler)
Semaphores allocated on scratchpads
No polling
‹#›
‹#› of 14
Distributed Message Passing (1)
CPU1
a=1
signal(sem_a)
Shared Mem
a
CPU2
wait(sem_a)
a=2
print a;
sem_a
BUS
‹#›
‹#› of 14
Distributed Message Passing (2)
a=1
CPU1(prod)
a=1
signal(sem_a)
CPU2 (cons)
wait(sem_a)
print a;
sem_a
a=1
BUS
‹#›
‹#› of 14
Assignment 2
Given two implementations of the GSM codec
Shared memory
Distributed message passing
Simulate and compare these two approaches
Energy
Runtime
‹#›
‹#› of 14
Thank you!
Questions?
‹#›
‹#› of 14