TDTS30: Design space exploration
Outline
§ Introduction
§ Design Flow
§ Design Space Exploration
§ Communication in MPSoCs
§ Mapping & Scheduling in MPSoCs
Administrative Info
§ Labs in the SU rooms
§ ssh to a remote Linux computer from the pool (su00-1.ida.liu.se to su00-7.ida.liu.se)
§ Ex: ssh su00-3.ida.liu.se
§ source /home/TDTS30/sw/mparm/go_mparm.sh
GSM Phone:
§ Search
§ Radio Link Control
§ Talking
MP3 player
Digital Camera:
§ Take Photo
§ Restore Photo
...
§High performance
§Time to market
§Low power
MPSoC Architecture
[Figure: three ARM cores, each with its own cache and private memory, connected by a bus (AMBA, ST) to a shared memory, a semaphore device and an interrupt device.]
System Design Flow
[Figure: design flow chart. Informal specification and constraints; modelling; system model with functional simulation; architecture selection; system architecture; mapping, estimation and scheduling, repeated while the result is not ok; mapped and scheduled model; hardware and software implementation; testing, with not ok looping back and ok leading on; prototype; fabrication.]
System Design Flow
[Figure: the system design flow chart repeated, with MPARM marked next to the mapped and scheduled model.]
System Design Flow
Hardware platform
Software Application(s)
Extract Task Graph
Extract Task Parameters
Optimize (Mapping & Sched)
Formal Simulation
Implement
MPARM: Hardware
§ Simulation platform
§ Up to 8 ARM7 processors
§ Variable frequency (dynamic and static)
§ Split/Unified instruction and data caches
§ Private memory for each processor
§ Scratchpad
§ Shared Memory
§ Bus: AMBA, ST
§ This platform can be fine-tuned for a given application
MPARM: Hardware
This platform can be fine-tuned for a given application. Read more in:
/home/TDTS30/sw/mparm/MPARM/doc/simulator_statistics.txt
MPARM: Software
§ C cross-compiler toolchain for building the software applications
§ No operating system
§ Small set of functions (pr, shared_alloc, WAIT/SIGNAL)
§ RTEMS operating system
MPARM: Why?
§ Cycle-accurate simulation of the system
§ Various interesting statistics: number of clock cycles executed, bus utilization, cache efficiency, energy/power consumption of the components (CPU cores, bus, memories)
MPARM: How?
§ mpsim.x -c2 runs a simulation on 2 processors, collecting the default statistics
§ mpsim.x -c2 -w runs a simulation with power/energy statistics
§ mpsim.x -c1 -is 9 -ds 10: one processor with an instruction cache of 512 bytes (2^9) and a data cache of 1024 bytes (2^10)
§ mpsim.x -c2 -F0,2 -F1,1 -F3,3: two processors running at 100 MHz and 200 MHz, and the bus running at 66 MHz
§ mpsim.x -h for the rest
§ Simulation results are in the file stats.txt
Design Space Exploration
§ Generic term for system optimization
§ Platform optimization:
  § Select the number of processors
  § Select the speed of each processor
  § Select the type, associativity and size of the cache
  § Select the bus type
§ Application optimization:
  § Select the interprocessor communication style (shared memory or distributed message passing)
  § Select the best mapping and schedule
Design Space Exploration for the GSM codec
Assignment 1
§ Given the GSM code
§ Running on 1 ARM7 processor
§ The variables are:
  § cache parameters
  § processor frequency
§ Using the MPARM simulator, find a hardware configuration that minimizes the energy of the system
Energy/Speed Tradeoff
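[Figure: processor power states and transitions. RUN operating points: 150 MHz at 0.75 V, 60 mW; 600 MHz at 1.3 V, 450 mW; 800 MHz at 1.6 V, 900 mW. IDLE: 40 mW. SLEEP: 160 µW. Transition latencies shown on the arrows: 10 µs, 160 µs, 1.5 ms, 90 µs, 140 ms.]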
Frequency Selection: Total Energy
[Plot: total energy in mJ (axis from 8.50 to 11) versus frequency divider (1 to 4). The ends of the divider axis are annotated 200 MHz and 50 MHz.]
Frequency Selection: ARM Core Energy
[Plot: ARM core energy in mJ (axis from 0.6 to 2.8) versus frequency divider (1 to 4).]
Instruction Cache Size: Total Energy
[Plot: total energy in mJ (axis from 8.50 to 12.5) versus log2(CacheSize), from 9 (2^9 = 512 bytes) to 14 (2^14 = 16 kbytes).]
Instruction Cache Size: Execution Time
[Plot: execution time t in cycles (axis from 5e+07 to 1e+08) versus log2(CacheSize), from 9 to 14.]
Interprocessor Data Communication
CPU1: ... a=1 ...
CPU2: ... print a; ...
BUS
Shared Memory
CPU1: ... a=1 ...
CPU2: a=2; print a;
Shared Mem: a
BUS
a=? -> Synchronization
Synchronization
CPU1: a=1; signal(sem_a)
CPU2: wait(sem_a); a=2; print a;
Semaphore: sem_a
Shared Mem: a
BUS
Synchronization In[f/t]ernals
CPU1: a=1; signal(sem_a) -> sem_a=1
CPU2: wait(sem_a) -> polling: while (sem_a==0); a=2; print a;
Semaphore: sem_a
Shared Mem: a
BUS
Distributed Message Passing
CPU1: a=1; signal(sem_a)
CPU2: wait(sem_a); a=2; print a;
Shared Mem: a
sem_a
BUS
Distributed Message Passing (2)
CPU1 (prod), holding a=1: a=1; signal(sem_a)
CPU2 (cons), holding sem_a and a=1: wait(sem_a); print a;
BUS
Assignment 2
§ You are given 2 implementations of the GSM codec
  § Shared memory
  § Distributed message passing
§ Simulate and compare these 2 approaches
System Design Flow
Hardware platform
Software Application(s)
Extract Task Graph
Extract Task Parameters
Optimize (Mapping & Sched)
Formal Simulation
Implement
Task Graph Extraction Example
for (i=0; i<100; i++) a[i]=1; //TASK 1
for (i=0; i<100; i++) b[i]=1; //TASK 2
for (i=0; i<100; i++) c[i]=a[i]+b[i]; //TASK 3
Task graph: τ1 -> τ3, τ2 -> τ3
Tasks 1 and 2 can be executed in parallel
Task 3 has a data dependency on tasks 1 and 2
Execution Time Extraction
§ Using the simulator
§ This is an "average" execution time
§ In real-time systems, another execution time is used: worst-case (WCET)
§ Can be extracted using dump_light_metric(int x) in MPARM
Execution Time Extraction Example
start_metric();
for (i=0;i<100;i++) a[i]=1;//TASK 1
dump_light_metric(1);
for (i=0;i<100;i++) a[i]=1;//TASK 2
dump_light_metric(2);
stop_metric();
stop_simulation();
Execution Time Extraction Example (2)
Task 1
Interconnect statistics
----------------------
Overall exec time        = 287 system cycles (1435 ns)
Task NC                  = 287
1-CPU average exec time  = 0 system cycles (0 ns)
Concurrent exec time     = 287 system cycles (1435 ns)
Bus busy                 = 144 system cycles (50.17% of 287)
Bus transferring data    = 64 system cycles (22.30% of 287, 44.44% of 144)
----------------------
Task 2
Interconnect statistics
----------------------
Overall exec time        = 5554 system cycles (27770 ns)
Task NC                  = 5267
1-CPU average exec time  = 0 system cycles (0 ns)
Concurrent exec time     = 5554 system cycles (27770 ns)
Bus busy                 = 813 system cycles (14.64% of 5554)
Bus transferring data    = 323 system cycles (5.82% of 5554, 39.73% of 813)
Application Mapping and Scheduling
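[Figure: a task graph (τ0 to τ5, with deadlines dl=3, dl=6 and dl=9 on some tasks) plus a platform (CPU0, CPU1, CPU2 and a bus) gives a mapped and scheduled result: the tasks placed on CPU0, CPU1 and CPU2, and the communications γ0-2, γ1-3, γ2-4, γ3-5 and γ4-5.]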
Mapping in MPARM
§ Using the RTEMS OS -> not this time
§ Using the lightweight API of MPARM
§ get_proc_id()
Mapping in MPARM Example
if (get_proc_id()==1) {
  //Task 1 executed on processor 1
  for (i=1;i<100;i++) a[i]=1;
}
if (get_proc_id()==2) {
  //Task 2 executed on processor 2
  for (i=1;i<100;i++) b[i]=1;
}
Scheduling in MPARM
§ Using the RTEMS OS -> not this time
§ The schedule is given by the code sequence executed on one processor
Schedule 1:
//task 1
for(i=1;i<100;i++)a[i]=1;
//task 2
for(i=1;i<100;i++)b[i]=3;
Schedule 2:
//task 2
for(i=1;i<100;i++)b[i]=3;
//task 1
for(i=1;i<100;i++)a[i]=1;
Assignment 3: Mapping and Scheduling
§ GSM codec
§ Task graph is given
§ Extract the task execution times
§ Given a mapping, improve the scheduling
§ Find a better mapping