Lab 2: Formal Verification with UPPAAL
The Gossiping Persons
There are n people. Each has a secret. They are desperate to tell their secrets to each other. They communicate over the phone; when two people are on the phone, they exchange the secrets they currently know. What is the minimum number of calls needed so that each person knows all secrets?
• How to model phone calls? Channels.
• How to represent secrets? Bit flags.
• How to exchange secrets? Global variables.
Lab 3: Design-Space Exploration with MPARM
Outline
• The MPARM simulation framework
  • Hardware
  • Software
• Design-space exploration
• Communication in MPSoC
• Mapping and scheduling
MPSoC Architecture
[Figure: several ARM cores, each with its own cache and private memory, connected over a shared bus to an interrupt device, a semaphore device, and a shared memory.]
System Design Flow
[Figure: design flow. An informal specification with constraints is turned into a system model (modeling, checked by functional simulation); architecture selection yields a system architecture; mapping and scheduling produce a mapped and scheduled model, with estimation feeding back ("not ok") to the earlier steps; hardware and software implementation is followed by testing, and once the result is ok, the prototype is fabricated.]
System Design Flow
[Figure: lab flow. From the hardware platform and the software application(s), extract the task graph and the task parameters; optimize (mapping & scheduling), evaluate by formal analysis and simulation, then implement.]
MPARM: Hardware
• ARM7 processors (up to eight)
• Variable frequency (dynamic and static)
• Instruction and data caches
• Private memory
• Scratchpad
• Shared memory
• Communication bus
• Read more in:
  /home/TDTS07/sw/mparm/MPARM/doc/simulator_statistics.txt
MPARM: Software
• Cross-compiler toolchain for building software
• No operating system
• A small set of primitives, such as WAIT and SIGNAL (look in the application code)
MPARM: Why?
• Cycle-accurate simulation of the system
• Various statistics: number of clock cycles executed, bus utilization, cache efficiency, and energy/power consumption of the components (CPUs, buses, and memories)
MPARM: How?
• mpsim.x -c2: run on two processors, collecting default statistics
• mpsim.x -c2 -w: run on two processors, collecting power/energy statistics
• mpsim.x -c1 --is=9 --ds=10: run on one processor with an instruction cache of 512 bytes and a data cache of 1024 bytes
• mpsim.x -c2 -F0,2 -F1,1 -F3,3: run on two processors operating at 100 MHz and 200 MHz, with the bus operating at 66 MHz
  • 200 MHz is the "default" frequency
• mpsim.x -h: show the other options
• Simulation results are written to the file stats.txt
Design-Space Exploration
• Platform optimization
  • Select the number of processors
  • Select the speed of each processor
  • Select the type, associativity, and size of the cache
  • Select the bus type
• Application optimization
  • Select the interprocessor communication style (shared memory or distributed message passing)
  • Select the best mapping and schedule
Assignment 1
• Given a GSM codec
  • Running on one ARM7 processor
• Variables
  • Cache parameters
  • Processor frequency
• Using MPARM, find a hardware configuration that minimizes the energy of the system
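One way to organize the exploration is a brute-force sweep over the variables. A dry-run sketch that only prints the mpsim.x command lines to run (the flags are the ones shown earlier; the loop bounds are illustrative, not prescribed by the assignment):

```shell
#!/bin/sh
# Print one mpsim.x command line per candidate configuration.
# Pipe the output to sh to actually run the simulations, then compare
# the energy figures reported in stats.txt after each run.
sweep() {
  for is in 9 10 11 12 13 14; do      # i-cache size: 2^is bytes (512 B .. 16 KB)
    for ds in 9 10 11 12 13 14; do    # d-cache size: 2^ds bytes
      for f in 1 2 3 4; do            # CPU frequency divider (200 MHz / f)
        echo "mpsim.x -c1 --is=$is --ds=$ds -F0,$f"
      done
    done
  done
}
sweep
```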
Energy/Speed Tradeoff
[Figure: CPU power-state model with three RUN modes (150 MHz at 0.75 V / 60 mW, 600 MHz at 1.3 V / 450 mW, 800 MHz at 1.6 V / 900 mW), an IDLE mode (40 mW), and a SLEEP mode (160 mW), with transition delays (1.5 ms up to 160 ms) annotated on the transitions between the states.]
Frequency Selection: ARM Core Energy
[Plot: ARM core energy [mJ] (0.6 to 2.8) on the y-axis versus the frequency divider (1 to 4) on the x-axis.]
Frequency Selection: Total Energy
[Plot: total energy [mJ] (8.5 to 11) on the y-axis versus the frequency divider (1 to 4) on the x-axis.]
Instruction Cache Size: Total Energy
[Plot: total energy [mJ] (8.5 to 12.5) versus log2(CacheSize), from 2^9 = 512 bytes to 2^14 = 16 kbytes.]
Instruction Cache Size: Execution Time
[Plot: execution time t [cycles] (5e+07 to 1e+08) versus log2(CacheSize), from 9 to 14.]
Interprocessor Data Communication
[Figure: CPU1 executes "a=1" and CPU2 executes "print a", connected by the BUS. How does the value of a get from CPU1 to CPU2?]
Shared Memory
[Figure: the variable a lives in shared memory; CPU1 writes a=1 and CPU2 writes a=2 and then executes "print a", both over the BUS. Which value does the print see? Synchronization is needed.]
Synchronization
With semaphores:
[Figure: CPU1 writes a=1 to shared memory and then executes signal(sem_a) on the semaphore sem_a; CPU2 executes wait(sem_a) before reading a, so "print a" is guaranteed to see CPU1's value.]
Synchronization Internals (1)
[Figure: wait(sem_a) on CPU2 is implemented by polling: while (sem_a==0), CPU2 keeps re-reading the semaphore over the BUS; CPU1 sets sem_a=1 when it executes signal(sem_a), after which CPU2 proceeds to read a and print it.]
Synchronization Internals (2)
• Disadvantages of polling
  • Higher power consumption (energy drawn from the battery)
  • Longer application execution time
  • Important communication on the bus is blocked by the polling traffic
Distributed Message Passing
• Instead
  • Direct CPU-to-CPU communication with distributed semaphores
• Each CPU has its own scratchpad
  • Smaller and faster than a RAM
  • Lower energy consumption than a cache
  • Put frequently used variables on the scratchpad
  • A cache is controlled by hardware (cache lines, hits/misses, …)
  • A scratchpad is controlled by software (e.g., by the compiler)
• Semaphores are allocated on the scratchpads
  • No polling
Distributed Message Passing (1)
[Figure: the variable a still lives in shared memory, but the semaphore sem_a now lives on CPU2's scratchpad; CPU1 writes a and then signals sem_a directly on CPU2, so CPU2's wait(sem_a) needs no polling over the bus.]
Distributed Message Passing (2)
[Figure: fully distributed variant. The producer CPU1 writes a=1 and sends the value together with signal(sem_a) over the BUS; both a and sem_a reside on the consumer CPU2's scratchpad, so wait(sem_a) and "print a" are local operations on CPU2.]
Assignment 2
• Given two implementations of the GSM codec
  • Shared memory
  • Distributed message passing
• Simulate and compare the two approaches with respect to
  • Energy
  • Runtime
System Design Flow
[Figure: the lab flow again. From the hardware platform and the software application(s), extract the task graph and the task parameters; optimize (mapping & scheduling), evaluate by formal analysis and simulation, then implement.]
Task Graph Extraction Example
for (i=0;i<100;i++) a[i]=1;         // T1
for (i=0;i<100;i++) b[i]=1;         // T2
for (i=0;i<100;i++) c[i]=a[i]+b[i]; // T3
[Task graph: edges t1 → t3 and t2 → t3.]
Tasks 1 and 2 can be executed in parallel.
Task 3 has a data dependency on tasks 1 and 2.
Execution Time Extraction
• Using the simulator
  • This gives an "average" execution time
• Can be extracted using dump_light_metric() in MPARM
Execution Time Extraction Example (1)
start_metric();
for (i=0;i<100;i++) a[i]=1; // T1
dump_light_metric();
for (i=0;i<100;i++) b[i]=1; // T2
dump_light_metric();
stop_metric();
stop_simulation();
Execution Time Extraction Example (2)
Task 1
Interconnect statistics
-----------------------
Overall exec time       = 287 system cycles (1435 ns)
Task NC                 = 287
1-CPU average exec time = 0 system cycles (0 ns)
Concurrent exec time    = 287 system cycles (1435 ns)
Bus busy                = 144 system cycles (50.17% of 287)
Bus transferring data   = 64 system cycles (22.30% of 287, 44.44% of 144)
-----------------------
Task 2
Interconnect statistics
-----------------------
Overall exec time       = 5554 system cycles (27770 ns)
Task NC                 = 5267
1-CPU average exec time = 0 system cycles (0 ns)
Concurrent exec time    = 5554 system cycles (27770 ns)
Bus busy                = 813 system cycles (14.64% of 5554)
Bus transferring data   = 323 system cycles (5.82% of 5554, 39.73% of 813)
Application Mapping and Scheduling
[Figure: a task graph with tasks t0–t5 and inter-task messages g0-2, g1-3, g2-4, g3-5, g4-5; some tasks carry deadlines (dl=3, dl=6, dl=9). The tasks are mapped onto CPU0, CPU1, and CPU2 and the messages onto the bus; the resulting Gantt chart shows each task on its CPU and each message on the bus.]
Mapping in MPARM Example
• Using the lightweight API of MPARM
  • get_proc_id()
if (get_proc_id() == 1) {
  // T1 executed on CPU1
  for (i=1; i<100; i++) a[i]=1;
}
if (get_proc_id() == 2) {
  // T2 executed on CPU2
  for (i=1; i<100; i++) b[i]=1;
}
Scheduling in MPARM
• The schedule is given by the code sequence executed on one processor

Schedule 1:
// T1
for(i=1;i<100;i++)a[i]=1;
// T2
for(i=1;i<100;i++)b[i]=3;

Schedule 2:
// T2
for(i=1;i<100;i++)b[i]=3;
// T1
for(i=1;i<100;i++)a[i]=1;
Assignment 3: Mapping and Scheduling
• Theoretical exercise (you will not use MPARM)
• The task graph and the task execution times are given
• Construct two schedules with the same minimal length:
  • one keeping the given task mapping
  • one changing the mapping
Thank you!
Questions?