Precision-Timed (PRET) Machines

Stephen A. Edwards (Columbia University)
Sungjun Kim (Columbia University)
Edward A. Lee (UC Berkeley)
Ben Lickly (UC Berkeley)
Isaac Liu (UC Berkeley)
Hiren D. Patel (University of Waterloo)
Jan Reineke, speaker (Saarland University / UC Berkeley)

Designing Next-Generation Real-Time Streaming Systems
Tutorial at HiPEAC 2013
Current Timing Verification Process

[Figure: tool flow – a Ptolemy model feeds a code generator, which produces a C program; the compiler produces a binary for a hardware realization; WCET analysis of the binary on that architecture is checked (✓/✗) against the timing requirements.]

Reineke et al., Saarland 2
Current Timing Verification Process

[Figure: the same tool flow, now with multiple candidate architectures – each new architecture requires its own WCET analysis of the binary.]

• New architecture → new analysis and recertification. For modern architectures this is extremely time-consuming, costly, and error-prone.
• Boeing: 40 years' supply of …
Lack of Suitable Abstractions for Real-Time Systems

Higher-level Model of Computation
  ↓ Code Generation
C-level Programming Language    ← abstracts from execution time
  ↓ Compilation
Instruction Set Architecture (ISA)
  ↓ Execution
Hardware Realizations           ← increasingly “unpredictable”
Agenda of the PRET project (and of this presentation)

Higher-level Model of Computation
  ↓ Code Generation
C-level Programming Language
  ↓ Compilation
Instruction Set Architecture (ISA)  ← endow with temporal semantics and control over timing
  ↓ Execution
Hardware Realizations               ← development of the PTARM, a predictable hardware realization
Adding Control over Timing to the ISA
Capability 1: “delay until”

Some possible capabilities in an ISA:

• [C1] Execute a block of code taking at least a specified time.

[Figure: a code block that finishes early is padded by delay_until, so the block plus the delay takes exactly 1 second.]

Where could this be useful? Finishing early is not always better:
- Scheduling anomalies (Graham’s anomalies)
- Communication protocols may expect periodic behavior
- …
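The [C1] pattern can be sketched in software. Below is a minimal Python model of the set_time/delay_until pair using a monotonic clock; the function names mirror the PRET instructions but are our own, and a software sleep only approximates the exact hardware stall.

```python
import time

def set_time(offset_s):
    # Models `set_time %r, <val>`: deadline = current time + offset.
    return time.monotonic() + offset_s

def delay_until(deadline):
    # Models `delay_until %r`: stall until current time >= deadline.
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return
        time.sleep(remaining)

def run_with_min_time(block, min_s):
    # [C1]: execute `block`, but take at least `min_s` seconds overall.
    deadline = set_time(min_s)
    result = block()
    delay_until(deadline)
    return result
```

On a PRET machine the stall is exact; a general-purpose OS can only guarantee the lower bound, never an upper one.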
Adding Control over Timing to the ISA
Capabilities 2+3: “late” and “immediate” miss detection

• [C2] Conditionally branch if the specified time was exceeded.

[Figure: after the code block, branch_expired checks whether the 1-second budget was exceeded.]

• [C3] If the specified time is exceeded during execution of the block, branch immediately to an exception handler.

[Figure: exception_on_expire arms a trap that fires as soon as the 1-second budget expires, even in the middle of the block.]
Applications of Variants 2+3
“late” and “immediate” miss detection

• [C3] “immediate miss detection”:
  - Runtime detection of missed deadlines to initiate error-handling mechanisms
  - Anytime algorithms
  - However: unknown state after the exception is taken
• [C2] “late miss detection”:
  - No problems with unknown state of the system
  - Change parameters of the algorithm to meet future deadlines
PRET Assembly Instructions Supporting these Four Capabilities

set_time %r, <val>           – loads current time + <val> into %r
delay_until %r               – stall until current time >= %r
branch_expired %r, <target>  – branch to <target> if current time > %r
exception_on_expire %r, <id> – arm the processor to throw exception <id> when current time > %r
deactivate_exception <id>    – disarm the processor for exception <id>
Controlled Timing in Assembly Code

[C1] Delay until:
    set_time r1, 1s
    // Code block
    delay_until r1

[C2] Late miss detection:
    set_time r1, 1s
    // Code block
    branch_expired r1, <target>
    delay_until r1

[C3] Immediate miss detection:
    set_time r1, 1s
    exception_on_expire r1, 1
    // Code block
    deactivate_exception 1
    delay_until r1
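The [C2] and [C3] patterns can likewise be modeled in Python. The names below are our own; note one fundamental difference that the slide itself points at: a real PRET machine traps asynchronously in [C3], whereas software can only poll at chosen points.

```python
import time

class DeadlineMiss(Exception):
    # Models the exception raised via `exception_on_expire`.
    pass

def branch_expired(deadline):
    # [C2] late miss detection: true iff the deadline has already passed.
    return time.monotonic() > deadline

def check_expired(deadline, exc_id=1):
    # [C3] sketch: raise if the deadline has passed. Hardware would trap
    # immediately; in software the check runs only where it is inserted.
    if time.monotonic() > deadline:
        raise DeadlineMiss(exc_id)

def late_miss_pattern(block, budget_s, on_late):
    # The [C2] sequence (set_time; block; branch_expired) in software.
    deadline = time.monotonic() + budget_s
    block()
    if branch_expired(deadline):
        on_late()  # e.g. adjust algorithm parameters for future deadlines
```

This also illustrates why [C2] is easier to use safely: the block always runs to completion, so the system state at the check is well defined.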
MTFD – Meet the F(inal) Deadline

• Capability [C1] ensures that a block of code takes at least a given time. Being arbitrarily “slow” is always possible and “easy” – but what about being “fast”?
• [C4] “MTFD”: execute a block of code taking at most the specified time.

[C4] Exact execution:
    set_time r1, 1s
    // Code block
    MTFD r1
    delay_until r1
Current Timing Verification Process

[Figure: the earlier tool flow again – for each architecture, WCET analysis of the binary is checked (✓/✗) against the timing requirements.]

• New architecture → recertification.
• Extremely time-consuming and costly.
The Future Timing Verification Process

[Figure: a Ptolemy model feeds a code generator, which produces a C program with deadline instructions; the compiler performs the timing analysis (✓/✗) once, and the binary runs on any conforming hardware realization.]

• Timing is a property of the ISA.
• The compiler can check the constraints once and for all.
• Downside: little flexibility in the development of hardware realizations.
The Future Timing Verification Process:
Flexibility through Parameterization

[Figure: the compiler’s parametric timing analysis emits, alongside the binary, constraints on the architecture’s parameters; any hardware realization satisfying the constraints meets the timing requirements (✓/✗).]

• The ISA leaves more freedom to implementations through a parameterized timing model.
• The compiler generates constraints on the parameters which are sufficient to meet the timing constraints.
• Parametric timing analysis is ongoing work.
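To make the idea concrete, here is a toy sketch of constraint generation. It assumes, purely for illustration, that the WCET of a block is linear in a single parameter, the main-memory latency L: WCET(L) = fixed_cycles + mem_accesses · L. The real parametric analysis is, as the slide notes, ongoing work.

```python
def max_memory_latency(fixed_cycles, mem_accesses, deadline_cycles):
    # Constraint: fixed_cycles + mem_accesses * L <= deadline_cycles.
    # Returns the largest integer latency L a hardware realization may
    # exhibit while still meeting the deadline, float("inf") if the block
    # never touches memory, or None if no latency can meet the deadline.
    slack = deadline_cycles - fixed_cycles
    if slack < 0:
        return None  # infeasible even with zero-latency memory
    if mem_accesses == 0:
        return float("inf")  # unconstrained: no memory accesses
    return slack // mem_accesses
```

The compiler would emit such bounds for every deadline-carrying block; a candidate architecture is acceptable iff its parameters satisfy all of them.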
The Future Timing Verification Process:
Flexibility through Parameterization

Possible parameters:
• Latencies of different components, such as the pipeline, scratchpad memory, main memory, and buses.
• Sizes of buffers, such as scratchpad memories or caches.
The Future Timing Verification Process:
Flexibility through Parameterization

Challenge: the parameterization should
• allow for efficient and accurate parametric timing analysis, and
• admit a wide variety of cost-efficient hardware realizations.
Agenda of the PRET project (and of this presentation)

Higher-level Model of Computation
  ↓ Code Generation
C-level Programming Language
  ↓ Compilation
Instruction Set Architecture (ISA)  ← endow with temporal semantics and control over timing
  ↓ Execution
Hardware Realizations               ← development of the PTARM, a predictable hardware realization
Hardware Realizations:
Challenges to delivering predictable timing

• Pipelining
• Memory hierarchy: caches, DRAM
• On-chip communication
• I/O (DMA, interrupts)
• Resource sharing (e.g. in multicore architectures)
Processor Design 101 – First Problem: Pipelining

[Figure: an unpipelined datapath, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th ed., 2007.]

First Problem: Pipelining – Pipeline It!

[Figure: the classic five-stage pipelined datapath, from Hennessy and Patterson, ibid.]
Pipelining: Great, Except for Hazards

[Figure: pipeline hazard examples, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th ed., 2007.]
Forwarding Helps, But It Does Not Solve Everything

    LD   R1, 45(R2)
    DADD R5, R1, R7
    BE   R5, R3, R0
    ST   R5, 48(R2)

[Figure: pipeline diagrams (stages F D E M W) for this sequence. Unpipelined: each instruction runs to completion before the next starts. The dream: one instruction completes per cycle. The reality: the load–use dependence stalls DADD (data hazard), the branch stalls the fetch of its successor (branch hazard), and the store contends for memory (memory hazard).]
Our Solution: Thread-Interleaved Pipelines
An old idea from the 1960s

[Figure: five hardware threads T1–T5 interleaved round-robin; in every cycle, each pipeline stage (F D E M W) is occupied by a different thread.]

Each thread occupies only one stage of the pipeline at a time:
• No hazards; perfect utilization of the pipeline
• Simple hardware implementation (no forwarding, etc.)
• Drawback: reduced single-thread performance

But what about memory?

Lee and Messerschmitt, Pipeline Interleaved Programmable DSPs, IEEE Trans. ASSP-35(9), 1987.
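The round-robin interleaving can be checked with a few lines of Python (an illustrative model, not the PTARM implementation): in each cycle, stage s holds the thread that was fetched s cycles earlier, so with as many threads as stages, every stage holds a different thread and no thread is ever in two stages at once.

```python
STAGES = ("F", "D", "E", "M", "W")

def occupancy(n_threads, cycle):
    # Thread occupying each of the five stages at `cycle`:
    # stage s holds the instruction fetched s cycles ago.
    return [(cycle - s) % n_threads for s in range(len(STAGES))]
```

With five threads, each cycle's occupancy is a permutation of the thread IDs: no two stages share a thread, so no forwarding or interlock logic is needed. The price, as the slide says, is that each thread completes at most one instruction every five cycles.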
Second Problem: Memory Hierarchy

[Figure: the memory hierarchy, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th ed., 2007.]

• The register file is a temporary memory under program control.
  Why is it so small? The instruction word size.
• A cache is a temporary memory under hardware control.
  Why is its replacement strategy application-independent? Separation of concerns.
• PRET principle: any temporary memory is under program control.
The PRET principle implies using a scratchpad in place of a cache

[Figure: four hardware threads, each with its own set of registers, share a thread-interleaved pipeline; an SRAM scratchpad is shared among the threads; DRAM main memory and I/O devices sit behind it.]
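A toy model of the difference: with a scratchpad, the program issues the slow transfers explicitly, so every subsequent access has a known, fixed latency rather than a cache-dependent hit-or-miss outcome. The latencies and the interface below are invented for illustration.

```python
SPM_LATENCY = 1    # cycles per scratchpad access (illustrative)
DRAM_LATENCY = 30  # cycles per main-memory word transfer (illustrative)

class Scratchpad:
    # Temporary memory under *program* control: the program decides what
    # is resident and when, so the cycle count is fully deterministic.
    def __init__(self, size):
        self.size = size
        self.words = {}
        self.cycles = 0  # deterministic running cycle count

    def fill(self, dram, base, n):
        # Explicit, program-issued copy-in (DMA-style) of n words.
        assert n <= self.size
        for addr in range(base, base + n):
            self.words[addr] = dram[addr]
        self.cycles += n * DRAM_LATENCY

    def read(self, addr):
        # Always a known-latency hit; a miss is a program bug, not a stall.
        assert addr in self.words, "not resident: program must fill() first"
        self.cycles += SPM_LATENCY
        return self.words[addr]
```

Contrast this with a cache, where whether `read` costs 1 or 30 cycles depends on the replacement policy's history, which is exactly what makes WCET analysis hard.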
What about the main memory?
Dynamic RAM Organization Overview

[Figure: a DIMM composed of x16 DRAM devices in two ranks, sharing the data, address, and command buses (with per-rank chip selects); inside each device, control logic, a mode register, address registers, a refresh counter, and I/O gating serve a set of banks; each bank is an array of DRAM cells with a row decoder, a column decoder/multiplexer, sense amplifiers, and a row buffer; each cell is a transistor plus a capacitor on a bit line.]

• DRAM cell: leaks charge, so it must be refreshed (every 7.8 µs for DDR2/DDR3) – hence “dynamic”.
• DRAM bank = array of DRAM cells + sense amplifiers and row buffer; the rows of a bank share the sense amplifiers and the row buffer.
• DRAM device = set of DRAM banks + control logic + I/O gating. Accesses to different banks can be pipelined, but the I/O and control logic are shared.
• DRAM module = collection of DRAM devices. A rank is a group of devices that operate in unison; ranks share the data/address/command bus.
DRAM Timing Constraints

• DRAM memory controllers have to conform to many different timing constraints.
• Almost all of these constraints are due to competition for resources at different levels:
  - within a DRAM bank: rows share the sense amplifiers;
  - within a DRAM device: sharing of I/O gating and control logic;
  - between different ranks: sharing of the data/address/command buses.
PRET DRAM Controller: Exploiting the Internal Structure of the DRAM Module

• A module consists of 4–8 banks in 1–2 ranks, which share only the command and data buses and are otherwise independent.
• Partition the banks into four groups in alternating ranks.
• Cycle through the groups in a time-triggered fashion.

[Figure: rank 0 and rank 1, with banks 0–3 each; the four groups are formed across the two ranks.]

• Successive accesses to the same group obey the timing constraints.
• Reads/writes to different groups do not interfere.
→ Provides four independent and predictable resources.
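The periodic, time-triggered schedule is what makes the worst-case latency trivially boundable. A sketch, under our own simplification of one command slot per cycle and four bank groups:

```python
N_GROUPS = 4

def slot_owner(cycle):
    # Time-triggered schedule: bank group g owns every cycle with
    # cycle % N_GROUPS == g, regardless of what other groups request.
    return cycle % N_GROUPS

def wait_for_slot(group, cycle):
    # Cycles until `group`'s next slot: bounded by N_GROUPS - 1,
    # independent of the other groups' traffic (temporal isolation).
    return (group - cycle) % N_GROUPS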
General-Purpose DRAM Controller vs PRET DRAM Controller

General-purpose controller:
• Abstracts DRAM as a single shared resource
• Schedules refreshes dynamically
• Schedules commands dynamically
• “Open page” policy speculates on locality

PRET DRAM controller:
• Abstracts DRAM as multiple independent resources
• Performs refreshes as reads: shorter interruptions
• Defers refreshes: improves perceived latency
• Follows a periodic, time-triggered schedule
• “Closed page” policy: access-history independence
Conventional DRAM Controller vs PRET DRAM Controller: Latency Evaluation

[Figure 9: latencies (cycles) of the conventional and the PRET memory controller with varying interference from other threads (0–3 other threads occupied), for 1024 B and 4096 B transfers.]

[Figure 10: latencies (cycles) of the conventional and the PRET memory controller under maximum load by interfering threads, with varying transfer size (0–4000 bytes).]

The PRET controller’s latency is the same under all thread occupancies. This demonstrates the temporal isolation achieved by the PRET DRAM controller: any timing analysis of the memory latency for one thread only needs to be done in the context of that thread.
PRET DRAM Controller vs Predator

Predator:
• abstracts DRAM as a single resource
• uses the standard refresh mechanism

PRET:
• private resources in the backend
• “manual” refreshes, which can be hidden

[Figure 7: latencies (cycles) for request sizes up to 256 bytes under Predator (shared, BL = 4, with refreshes) and PRET (BL = 4, with and without refreshes).]

→ PRET’s worst-case access latency for small transfers is smaller than Predator’s.
→ PRET’s drawback: memory is private.
Resulting PRET Architecture: the PTARM Memory Hierarchy

We have realized this in PTARM, a soft core on a Xilinx Virtex-5 FPGA.

[Figure: four hardware threads, each with its own set of registers, share a thread-interleaved pipeline and an SRAM scratchpad; the DRAM main memory has a separate bank group per thread; I/O devices attach alongside.]

Note the inverted memory hierarchy compared to multicore: fast, close memory is shared, while slow, remote memory is private!
Conclusions

• Real-time computing needs real-time abstractions.
• Potential for significant improvements in the worst-case performance of some hardware realizations.
• For more information on PRET: http://chess.eecs.berkeley.edu/pret/

[Image: Raffaello Sanzio da Urbino, The School of Athens]
References

• [ICCD ’12] Isaac Liu, Jan Reineke, David Broman, Michael Zimmer, Edward A. Lee. A PRET Microarchitecture Implementation with Repeatable Timing and Competitive Performance. In Proceedings of the International Conference on Computer Design (ICCD), October 2012.
• [CODES ’11] Jan Reineke, Isaac Liu, Hiren D. Patel, Sungjun Kim, Edward A. Lee. PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation. International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), October 2011.
• [DAC ’11] Dai Nguyen Bui, Edward A. Lee, Isaac Liu, Hiren D. Patel, Jan Reineke. Temporal Isolation on Multiprocessing Architectures. Design Automation Conference (DAC), June 2011.
• [Asilomar ’10] Isaac Liu, Jan Reineke, Edward A. Lee. A PRET Architecture Supporting Concurrent Programs with Composable Timing Properties. Conference Record of the 44th Asilomar Conference on Signals, Systems, and Computers, November 2010, Pacific Grove, California.
• [CASES ’08] Ben Lickly, Isaac Liu, Sungjun Kim, Hiren D. Patel, Stephen A. Edwards, Edward A. Lee. Predictable Programming on a Precision Timed Architecture. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), Piscataway, NJ, pp. 137–146, IEEE Press, October 2008.