Precision-Timed (PRET) Machines

Stephen A. Edwards (Columbia University), Sungjun Kim (Columbia University), Edward A. Lee (UC Berkeley), Ben Lickly (UC Berkeley), Isaac Liu (UC Berkeley), Hiren D. Patel (University of Waterloo), Jan Reineke (Saarland University)

Tutorial "Designing Next-Generation Real-Time Streaming Systems" at HiPEAC 2013
Reineke et al., Saarland

Current Timing Verification Process
[Figure: a Ptolemy model feeds a code generator producing a C program; the compiler produces a binary for a given hardware realization (architecture); WCET analysis checks the binary against the timing requirements and reports ✓ or ✗.]

Current Timing Verification Process
- New architecture ⇒ new analysis and recertification.
- For modern architectures: extremely time-consuming, costly, and error-prone.
- Boeing: 40 years' supply of …
[Figure: as before, but with one WCET analysis per architecture.]

Lack of Suitable Abstractions for Real-Time Systems
- Higher-level model of computation
  ↓ code generation
- C-level programming language (abstracts from execution time)
  ↓ compilation
- Instruction set architecture (ISA)
  ↓ execution
- Hardware realizations (increasingly "unpredictable")

Agenda of PRET (and this presentation)
Same abstraction stack, with two goals:
- C-level programming language: endow with temporal semantics and control over timing.
- Hardware realizations: development of the PTARM, a predictable hardware realization.

Adding Control over Timing to the ISA
Capability 1: "delay until"
Some possible
capabilities in an ISA:
- [C1] Execute a block of code taking at least a specified time.
[Timeline: the code block finishes early; delay_until stretches its execution to 1 second.]
Where could this be useful? Finishing early is not always better:
- scheduling anomalies (Graham's anomalies),
- communication protocols may expect periodic behavior,
- …

Adding Control over Timing to the ISA
Capabilities 2+3: "late" and "immediate miss detection"
- [C2] Conditionally branch if the specified time was exceeded (branch_expired).
- [C3] If the specified time is exceeded during execution of the block, branch immediately to an exception handler (exception_on_expire).

Applications of Variants 2+3, "late" and "immediate miss detection"
- [C3] "immediate miss detection":
  - runtime detection of missed deadlines to initiate error-handling mechanisms,
  - anytime algorithms,
  - however: unknown state after the exception is taken.
- [C2] "late miss detection":
  - no problems with an unknown state of the system,
  - change parameters of the algorithm to meet future deadlines.

PRET Assembly Instructions Supporting these Four Capabilities
- set_time %r, <val>: loads current time + <val> into %r
- delay_until %r: stall until current time >= %r
- branch_expired %r, <target>: branch to <target> if current time > %r
- exception_on_expire %r, <id>: arm the processor to throw exception <id> when current time > %r
- deactivate_exception <id>: disarm the processor for exception <id>

Controlled Timing in Assembly Code

[C1] Delay until:
    set_time r1, 1s
    // Code block
    delay_until r1

[C2] Late miss detection:
    set_time r1, 1s
    // Code block
    branch_expired r1, <target>
    delay_until r1

[C3] Immediate miss detection:
    set_time r1, 1s
    exception_on_expire r1, 1
    // Code block
    deactivate_exception 1
    delay_until r1

MTFD: Meet the F(inal) Deadline
Capability [C1] ensures that a block of code takes at
least a given time.
- [C4] "MTFD": execute a block of code taking at most the specified time.
- Being arbitrarily "slow" is always possible and "easy". But what about being "fast"?

[C4] Exact execution:
    set_time r1, 1s
    // Code block
    MTFD r1
    delay_until r1

Current Timing Verification Process
- New architecture ⇒ recertification: extremely time-consuming and costly.
[Figure: as before, one WCET analysis per architecture.]

The Future Timing Verification Process
[Figure: a Ptolemy model feeds a code generator producing a C program with deadline instructions; the compiler performs the timing analysis and reports ✓ or ✗; the binary runs on a conforming hardware realization.]
- Timing is a property of the ISA.
- The compiler can check the timing constraints once and for all.
- Downside: little flexibility in the development of hardware realizations.

The Future Timing Verification Process: Flexibility through Parameterization
[Figure: the compiler performs a parametric timing analysis and emits, along with the binary, constraints on the hardware realization.]
- The ISA leaves more freedom to implementations through a parameterized timing model.
- The compiler generates constraints on the parameters that are sufficient to meet the timing constraints.
- Parametric timing analysis is ongoing work.

The Future Timing Verification Process: Flexibility through Parameterization
Possible parameters:
- latencies of different components, such as the pipeline, scratchpad memory, main memory, and buses;
- sizes of buffers, such as scratchpad memories or caches.
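To make the parametric scheme concrete, here is a minimal sketch in Python of the kind of sufficient constraint the compiler could emit. The cost model and the parameter names (t_op, t_mem, n_ops, n_mem) are illustrative assumptions, not the actual PRET toolchain's model.

```python
# Minimal sketch of a parametric timing constraint (illustrative model):
# a block with n_ops ordinary operations and n_mem memory accesses meets
# its deadline if  n_ops * t_op + n_mem * t_mem <= deadline,
# where t_op and t_mem are parameters of the hardware realization.

def worst_case_cycles(n_ops: int, n_mem: int, t_op: int, t_mem: int) -> int:
    """Worst-case execution time of the block under the parameterized model."""
    return n_ops * t_op + n_mem * t_mem

def constraint_holds(n_ops: int, n_mem: int, t_op: int, t_mem: int,
                     deadline: int) -> bool:
    """One instantiation of the constraint the compiler would emit."""
    return worst_case_cycles(n_ops, n_mem, t_op, t_mem) <= deadline

# A block with 200 operations and 20 memory accesses, deadline 1000 cycles:
fast_enough = constraint_holds(200, 20, t_op=1, t_mem=30, deadline=1000)
too_slow = constraint_holds(200, 20, t_op=1, t_mem=50, deadline=1000)
```

A hardware realization then conforms to the binary's timing requirements exactly when its parameter instantiation satisfies every emitted constraint, without re-running the analysis per architecture.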
The Future Timing Verification Process: Flexibility through Parameterization
Challenge: the parameterization should
- allow for efficient and accurate parametric timing analysis, and
- admit a wide variety of cost-efficient hardware realizations.

Agenda of PRET (and this presentation)
- C-level programming language: endow with temporal semantics and control over timing.
- Hardware realizations: development of the PTARM, a predictable hardware realization.

Hardware Realizations: Challenges to Delivering Predictable Timing
- Pipelining
- Memory hierarchy: caches, DRAM
- On-chip communication
- I/O (DMA, interrupts)
- Resource sharing (e.g., in multicore architectures)

Processor Design 101. First Problem: Pipelining
[Figure: unpipelined datapath, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2007.]

Pipeline It!
[Figure: five-stage pipelined datapath, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2007.]

Pipelining: Great, Except for Hazards
[Figure: pipeline hazards, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2007.]

Forwarding Helps, But It Does Not Solve Everything…
[Pipeline diagram for the instruction sequence LD R1, 45(R2); DADD R5, R1, R7; BE R5, R3, R0; ST R5, 48(R2), with stages F D E M W:
- Unpipelined: each instruction completes before the next starts.
- The Dream: perfect overlap, one instruction completing per cycle.
- The Reality: stalls caused by a memory hazard, a data hazard, and a branch hazard.]

Our Solution: Thread-Interleaved Pipelines
An old idea from the 1960s.
[Diagram: hardware threads T1 through T5 enter the five-stage pipeline in round-robin order, one stage apart.]
- Each thread occupies only one stage of the pipeline at a time
  → no hazards; perfect utilization of the pipeline
  → simple hardware implementation (no forwarding, etc.)
- Drawback: reduced single-thread performance.
- But what about memory?
Lee and Messerschmitt, Pipeline Interleaved Programmable DSPs, ASSP-35(9), 1987.

Second Problem: Memory Hierarchy
[Figure: memory hierarchy, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2007.]
- The register file is a temporary memory under program control. Why is it so small? Instruction word size.
- A cache is a temporary memory under hardware control. Why is its replacement strategy application-independent? Separation of concerns.
- PRET principle: any temporary memory is under program control.
- The PRET principle implies using a scratchpad rather than a cache.
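The difference between the two kinds of temporary memory can be illustrated with a toy timing model. The sketch below is a minimal illustration with invented latencies (1-cycle hit and scratchpad access, 100-cycle miss) and a 4-line direct-mapped cache; it is not a model of PTARM's actual memories.

```python
# Toy model: a cache's access latency depends on access history (hardware-
# controlled replacement), while a scratchpad's latency is constant
# (program-controlled contents). All sizes and latencies are illustrative.

HIT, MISS, SPM = 1, 100, 1  # latencies in cycles

class DirectMappedCache:
    def __init__(self, n_lines: int = 4):
        self.n_lines = n_lines
        self.lines = [None] * n_lines

    def access(self, addr: int) -> int:
        idx = addr % self.n_lines
        if self.lines[idx] == addr:
            return HIT
        self.lines[idx] = addr      # replacement decided by hardware
        return MISS

class Scratchpad:
    def access(self, addr: int) -> int:
        return SPM                  # same latency on every access

trace = [0, 4, 0, 4]  # addresses 0 and 4 map to the same cache line
cache = DirectMappedCache()
cache_cycles = sum(cache.access(a) for a in trace)      # every access misses
spm_cycles = sum(Scratchpad().access(a) for a in trace)  # constant latency
```

On this conflicting trace the cache's timing collapses to the miss latency on every access, and in general its timing depends on everything the program (and its neighbors) touched before; the scratchpad's timing is a constant the programmer and the timing analysis can rely on.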
placerather of cache Hardware Hardware thread Hardware thread Hardware thread thread registers Interleaved pipeline with one set of registers per thread scratc h pad SRAM scratchpad shared among threads memory I/O devices DRAM main memory Lee, Berkeley 22 Reineke et al., Saarland 26 What about the main memory? Dynamic RAM Organization Overview DRAM Device Set of DRAM banks + • Control logic • I/O gating Accesses to banks can be pipelined, however I/O + control logic are shared DRAM Cell leaks charge è needs to be refreshed (every 7.8µs for DDR2/DDR3) therefore “dynamic” DIMM addr+cmd Capacitor Capacitor Bit line Transistor command DRAM Array chip select Control Logic Mode Register DRAM Device data Row Address Mux Bank Bank Bank Bank data data address x16 Device x16 Device x16 Device Address Register x16 Device x16 Device Column Decoder/ Multiplexer x16 Device x16 Device Rank 0 Rank 1 16 64 Refresh Counter Sense Amplifiers and Row Buffer x16 Device 16 I/O Gating I/O Registers + Data I/O Row Address Word line Row Decoder Bank data 16 data 16 data 16 chip select 0 chip select 1 DRAM Bank = Array of DRAM Cells + Sense Amplifiers and Row Buffer Sharing of sense amplifiers and row buffer DRAM Module Collection of DRAM Devices • rank = groups of devices that operate in unison • Ranks share data/address/ command bus Reineke et al., Saarland 27 DRAM Timing Constraints ¢ ¢ DRAM Memory Controllers have to conform to different timing constraints Almost all of these constraints are due to competition for resources at different levels: Within the DRAM banks: rows are sharing sense amplifiers l Within a DRAM device: sharing of I/O gating and control logic l Between different ranks: sharing of data/address/command busses l Reineke et al., Saarland 28 PRET DRAM Controller: Exploiting Internal Structure of DRAM Module l Consists of 4-8 banks in 1-2 ranks • Share only command and data bus, otherwise independent Partition into four groups of banks in alternating ranks l Cycle through 
groups in a time-triggered fashion.
[Diagram: rank 0 with banks 0-3 and rank 1 with banks 0-3, partitioned into four groups of banks in alternating ranks.]
- Successive accesses to the same group obey the timing constraints.
- Reads and writes to different groups do not interfere.
- This provides four independent and predictable resources.

General-Purpose DRAM Controller vs. PRET DRAM Controller

General-purpose controller:
- abstracts DRAM as a single shared resource;
- schedules refreshes dynamically;
- schedules commands dynamically;
- "open page" policy speculates on locality.

PRET DRAM controller:
- abstracts DRAM as multiple independent resources;
- refreshes as reads: shorter interruptions;
- defers refreshes: improves perceived latency;
- follows a periodic, time-triggered schedule;
- "closed page" policy: access-history independence.

Conventional DRAM Controller vs. PRET DRAM Controller: Latency Evaluation
[Figure 9: latencies of the conventional and PRET memory controllers with varying interference from other threads (0 to 3 other threads occupied).]
[Figure 10: latencies of the conventional and PRET memory controllers with maximum load from interfering threads and varying transfer size (1024-byte and 4096-byte transfers).]
The latency of the PRET controller is the same under all thread occupancies. This demonstrates the temporal isolation achieved by the PRET DRAM controller: any timing analysis of the memory latency for one thread only needs to be done in the context of that thread.
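The periodic schedule makes a thread's worst-case memory latency a closed-form property of the schedule alone. The sketch below illustrates this with invented numbers (a 10-cycle slot, four groups); these are not the parameters of the actual controller from the CODES '11 paper.

```python
# Sketch of a time-triggered DRAM schedule: four bank groups, each owning
# one fixed slot per period. A request's worst-case latency depends only on
# the schedule, never on other threads' traffic. Slot length is invented.

SLOT = 10                      # cycles per slot (illustrative)
GROUPS = 4                     # one independent resource per group
PERIOD = SLOT * GROUPS

def latency(group: int, arrival: int) -> int:
    """Cycles from issuing a request to `group` at cycle `arrival` until
    its slot has completed (a request waits for the next slot boundary)."""
    slot_start = group * SLOT                  # group's offset in the period
    wait = (slot_start - arrival) % PERIOD     # time until the slot begins
    return wait + SLOT

# Worst case: arriving one cycle after the group's slot has started.
worst = max(latency(0, t) for t in range(PERIOD))
```

Because the bound is derived from the schedule rather than from observed traffic, adding load from other threads leaves it unchanged, which is the temporal isolation visible in Figure 9.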
PRET DRAM Controller vs. Predator
Predator:
- abstracts DRAM as a single resource;
- uses the standard refresh mechanism.
PRET:
- private resources in the backend;
- "manual" refreshes;
- hiding refreshes.
[Figure 7: latencies for small request sizes up to 256 bytes under Predator and PRET at burst length 4, with and without refreshes.]
⇒ PRET's worst-case access latency for small transfers is smaller than Predator's.
⇒ PRET's drawback: memory is private.

Resulting PRET Architecture: PTARM Memory Hierarchy
We have realized this in the PTARM, a soft core on a Xilinx Virtex-5 FPGA.
[Figure: four hardware threads, each with its own set of registers, feed an interleaved pipeline; an SRAM scratchpad shared among the threads; DRAM main memory with separate banks per thread; I/O devices.]
Note the inverted memory hierarchy compared to multicore: the fast, close memory is shared, while the slow, remote memory is private!

Conclusions
- Real-time computing needs real-time abstractions.
- There is potential for significant improvements in the worst-case performance of some hardware realizations.
- For more information on PRET: http://chess.eecs.berkeley.edu/pret/
[Image: Raffaello Sanzio da Urbino, The School of Athens.]

References
- [ICCD '12] Isaac Liu, Jan Reineke, David Broman, Michael Zimmer, and Edward A. Lee. A PRET Microarchitecture Implementation with Repeatable Timing and Competitive Performance. To appear in Proceedings of the International Conference on Computer Design (ICCD), October 2012.
- [CODES '11] Jan Reineke, Isaac Liu, Hiren D. Patel, Sungjun Kim, Edward A.
Lee. PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), October 2011.
- [DAC '11] Dai Nguyen Bui, Edward A. Lee, Isaac Liu, Hiren D. Patel, and Jan Reineke. Temporal Isolation on Multiprocessing Architectures. In Proceedings of the Design Automation Conference (DAC), June 2011.
- [Asilomar '10] Isaac Liu, Jan Reineke, and Edward A. Lee. PRET Architecture Supporting Concurrent Programs with Composable Timing Properties. In Conference Record of the Forty-Fourth Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, California, November 2010.
- [CASES '08] Ben Lickly, Isaac Liu, Sungjun Kim, Hiren D. Patel, Stephen A. Edwards, and Edward A. Lee. Predictable Programming on a Precision Timed Architecture. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pp. 137-146, Piscataway, NJ. IEEE Press, October 2008.