Improving The Average-Case Using Worst-Case Aware Prefetching
Jamie Garside, Neil C. Audsley
RTNS Versailles 2014

Outline
‣ Background Theory
‣ System Design
‣ Experimental Evaluation
‣ Conclusions & Further Work

Background Theory

Modern Embedded Systems
‣ Due to ever-increasing performance and memory requirements, modern embedded systems are starting to utilise multi-core designs with shared memory.
‣ Shared memory is typically a large off-chip DDR memory, rather than the smaller, faster on-chip memories.
‣ In order to be used in a real-time context, shared memory requires some form of arbitration.
‣ Techniques do exist to analyse standard, non-arbitrated memory controllers in a shared context, although these do not provide any safety guarantees.

Shared Memory Arbitration
‣ Shared memory arbitration is typically achieved through a large, monolithic arbiter next to memory, e.g. TDM, frame-based or credit-based arbitration.
‣ Tasks are typically given a static bandwidth allocation in order to access memory.
‣ Given this bandwidth allocation, the worst-case response time of a memory transaction can be bounded, since the maximum interference is known.
‣ Coupled with a model of the processor, a worst-case execution time can be ascertained and bounded.
But…

Shared Memory Arbitration
‣ A static bandwidth bound may not be suitable for the whole life-cycle of a system. E.g. if a task requires a lot of memory bandwidth in the first 10% of its run-time, the bandwidth bound for the whole life-cycle must be high.
‣ Dynamic bounds can be used, although this complicates system analysis.
‣ Many arbiters can instead be run in work-conserving mode to improve the average case: if a task's bound has been exhausted, its requests are issued at the lowest priority level, and are accepted only if no other task is requesting.
‣ Work conservation is only useful if there is actually some data that needs fetching…

So…Prefetch!
‣ In an ideal scenario, the memory controller should never be idle: there is likely always something it could be doing in order to improve the average case.
‣ We can use prefetch to start fetching data that might be required by a processor ahead of time, such that it arrives in the processor's cache just as it is required. For example, if the processor has requested the data at addresses A, A+1 and A+2, it will likely require the data at address A+3 in the near future, so A+3 should be fetched (sketched below).
‣ If the prefetched data is deemed useful by the processor, the processor/cache notifies the prefetcher with a "prefetch hit" message, and an access for the next data (e.g. A+4) can be dispatched.

But…
‣ This technique is typically not used in real-time systems.
‣ Dynamically issuing memory requests on behalf of processors complicates system analysis due to the inherently random behaviour.
‣ Inserting data directly into a processor's cache may displace useful cache data, effectively invalidating any cache analysis.
‣ We present a prefetcher design and analysis methodology which allows a prefetcher to be utilised within a hard real-time system without harming the system's worst-case execution time.
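To make the mechanism concrete, here is a minimal sketch of the next-line stream detection and prefetch-hit feedback described above, assuming a single tracked stream and line-granularity addresses; the names (stream_t, on_demand_miss, on_prefetch_hit) are illustrative, not taken from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

#define STREAM_CONFIRM 3              /* sequential misses needed to confirm */

typedef struct {
    uint32_t next_line;               /* next cache line we expect to see    */
    uint32_t run_length;              /* consecutive sequential misses seen  */
} stream_t;

/* Called on every demand miss; returns true (and fills *prefetch_line)
 * once a run of STREAM_CONFIRM sequential lines has been observed. */
static bool on_demand_miss(stream_t *s, uint32_t line, uint32_t *prefetch_line)
{
    s->run_length = (line == s->next_line) ? s->run_length + 1 : 1;
    s->next_line = line + 1;

    if (s->run_length >= STREAM_CONFIRM) {
        *prefetch_line = line + 1;    /* A, A+1, A+2 seen -> fetch A+3       */
        return true;
    }
    return false;
}

/* Called when the cache reports a prefetch hit: the speculation was
 * useful, so advance the stream by one line (A+3 hit -> fetch A+4). */
static uint32_t on_prefetch_hit(stream_t *s, uint32_t hit_line)
{
    s->next_line = hit_line + 1;
    return hit_line + 1;
}
```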
Prefetching
‣ To prevent the prefetcher from harming the worst case by issuing a request when it is not appropriate, a feedback mechanism is utilised.
‣ The arbiter notifies the prefetcher of available bandwidth using a prefetch "slot": a blank packet transmitted from the arbiter to the prefetcher. The prefetcher can then either fill in this packet or ignore it.
‣ A prefetch slot is generated whenever a processor should be making a memory request (from the perspective of the system analysis), but isn't. This can happen in two places:

Prefetch Hit Feedback
‣ The first, and most obvious, of these is prefetch hit feedback.
‣ Take a task as a stream of accesses to memory addresses (M1, M2, M3…), separated by amounts of computation (T1, T2, T3…).
‣ If a standard memory fetch has already been prefetched, it is effectively removed from this access stream; on access to this data, the cache generates a prefetch hit notification.
‣ When this data is used by the processor, the memory request that would have taken place without a prefetcher no longer takes place, hence a prefetch slot can be dispatched.

Work-Conservation
‣ Similar to work conservation, a prefetch slot can be dispatched whenever a processor isn't fully utilising its bandwidth bound.
‣ As an example, in a non-work-conserving TDM system, if a processor is not making a request when it enters an active period, a prefetch slot can be dispatched instead (see the sketch below).
[Figure: TDM schedule showing a dispatched slot (St) and requests (Rq Rq Rq)]
‣ All other processors would be blocked anyway due to the nature of TDM.
‣ The current processor has missed its TDM window anyway, hence must wait for its window to re-start and is thus blocked.
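A minimal sketch of this dispatch rule for one TDM window, assuming a simple one-request-per-window schedule; request_pending, forward_demand and send_prefetch_slot are hypothetical helpers, not the paper's implementation.

```c
#include <stdbool.h>

extern bool request_pending(unsigned cpu);  /* demand request waiting?        */
extern void forward_demand(unsigned cpu);   /* arbitrate the request normally */
extern void send_prefetch_slot(void);       /* blank packet to the prefetcher */

/* Called once per TDM window. The window's owner either uses it, or a
 * prefetch slot is dispatched: every other processor is blocked in this
 * window anyway, and the owner has already missed it, so the slot cannot
 * delay anyone relative to the worst-case analysis. */
void tdm_window(unsigned owner)
{
    if (request_pending(owner))
        forward_demand(owner);
    else
        send_prefetch_slot();               /* idle window -> feed prefetcher */
}
```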
Impact on WCET Analysis
‣ Since the arbiter is notifying the prefetcher of slack time in the system, the worst-case execution time analysis need not change, assuming that the worst-case response time of memory does not depend upon the ordering of requests.
‣ This scheme effectively exploits the worst-case blocking figures that a task can experience when it would normally be making a memory request.
‣ Since prefetch slots are generated at the same points, and at the same times, at which memory arbitration would otherwise take place, the timing behaviour of the system remains the same.

System Design

Implementation
‣ A system based on this theory was implemented using the Bluetree network-on-chip, which separates inter-processor communication and memory communication onto separate networks.
‣ Since memory traffic does not need to travel between processors, a tree structure is used to connect processors to memory, with a set of multiplexers at each level of the tree.
‣ Also included is a prefetch cache: a small, simple direct-mapped cache that stores prefetched data, preventing a prefetch from displacing any useful cache information.

Bluetree Arbitration
‣ In order to be timing predictable, each multiplexer in the tree contains a small arbiter.
‣ Each input to the multiplexer has an implicit priority (e.g. the left-hand side is high priority).
‣ The arbiter maintains a "blocking counter": how many times a low-priority packet has been blocked by a high-priority packet.
‣ When this counter reaches a pre-determined value m, a low-priority packet is given priority over a high-priority packet.
[Figure: mux animation: LP1 is blocked by HP1 and then HP2 as the counter climbs (B=0, B=1, B=2); LP1 is then granted ahead of HP3 and the counter resets (B=0, with LP2 waiting)]

Bluetree Slot Dispatching
‣ Slot dispatching can be done in much the same way as detailed previously: when a processor would normally be making a request, a slot can be dispatched instead.
‣ In this context, if a low-priority packet is relayed and there is nothing blocking it, then a prefetch slot should be dispatched.
‣ Similarly, if m high-priority packets have been dispatched and no low-priority packets are waiting, a prefetch slot should be dispatched (see the arbiter sketch below).
[Figure: mux animation: HP1, HP2, HP3 pass with no low-priority packet waiting (B=0, B=1, B=2), after which a prefetch slot (ST) is dispatched]

Prefetcher Design
‣ The prefetcher is a plain stream prefetcher, implemented as a global prefetcher at the root of the tree.
‣ If a processor has requested the memory at addresses A, A+1 and A+2, it will fetch A+3 on behalf of the processor. If this results in a prefetch hit, it will then fetch A+4.
‣ Its position at the root enables it to obtain prefetch slots from all processors, allowing it to perform a prefetch on behalf of any processor when a slot is received.
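The following is a behavioural sketch of one multiplexer's arbitration with the blocking counter and both slot-dispatch rules above. It assumes the counter also counts high-priority packets dispatched while no low-priority packet waits, as the slide animation suggests; emit_slot and the types are hypothetical, not the Bluetree RTL.

```c
#include <stdbool.h>

#define M 3                      /* blocking threshold m (the value used
                                    later in the evaluation)                 */

typedef enum { OUT_HP, OUT_LP, OUT_IDLE } out_t;

typedef struct {
    bool hp_ready, lp_ready;     /* packet waiting on each input?            */
    int  b;                      /* blocking counter                         */
} mux_t;

extern void emit_slot(void);     /* hypothetical: send a blank prefetch slot */

/* One arbitration round at a single Bluetree multiplexer. */
out_t arbitrate(mux_t *mx)
{
    if (mx->hp_ready && mx->lp_ready) {
        if (mx->b >= M) {        /* LP already blocked m times:              */
            mx->b = 0;           /* it now wins over the HP packet           */
            return OUT_LP;
        }
        mx->b++;                 /* LP blocked by HP once more               */
        return OUT_HP;
    }
    if (mx->lp_ready) {
        /* Rule 1: LP relayed with nothing blocking it, so the HP
         * interference budgeted by the analysis is spare -> slot.           */
        emit_slot();
        return OUT_LP;
    }
    if (mx->hp_ready) {
        if (mx->b >= M) {
            /* Rule 2: m HP packets dispatched with no LP waiting, so the
             * LP turn budgeted by the analysis is spare -> slot.            */
            mx->b = 0;
            emit_slot();
        } else {
            mx->b++;             /* count HP packets since last LP/slot      */
        }
        return OUT_HP;
    }
    return OUT_IDLE;
}
```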
Experimental Evaluation

Implemented Design
‣ This design was implemented on a 16-core Microblaze system on a Xilinx VC707 evaluation board, with a 1GB off-chip DDR3 memory module.
‣ Each processor and the memory interconnect ran at 100MHz; the memory controller was clocked at 200MHz.
‣ The memory interconnect ran with m=3, i.e. a low-priority packet can be blocked by at most three high-priority packets.

Evaluation Methodology
‣ The system was evaluated using two hardware configurations:
‣ 1) Sixteen systems, in each of which a single processor was connected to one of the possible connections on the tree; the rest were synthetic traffic generators which issued a memory request every cycle in order to simulate a fully loaded tree.
‣ 2) Sixteen Microblaze processors, all connected to the tree.
‣ The system was then evaluated using two software stacks:
‣ 1) Software traffic generators, which issued requests for subsequent cache lines with a delay between each fetch (sketched below).
‣ 2) A selection of benchmarks from the TACLeBench suite.
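A minimal sketch of such a traffic generator, assuming a 32-byte cache line and a crude spin loop for the inter-fetch delay; shared_region and all constants are illustrative.

```c
#include <stdint.h>

#define CACHE_LINE 32u                       /* bytes per line (assumed)     */

extern volatile uint8_t shared_region[];     /* large buffer in off-chip DDR */

static void spin(uint32_t cycles)            /* crude computation delay      */
{
    for (volatile uint32_t i = 0; i < cycles; i++)
        ;
}

/* Touch the first byte of each successive cache line, with a configurable
 * amount of "computation" between fetches, so that each iteration produces
 * one demand fetch of the next line. */
void traffic_generator(uint32_t lines, uint32_t delay)
{
    for (uint32_t i = 0; i < lines; i++) {
        uint8_t tmp = shared_region[i * CACHE_LINE];  /* fetch next line     */
        (void)tmp;
        spin(delay);
    }
}
```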
S/W Traffic Generators in High-Load System
[Plots: traffic-generator results for Index 0 and Index 2]
‣ The prefetcher improves the execution time of the traffic generators in most systems.
‣ As the delay reduces, prefetches start to become coalesced with their demand accesses, explaining the noisy behaviour for Index 0.
‣ In Index 12, the prefetch-off "steps" are caused by blocking on the tree.

S/W Traffic Generators in High-Load System
[Plots: traffic-generator results for Index 12 and Index 13]
‣ At higher indices (i.e. lower priority), the additional blocking leads to larger "steps".
‣ The prefetch-on line exhibits some "spikes" too: these occur when a prefetch cannot be coalesced with its demand access, hence there is a performance degradation.
‣ Eventually, at higher indices and higher loads, the prefetcher is ineffective: there is so much blocking that a prefetch slot cannot be dispatched before its corresponding memory access.

S/W Traffic Generators in 16-Core System
[Plots: traffic-generator results for Index 0 and Index 2]
‣ The lines in a 16-core system show similar results.
‣ Due to the increase in available bandwidth, the improvement brought about by the prefetcher can still be seen at a lower delay.

S/W Traffic Generators in 16-Core System
[Plots: traffic-generator results for Index 12 and Index 13]
‣ Even at higher indices, the prefetcher is still effective: enough prefetch slots are dispatched that prefetches can always be issued.
‣ The "step" effect is still present, but less apparent, simply because there is more bandwidth available and hence less blocking in the prefetch-off case.

TACLeBench on High-Load System
‣ TACLeBench shows a good speedup for many tasks.
‣ Quite a few tasks show no speedup for CPUs 7, 11 and 13: these CPUs suffer the highest amount of blocking in the system, so typically a prefetch slot cannot be dispatched before the processor requests the data normally.
‣ basicmath and rijndael have large math routines exceeding the I-cache size, hence are good candidates for prefetch.
‣ crc and sha have input data larger than the D-cache, hence are also good candidates for prefetch.
‣ The rest are mostly computation-dominated, although they can still benefit from prefetch.

TACLeBench on 16-Core System
‣ Sadly, prefetch is not good for all tasks.
‣ The time between memory requests in crc and basicmath is large enough that the prefetcher can still be useful.
‣ For some benchmarks, such as rijndael and gsm, the computation time between requests is so small that a prefetch is dispatched just after the data has already been fetched, contributing to a performance degradation. Note that the execution time is still better than the worst case, however.
‣ Other tasks are still improved, but not by as much, due to the lower amount of contention on the tree.

Conclusions & Further Work

Conclusions & Further Work
‣ We present a prefetcher that can be used within the context of real-time embedded systems without harming the worst-case execution time.
‣ Many of the issues with the prefetcher in real-world applications arise from prefetches being dispatched too late. Typically, this can be fixed by fetching further ahead of the stream.
‣ Adaptive techniques can be used: e.g. Srinath et al. 2007 propose a prefetcher that adapts its parameters based on the accuracy and timeliness of prefetches.
‣ In the tree-based system, we can also turn off prefetch slot generation at certain levels of the tree in order to rate-limit the number of prefetches generated.

Conclusions & Further Work
‣ It is also possible to utilise the prefetcher in order to improve the worst-case execution time.
‣ The prefetcher can either be allocated a bandwidth bound, which can be rolled into the system analysis, or the number of slots generated by a processor's lack of requests can be ascertained.
‣ Using this, the worst-case inter-prefetch time can be derived. This can then be used to ascertain when a prefetch will be dispatched for a given reference stream, in the worst case, and hence be rolled into the worst-case execution time analysis to reduce the WCET of a task.

Any Questions?