Inter-Core Cooperative TLB Prefetchers for Chip Multiprocessors
Abhishek Bhattacharjee and Margaret Martonosi
Department of Electrical Engineering, Princeton University
ASPLOS'10

TLB management
• Hardware-managed TLB
  – No need for expensive interrupts
  – Pipeline remains largely unaffected
  – OS cannot employ an alternate design
• Software-managed TLB
  – Data structure design is flexible since the OS controls the page table walk
  – The miss handler is itself a sequence of instructions
    • It may itself miss in the instruction cache
  – Data cache may be polluted by the page table walk

Multiprocessor TLB miss
• A CMP maintains per-core instruction and data TLBs.
• Significant similarities exist in TLB miss patterns among multiple cores.

Predictable TLB Miss Pattern
• Inter-core Shared (ICS) TLB Misses
  – A miss on a translation already accessed by a previous miss on any of the other cores, with the same virtual page, physical page, context ID, and page size
  – Exploited by Leader-Follower prefetching
• Inter-core Predictable Stride (ICPS) TLB Misses
  – A miss has a stride of S if its virtual page V+S differs by S from the virtual page V of the preceding matching miss
    • Core 0 TLB miss virtual pages: 3, 4, 6, 7
    • Core 1 TLB miss virtual pages: 7, 8, 10, 11
  – On both cores, the distances between successive misses are 1, 2, 1
    • Although the cores are missing on different virtual pages, they both have the same distance pattern in their misses
  – Exploited by distance-based cross-core prefetching

Leader-Follower Prefetching
• If a core (the leader) TLB misses on a particular virtual page entry, other cores (the followers) will typically also TLB miss on the same virtual page eventually
• The leader therefore pushes the virtual page entry to the followers
• The entry is not inserted directly into a follower's TLB, but into a small, separate Prefetch Buffer (PB)
  – A bad prefetch may be harmful in that it is never used
  – A prefetch may also be harmful in that it evicts existing PB entries too early

Leader-Follower Prefetching
• Case 1
  – D-TLB miss / PB hit on core 0
    • Remove the entry from core 0's PB
    • Add the entry to its TLB
• Case 2
  – D-TLB miss / PB miss on core 1
    • The translation is located and refilled into the D-TLB
    • It is also prefetched (pushed) into the PBs of the other cores

Leader-Follower Prefetching
• Prefetching a translation into all the follower cores every time a TLB and PB miss occurs on the leader core
  – This approach may be over-aggressive
• Confidence estimation
  – 2-bit saturating counters
    • Core 0 has counters for cores 1 to N-1
    • If a follower's B-bit confidence counter is greater than or equal to 2^(B-1), prefetch to that follower

Leader-Follower Prefetching
• Case 1
  – PB hit on core 0; the PB entry is inserted into the D-TLB
  – Identify the initiating core (core 1)
  – Increment core 1's confidence counter corresponding to core 0
• Case 2
  – D-TLB / PB miss on core 1
  – Check whether the confidence counter ≥ 2^(B-1)
  – If core 1's counter corresponding to core 0 meets this threshold, core 1 pushes the translation into core 0's PB
• Case 3
  – A PB entry is evicted from core N-1 without being used
  – Core N-1 sends a "bad prefetch" message to the core that initiated this entry (core 1)
  – Core 1's counter corresponding to core N-1 is decremented

Distance-Based Cross-Core Prefetching
• Although the cores are missing on different virtual pages, they can both have the same distance pattern in their misses
• Record repetitive distance-pairs to find the next predicted distance and hence the next virtual pages
  – Find the stride patterns

Distance-Based Cross-Core Prefetching
• 1. On a PB miss, calculate the current distance (current TLB miss virtual page - last TLB miss virtual page)
• 2. Look up the distance table (DT) using the current distance and the last distance
• 3. The DT extracts predicted future distances from the stored distance-pairs
  – (1,2), (2,1), …
• 4. The predicted distances are used to calculate the corresponding virtual pages, which are inserted into the PB

Result
• 16 entries in PB, average 46%
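
The three Leader-Follower cases with confidence estimation can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the names `Core`, `access`, and `push` are hypothetical, the TLB is modeled as an unbounded set, the PB uses FIFO replacement, and counters are assumed to start at the threshold 2^(B-1) so that prefetching begins enabled.

```python
from collections import OrderedDict


class Core:
    """One core's D-TLB, prefetch buffer (PB), and per-follower confidence counters."""

    def __init__(self, core_id, n_cores, pb_entries=16, bits=2):
        self.core_id = core_id
        self.tlb = set()                   # assumption: unbounded TLB, for simplicity
        self.pb = OrderedDict()            # virtual page -> id of the core that pushed it
        self.pb_entries = pb_entries
        self.max_count = (1 << bits) - 1   # saturation value of a B-bit counter
        self.threshold = 1 << (bits - 1)   # prefetch only when counter >= 2^(B-1)
        # Assumption: counters start at the threshold.
        self.confidence = {c: self.threshold for c in range(n_cores) if c != core_id}


def push(cores, leader_id, follower_id, vpage):
    """Insert a prefetched translation into a follower's PB (FIFO eviction is an assumption)."""
    follower = cores[follower_id]
    if vpage in follower.tlb or vpage in follower.pb:
        return
    if len(follower.pb) >= follower.pb_entries:
        # Case 3: an unused entry is evicted; its initiator is told of the bad prefetch.
        _, initiator_id = follower.pb.popitem(last=False)
        initiator = cores[initiator_id]
        initiator.confidence[follower_id] = max(0, initiator.confidence[follower_id] - 1)
    follower.pb[vpage] = leader_id


def access(cores, core_id, vpage):
    """Simulate one D-TLB access; returns 'tlb_hit', 'pb_hit', or 'miss'."""
    core = cores[core_id]
    if vpage in core.tlb:
        return "tlb_hit"
    if vpage in core.pb:
        # Case 1: PB hit. Move the entry into the TLB and reward the initiating core.
        leader = cores[core.pb.pop(vpage)]
        core.tlb.add(vpage)
        leader.confidence[core_id] = min(leader.max_count, leader.confidence[core_id] + 1)
        return "pb_hit"
    # Case 2: D-TLB and PB miss. Refill locally, then push to confident followers.
    core.tlb.add(vpage)
    for follower_id, count in list(core.confidence.items()):
        if count >= core.threshold:
            push(cores, core_id, follower_id, vpage)
    return "miss"
```

With two cores and a one-entry PB, a miss on page 7 by core 1 followed by an access to page 7 on core 0 produces a PB hit and raises core 1's counter for core 0; a run of pushes that core 0 never uses lowers the counter until core 1 stops prefetching to it.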
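
The four distance-based prefetching steps can likewise be sketched in a few lines. Again this is an illustrative model, not the paper's implementation: `DistanceTable` and `CorePrefetcher` are hypothetical names, the table is assumed to be shared by all cores, and each entry is assumed to keep the two most recent distances that followed a given distance.

```python
from collections import defaultdict, deque


class DistanceTable:
    """Shared table mapping a distance to the distances observed right after it."""

    def __init__(self, slots=2):
        # Assumption: each entry keeps only the `slots` most recent successors.
        self.successors = defaultdict(lambda: deque(maxlen=slots))

    def lookup_and_update(self, last_distance, current_distance):
        # Steps 2 and 3: look up the distances predicted to follow the current one.
        predicted = list(self.successors.get(current_distance, ()))
        # Record the pair (last_distance, current_distance) for future lookups.
        if last_distance is not None:
            entry = self.successors[last_distance]
            if current_distance not in entry:
                entry.append(current_distance)
        return predicted


class CorePrefetcher:
    """Per-core state: the last missing virtual page and the last distance."""

    def __init__(self, table):
        self.table = table
        self.last_page = None
        self.last_distance = None

    def on_pb_miss(self, vpage):
        """Return the virtual pages to prefetch into this core's PB."""
        if self.last_page is None:
            self.last_page = vpage
            return []
        # Step 1: current distance = current miss page - last miss page.
        current = vpage - self.last_page
        predicted = self.table.lookup_and_update(self.last_distance, current)
        self.last_page, self.last_distance = vpage, current
        # Step 4: turn predicted distances into virtual pages.
        return [vpage + d for d in predicted]
```

Feeding this model the miss streams from the ICPS example (core 0 on pages 3, 4, 6, 7, then core 1 on pages 7, 8, 10, 11) stores the pairs (1,2) and (2,1) while core 0 runs, so core 1 is told to prefetch page 10 when it misses on page 8 and page 11 when it misses on page 10.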