(DS)² Locality Models: Dependence, Stream, Distance, and Stride
CSC 573 Lecture, Feb. 8, 2005
Adapted from "Program Locality Models and Their Use in Memory Performance Optimization," a tutorial at ASPLOS 2004 by Steve Carr (Michigan Tech), Trishul Chilimbi (Microsoft Research), Chen Ding (U. of Rochester), and Youfeng Wu (Intel MTL).

Motivation
• Compilers
  – effective for scalars and for loop nests with linear index expressions
  – need to approximate branches, recursion, and indirect data access
• Profiling
  – accurate for one input
  – need to predict behavior for other inputs
• Run-time analysis
  – finds input-dependent patterns
  – needs to be efficient

Program Locality
• Computer speed improvement
  – powered by Moore's law
  – supercomputers [Allen & Kennedy, Optimizing Compilers, page 2]
• "Memory wall" [Wulf & McKee, CAN 95]
• Software solutions
  – improving cache utilization (temporal locality, spatial locality): reduces both latency and bandwidth
  – prefetching: reduces latency but not bandwidth
• Need to understand program locality
  – when do data accesses happen
  – to what data

(DS)² Locality Models
• Dependence
  – dependences among data references in a program
  – static measurement of frequency, distance, and stride
• Stream
  – frequent sequences of data access in a trace
• Distance
  – patterns of the reuse distance of data access in a trace
• Stride
  – patterns of the stride of data access in a trace

Distance Model
The Distance Model and Its Uses
Chen Ding, Computer Science Department, University of Rochester, Rochester, New York

The Distance of Data Reuse
• Stack distances
  – "Evaluation techniques for storage hierarchies" by R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger, IBM Systems Journal, volume 9 (2), 1970, pp. 78-117
  – one-pass evaluation of hardware configurations
  – basis for memory management and cache design
• LRU stack distance between two accesses
  – the volume of data accessed in between
• The reuse distance of an access
  – the LRU stack distance between the access and the previous access to the same data

Reuse Distance
• Definition
  – the number of distinct elements between this and the previous access to the same data
  – measures volume, not time: in the trace b c a a c b, the last access to b has reuse distance 2 (the distinct elements a and c are accessed in between), even though four accesses separate the two b's
• A good notion of program locality
  – bounded
  – independent of hardware
  – independent of variations in coding and data allocation

Measuring Reuse Distance
• Naïve counting: O(N) time per access, O(N) space
  – N is the number of memory accesses; M is the number of distinct data elements
  – too costly: up to 120 billion accesses and up to 25 million data elements in our tests
• Precise methods
  – stack algorithm [Mattson+ IBM 70]: O(M) time per access, O(M) space
  – search tree [Olken LBL 81, Sugumar & Abraham UM 93]: O(log M) time per access, O(M) space
  – the space cost remains a major problem
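Before turning to the approximation, it may help to see the exact computation written out. The sketch below is mine, not code from the tutorial; the trace b c a a c b is the slide's example. It keeps, for every distinct address, the time of its most recent access, so the reuse distance of an access is the number of stored times newer than that address's previous access time. Counting with std::distance makes each lookup O(M), which is exactly the per-access cost the tree-based precise methods above cut to O(log M) by storing subtree counts.

    // Minimal exact reuse-distance measurement (illustrative sketch only).
    #include <cstdio>
    #include <iterator>
    #include <map>
    #include <set>
    #include <vector>

    int main() {
        std::vector<char> trace = {'b', 'c', 'a', 'a', 'c', 'b'};  // slide's example trace
        std::map<char, long> lastTime;   // address -> time of its most recent access
        std::set<long> lastTimes;        // most recent access time of every distinct address
        long t = 0;
        for (char addr : trace) {
            ++t;
            auto it = lastTime.find(addr);
            if (it == lastTime.end()) {
                std::printf("%c: first access (infinite distance)\n", addr);
            } else {
                // Distinct addresses touched since the previous access to addr are
                // exactly those whose stored last-access time is newer than it->second.
                long dist = std::distance(lastTimes.upper_bound(it->second), lastTimes.end());
                std::printf("%c: reuse distance %ld\n", addr, dist);
                lastTimes.erase(it->second);
            }
            lastTimes.insert(t);
            lastTime[addr] = t;
        }
        return 0;
    }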
Approximate Measurement
• Tree node = (time range, weight, capacity, size)
• Basic idea
  – measure only the first few digits of a long distance
  – use non-unit-size tree nodes
    • tree size = M / average node size
    • the node size bounds the measurement error
• Guaranteed relative accuracy
  – a ≤ measured distance / actual distance ≤ 1, e.g. a = 99%
  – logarithmic space cost if a < 1
• Hashtable cost
  – the space problem was solved by Bennett & Kruskal in 1975
  – not considered in this discussion

Approximation Example
• Three tree nodes, each labeled (time range, weight, capacity, size):
    (1-7, 4, 6, 4)   (8-10, 7, 2, 2)   (11-11, 1, 1, 1)
• The three tree nodes have capacities 1, 2, and 6, which guarantees 33% accuracy.
• To measure a reuse of b whose previous access time is 4:
  – search the tree for time 4; it falls in the range (1-7), whose node size is 4, so the error in the distance is at most 4
  – set d to 0 first, then add the size of node (8-10): d += 2
  – add the weight of node (11-11): d += 1
  – the measured distance is 3, which is 60% of the actual distance

Measurement of Reuse Distance
  Algorithm                                            Time             Space
  Naïve counting                                       O(N²)            O(N)
  Trace as a stack (or list) [IBM '70]                 O(NM)            O(M)
  Trace as a vector [IBM '75, UIUC '02]                O(N log N)       O(N)
  Trace as a tree [Berkeley '81, Michigan '93,
    UIUC '02]                                          O(N log M)       O(N)
  List-based aggregation [Wisconsin '91]               O(NS)            O(M)
  Approximate tree [Rochester '03]                     O(N log log M)   O(log M)

Analysis Accuracy and Speed
• [Figure: reuse-distance histogram for FFT — the 99% approximation closely tracks the accurate measurement]
• [Figure: analysis speed in millions of references per second on a 1.7 GHz Pentium 4, comparing Bennett-Kruskal, the accurate tree (65,783 tree nodes), 99.9% approximation (5,869 tree nodes), and 99% approximation (823 tree nodes) across data sizes 55K-66K]

Pattern Recognition and Prediction
• Pattern analysis
  – predictable and unpredictable parts
  – constant, linear, and sub-linear patterns
  – correlation among training inputs: distance is independent of code and data
  – regression analysis
  – distance-based sampling
• Issues
  – granularity of measurement and prediction
  – the separation of patterns
  – high-dimensional data
  – predicting locality but not time
• [Figure: Spec2K/Lucas reuse-distance histograms (% references vs. log-scale reuse distance) for train 1, train 2, the prediction, and the test input]
• 95.1% accuracy and 99.6% coverage for Lucas; average 94% accuracy for all programs

Prediction and Largest Simulation
• [Figure: Prediction — SP reuse miss rate for a 1M cache, predicted vs. train 1 and train 2, plotted against data size in cache blocks]
• [Figure: Largest Simulation — miss rate for set-associative (SA_1, SA_2, SA_4, SA_8) and fully associative (FA) configurations against data size in cache blocks]

A Web-based Interactive Tool
• http://www.cs.rochester.edu/research/locality

Cross-architecture Performance Prediction
• Gabriel Marin and John Mellor-Crummey
  – "Cross-architecture performance prediction for scientific applications using parameterized models"
  – a full paper in SIGMetrics 2004
• Methodology
  – reuse distance models
  – compiler support at program, routine, and loop scopes
• Results
  – Sweep3D from ASCI; LU, BT, and SP from NASA
  – measured using SPARC binaries
  – predict L1, L2, and TLB misses and running time on an Origin 2000

Cache Hint Selection
• Kristof Beyls and Erik H. D'Hollander from Ghent University
  – "Reuse distance-based cache hint selection"
  – EuroPar 2002
• Methodology
  – profile the reuse distance of each memory reference
  – predict L1, L2, and L3 miss rates
  – insert hints based on a threshold
• Evaluation
  – Intel Itanium, Spec95fp benchmarks
  – 7% average performance improvement
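The prediction and cache-hint studies above all turn a reuse-distance histogram into a miss-rate estimate, using the fact that in a fully associative LRU cache of C blocks an access hits exactly when its block-level reuse distance is less than C. The sketch below is mine, with invented histogram numbers; bins whose distances reach at least C are counted entirely as misses, a coarse bin-level approximation rather than the tutorial's method.

    // Estimate the miss rate of a fully associative LRU cache of C blocks
    // from a reuse-distance histogram measured at cache-block granularity.
    #include <cstdio>
    #include <vector>

    struct Bin { long long maxDistance; double fraction; };  // fraction of references in bin

    double predictMissRate(const std::vector<Bin>& histogram, long long cacheBlocks) {
        double missRate = 0.0;
        for (const Bin& b : histogram)
            if (b.maxDistance >= cacheBlocks)   // reuse farther than the cache can hold
                missRate += b.fraction;
        return missRate;
    }

    int main() {
        // Hypothetical histogram: 60% of references reuse within 1K blocks, and so on;
        // the last bin stands for cold misses and very long reuses.
        std::vector<Bin> hist = {{1LL << 10, 0.60}, {1LL << 15, 0.25},
                                 {1LL << 20, 0.10}, {1LL << 40, 0.05}};
        for (long long c : {1LL << 12, 1LL << 15, 1LL << 17})
            std::printf("cache of %lld blocks: predicted miss rate %.1f%%\n",
                        c, 100.0 * predictMissRate(hist, c));
        return 0;
    }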
Per-Instruction Locality
• C. Fang, S. Carr, S. Onder, and Z. Wang from Michigan Tech
  – "Reuse-distance-based Miss-rate Prediction on a Per Instruction Basis"
  – Workshop on Memory System Performance (MSP), 2004
• Methodology
  – reuse distance models for each memory reference
  – predict the references that cause most cache misses
• Results
  – Spec2Kfp benchmarks
  – over 90% of references can be predicted with 97% accuracy
  – ~2% of load instructions account for 95% of misses

Other Distance-Based Studies
• Register and cache performance modeling
  – Li et al. from Purdue, Interact 1996
  – Huang and Shen, CMU, Micro 1996
  – Beyls and D'Hollander from Ghent, PDCS 2001
  – Almasi et al. from Illinois, MSP 2002
  – Zhong et al. from Rochester, LCR 2002
• File caching
  – Zhou et al. from Princeton, USENIX 2001
  – Jiang and Zhang from William and Mary, SIGMetrics 2002

Comparison with Other Locality Models
• With compiler dependence analysis
  – distance analysis is generally applicable
  – but it gives less structured and incomplete information for regular loop nests
  – and it cannot reorder computation
• With frequency profiling
  – distance analysis finds repetitions at a larger granularity
  – more temporal information
  – no prediction of access order
• With stride analysis
  – distance analysis predicts reuses
  – no prediction of access order

Summary: Distance Model
• Distance-based locality analysis
  – more general than compiler analysis
  – more structured than traditional profiling
  – best for analyzing long-range program behavior patterns
• Uses
  – whole-program locality prediction
  – affinity-based structure and array reorganization
  – phase prediction for pro-active system adaptation

Dependence Model
Dependence-based Locality Analysis and Optimization
Copied from slides by Steve Carr, Department of Computer Science, Michigan Technological University, Houghton, Michigan

Dependence and Loops
• Iteration vector: a vector of the values of the loop control variables
  – one entry per variable
  – indexed from outermost to innermost
• The set of all iteration vectors for a loop nest is called the iteration space

    DO I = 1, 10
      DO J = 3, 5
        DO K = 6, 9
          A(I,J,K) = ...

  When I = 1, J = 4, and K = 7, the iteration vector is 〈1,4,7〉.

Dependence Distance Vector
• If there is a dependence from reference R1 on iteration i of a loop nest to reference R2 on iteration j of the nest, then the dependence distance vector is d(i,j) = j − i.

    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = A(I-3, J-1)

  A(2,2) is accessed by A(I,J) on iteration 〈2,2〉 and by A(I-3,J-1) on iteration 〈5,3〉, so the distance vector is 〈3,1〉.
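A tiny brute-force check of the example above (my code, not from the slides): walk the iteration space in loop order, remember which iteration wrote each element, and whenever A(I-3,J-1) reads a previously written element, record the read iteration minus the write iteration.

    // Brute-force check of the dependence distance vector for
    //   DO I = 1,100: DO J = 1,100: A(I,J) = A(I-3,J-1)
    // The expected output is a single distance vector <3,1>.
    #include <cstdio>
    #include <map>
    #include <set>
    #include <utility>

    int main() {
        const int N = 100;
        std::map<std::pair<int,int>, std::pair<int,int>> writeIter;  // element -> writing iteration
        std::set<std::pair<int,int>> distances;
        for (int i = 1; i <= N; ++i)
            for (int j = 1; j <= N; ++j) {
                writeIter[{i, j}] = {i, j};                // A(I,J) written on iteration (i,j)
                auto r = writeIter.find({i - 3, j - 1});   // element read by A(I-3,J-1)
                if (r != writeIter.end())
                    distances.insert({i - r->second.first, j - r->second.second});
            }
        for (const auto& d : distances)
            std::printf("distance vector <%d,%d>\n", d.first, d.second);
        return 0;
    }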
Cache Line (Block)
• The unit of memory in a cache is a line (block)
• A line may hold one or more words (a word is typically 4 or 8 bytes); e.g., a 32-byte cache line
• Accessing any member of the line brings the entire line into the cache (replacing whatever was there previously)
• Interference: multiple lines that map to the same cache location and need to be in the cache at the same time

Temporal Reuse
• Temporal reuse: reuse of the same memory location

    DO I = 2, N
      A(I) = A(I-1)
    ENDDO

  The same element is accessed one iteration apart. Note that the true dependence from A(I) to A(I-1) tells us that the temporal reuse exists.

Spatial Reuse
• Spatial reuse: reuse of a cache block by a nearby memory reference

    DO I = 1, N
      A(I) = ...
    ENDDO

  The same cache line is accessed on successive iterations.

Self Reuse
• Self reuse: reuse that arises from a single static reference
  – self spatial: DO I = 1, N: A(I) = ... accesses the same cache line on successive iterations
  – self temporal: DO I = 1, N: DO J = 1, N: A(I) = ... accesses the same cache location on successive iterations of J

Group Reuse
• Group reuse: reuse that arises from multiple static references, e.g. in

    DO I = 1, N
      A(I) = A(I-1)
    ENDDO

  – group temporal: A(I-1) and A(I) access the same element one iteration apart
  – group spatial: A(I) and A(I-1) access the same cache line on the same iteration

Dependence-based Reuse Analysis
• McKinley, Carr & Tseng 96
  – compute reuse across the innermost loop only
  – self reuse: examine the subscripts
    • temporal: the innermost induction variable is missing from the subscripts
    • spatial: the innermost induction variable appears only in the first subscript position
  – group reuse: look at references connected by a dependence that is carried by the innermost loop or is loop independent

RefGroup
• RefGroups represent group reuse: all references in one reference group have some kind of group reuse
• Two references R1 and R2 are in the same reference group with respect to loop L if:
  – R1 δ R2 (group-temporal reuse), where δ is a loop-independent dependence, or the entry δ_L is a small constant d and all other entries are zero; or
  – R1 and R2 differ in the first subscript dimension by a small constant d and all other subscripts are identical (group-spatial reuse)
  (d is determined by the cache line size and the array element size)

RefGroup Leader
• The leader is the reference that brings in the cache lines accessed by the other members of the group
• The leader is
  – the reference without an incoming dependence, or
  – a reference whose outgoing edges are all loop carried with invariant references at source and sink, or whose incoming edges all come from references that are not in the RefGroup
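Before the leader example, here is a small simulation (mine; the 8-byte elements and 32-byte lines, i.e. cls = 4, are assumptions) that counts the distinct cache lines touched by a single reference in the three self-reuse cases above. The three counts — 1, trip/cls, and trip — are exactly the per-reference costs that the RefGroup cost rule below assigns to self-temporal, self-spatial, and no-reuse references.

    // Count the distinct cache lines touched by one array reference over n
    // iterations of the innermost loop; stride is in elements per iteration.
    #include <cstdio>
    #include <set>

    long linesTouched(long n, long strideElems) {
        std::set<long> lines;
        for (long i = 0; i < n; ++i)
            lines.insert((i * strideElems * 8) / 32);   // byte offset -> 32-byte line number
        return (long)lines.size();
    }

    int main() {
        long n = 1000;
        std::printf("self-temporal (invariant reference, stride 0): %ld line\n",  linesTouched(n, 0));
        std::printf("self-spatial  (unit stride):                   %ld lines\n", linesTouched(n, 1));
        std::printf("no self-reuse (stride of a full line or more): %ld lines\n", linesTouched(n, 4));
        return 0;
    }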
Leader Example

    DO K = 1, N
      DO J = 1, N
        DO I = 1, N
          A(I,J,K) = B(I,J+1,K) + B(I,J,K) + B(I+1,J,K)

• The dependences among the B references have distance vectors 〈0,0,1〉, 〈0,1,0〉, and 〈0,1,-1〉.
• Leaders(K) = {A(I,J,K), B(I+1,J,K), B(I,J+1,K)}
• Leaders(J) = {A(I,J,K), B(I+1,J,K)}
• Leaders(I) = {A(I,J,K), B(I+1,J,K), B(I,J+1,K)}

Loop Cost
• Loop cost gives the number of cache lines brought into the cache if a given loop is innermost; it relates to the locality of that loop.
• Given the RefGroups, compute the cost of loop L, LC(L), by summing the cost RC(R) of each RefGroup, where R is the leader of the RefGroup:

    LC(L) = Σ_R RC(R) × Π_{h ≠ L} trip_h

  where trip_h is the number of iterations of loop h.

RefGroup Cost
• For a RefGroup leader R_k and candidate innermost loop L:

    RC(R_k) = 1                        if R_k is self-temporal
            = trip_L / (cls / stride)  if R_k is self-spatial
            = trip_L                   if R_k has no self-reuse

  where cls is the cache line size in array elements.

Loop Cost Example
• For the nest above, using the leaders listed in the leader example:
  – LC(K) = 3N³
  – LC(J) = 2N³
  – LC(I) = 3N³ / cls

How To Use Loop Cost
• The loop cost gives the number of cache lines accessed by a loop as if it were innermost
• The number of cache lines is a measure of locality: lower loop cost implies better locality
• Order the loops by decreasing cost from outermost to innermost

Uniformly Generated Sets
• Gannon, Jalby, and Gallivan 1988
• Definition: let n be the depth of a loop nest and d the number of dimensions of an array A. Two references A(f(i)) and A(g(i)), with index functions f, g: Zⁿ → Z^d, are called uniformly generated if

    f(i) = H i + c_f   and   g(i) = H i + c_g

  where H is a linear transformation and c_f and c_g are constant vectors.

Localized Vector Space
• Wolf and Lam 1991
• A loop nest of depth n corresponds to a finite convex polyhedron in Zⁿ bounded by the loop bounds.
• The loop iterations that exploit reuse and have locality are called the localized iteration space.
• If we abstract away the loop bounds, we have a localized vector space L.

Nearby Permutation Example

    DO I = 1, 100
      DO J = 1, 100
        DO K = 1, 100
          A(I,J,K) = A(I+1,J-1,K) + B(J,I,K)

• The dependence on A has distance vector 〈1,-1,0〉 in (I,J,K) order.
• The order K,J,I is illegal (it would give the lexicographically negative vector 〈0,-1,1〉); use K,I,J instead, which gives 〈0,1,-1〉.

Application - Loop Tiling

    do j = 1, n
      do i = 1, n
        do k = 1, n
          c(i,j) += a(i,k) * b(k,j)

  becomes

    do ii = 1, n, is
      do kk = 1, n, ks
        do j = 1, n
          do i = ii, min(ii+is-1, n)
            do k = kk, min(kk+ks-1, n)
              c(i,j) += a(i,k) * b(k,j)

Application - Software Prefetching and Scheduling

    do i = 1, n
      do j = 1, n
        x(i) += m(i,j)

  becomes

    do i = 1, n
      do j = 1, n
        prefetch(m(i,j+c))
        x(i) += m(i,j)

Limits of Compile-time Analysis
• Not all information is known at compile time
  – loop bounds
  – cache interference
• What if references are not uniformly generated?
  – index arrays
• Non-scientific code

Stream Model
The Hot Data Stream Model
Reformatted from slides by Trishul Chilimbi, Runtime Analysis & Design Group
http://research.microsoft.com/~trishulc/Daedalus.htm
Exploitable Locality
• Exploitable locality = reference skew + regularity
  – reference skew: abcacbdbaecfbbbcgaafadcc
  – regularity: abchdefabchiklfimdefmklf
  – reference skew + regularity (exploitable locality): abcabcdefabcgabcfabcdabc

SEQUITUR (Nevill-Manning & Witten)
• Input sequence: aaabac aaabac aaabac aaabac aaabad aaabad aaabad aaabad aaabad aa
• SEQUITUR grammar:

    S -> B B D D C a a
    A -> a a a b a c
    B -> A A
    C -> a a a b a d
    D -> C C

Computing Exploitable Locality
• The data reference trace (raw addresses such as 0x32cc00, 0x53cd1e, ...) is first abstracted into a symbolic trace (a b a c a d b a ...).
• SEQUITUR serves as the regularity detector: it compresses the abstracted trace into a grammar, a DAG representation of the trace.
• From the grammar, whole program streams (WPS) and hot data streams are extracted for the hot data stream analyses.

Efficient Offline Implementation
• Graph summarization of the SEQUITUR grammar into whole program streams (WPS)
• Eliminating noise increases repetitions
• Inter-stream analysis on the summarized streams

Hot Data Stream Size (Regularity)
• [Figure: weighted average number of elements in hot data streams for each benchmark (gcc, parser, perlbmk, vortex, twolf, and others)]

Efficient Online Implementation
• Low-overhead profiling
  – bursty tracing
  – dynamic optimization extensions
• Low-overhead analysis
• Bursty tracing: duplicate the basic blocks of each procedure (an original copy proc1 and an instrumented copy prof$proc1), insert dispatch checks, and alternate between an awake phase (nAwake executions through the instrumented copy) and a hibernate phase (nHibernate executions of the original code).
• Dynamic optimization timeline: execution with profiling -> analysis and optimization -> execution with prefetching -> hibernation -> deoptimization, and repeat.

Fast Hot Data Stream Detection
• Hot data streams, e.g. v = abacadae and w = bbghij, are compiled into a single deterministic finite state machine.
• Each state records how far into each hot stream the recent accesses have matched (e.g. {[v,2],[w,1]}); once the prefix (detection head) of a stream completes, the rest of the stream is prefetched.

Prefetch Generation
• For the hot data stream v = abacadae, the checks inserted at the two head references are:

    a.pc:  if (accessing a.addr) {
             if (v.seen == 2) {
               v.seen = 3;
               prefetch c.addr, a.addr, d.addr, e.addr;
             } else {
               v.seen = 1;
             }
           } else {
             v.seen = 0;
           }

    b.pc:  if (accessing b.addr)
             if (v.seen == 1) v.seen = 2; else v.seen = 0;
           else
             v.seen = 0;

Profiling and Analysis Overhead
• [Figure: % overhead broken into dispatch checks, profiling, and analysis for vpr, twolf, parser, mcf, and other benchmarks]
• 3-7% total overhead

Putting It All Together
• Dynamic Vulcan injection -> SEQUITUR grammar analysis -> hot data streams -> finite state machine -> execution with prefetching -> hibernation -> deoptimization, on a repeating timeline.
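SEQUITUR builds a complete hierarchical grammar; as a much simpler stand-in (my sketch, not the tutorial's algorithm), the "hot" repeated subsequences of an abstracted trace can already be seen by counting every fixed-length window and keeping the ones that cover a large share of the references. The trace is the "exploitable locality" example from above; the window length and coverage threshold are arbitrary choices.

    // Simplified hot-subsequence finder: count every window of length k in an
    // abstracted reference trace and report windows covering >= 25% of references.
    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        std::string trace = "abcabcdefabcgabcfabcdabc";   // exploitable-locality example trace
        const std::size_t k = 3;                          // stream length to look for
        std::map<std::string, long> count;
        for (std::size_t i = 0; i + k <= trace.size(); ++i)
            ++count[trace.substr(i, k)];
        for (const auto& [seq, c] : count) {
            double coverage = double(c) * k / trace.size();   // fraction of references covered
            if (coverage >= 0.25)
                std::printf("hot stream \"%s\": %ld occurrences, ~%.0f%% coverage\n",
                            seq.c_str(), c, 100 * coverage);
        }
        return 0;
    }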
Structure Layout Transformations
• [Diagram: SEQUITUR grammar (omitting terminals) with reverse-postorder numbering of the parse tree; hot and cold field uses weighted by word length; the hot data stream abcabc is generated by non-terminal B]
• Field reordering: place the hot fields of a structure next to each other

    /* before */  struct node { int key; ...; struct node* next; };
    /* after  */  struct node { int key; struct node* next; ...; };

• Hot/cold splitting: move rarely used (cold) fields into a separate structure reached through a pointer

    struct node  { int key; struct node* next; struct cnode* cold; };
    struct cnode { ...; };

• 8-23% improvements in execution time

Dynamic Prefetching Evaluation
• [Figure: % overhead of analysis, no prefetching, sequential prefetching, and dynamic prefetching for mcf, parser, twolf, vpr, boxsim, and other benchmarks]
• 5-19% overall execution-time improvement

Conclusions (Stream Model)
• Hot data streams are a useful locality model
  – can handle irregular, pointer-based codes
  – can be efficiently computed online
  – a small number of streams account for 90% of data references
• The stream flow graph is a compact representation
  – a 10 GB trace becomes a ~100 KB SFG
  – it supports analysis of hot data stream interactions
• Stream-based optimizations appear promising
  – good performance improvements

Stride Model
Efficient Discovery of Regular Stride Patterns in Irregular Programs and Its Use in Data Prefetching
Reformatted from slides by Youfeng Wu, Programming Systems Research, Intel Labs

Memory Optimization Opportunity
• [Figure: cycle breakdown (data access, instruction access, taken branch, pipeline/back-end flush, fetch window, execution latency, issue limit, RSE) for the SPECint2000 benchmarks 164.gzip through 300.twolf with a 4 MB L3]

Loads Responsible for the Data Cycles
• Cache simulation is used to identify delinquent loads
  – the set of load instructions that account for 95% or more of the L1, L2, L3, DTC, and DTLB misses
• Some references are not simulated
  – library references: ~20%
  – additional references from speculation into more frequent blocks: ~8%

Concentration of Delinquent Loads
• [Figure: cumulative misses vs. fraction of load sites (0-10%) for L1, L2, and L3 cache misses, DTC misses, and DTLB misses]
• 1.2-3.6% of the executed load sites produce 95% of the misses

Motivations
• Data prefetching for regular programs is very effective
  – array references often have constant strides
• Automatic prefetching of delinquent loads in irregular programs is hard
  – future data addresses are difficult to determine automatically
  – blindly applying automatic prefetching actually slows down SPECint2000 on Itanium
• Opportunity
  – some delinquent loads have stride access patterns

Stride Profiling
• For each static load
  – collect the top N most frequently occurring strides and their frequencies [Calder-97]
  – count the number of times the strides change
• Example: addresses from one static load: 1, 3, 5, 7, 9, 11, 111, 211, 311, 411, 413, 415, 417
  – strides: 2, 2, 2, 2, 2, 100, 100, 100, 100, 2, 2, 2
  – stride profile: #1 stride = 2 with frequency 8; #2 stride = 100 with frequency 4
  – number of stride changes = 3
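A sketch of the per-load bookkeeping described above (mine; strideProf is the routine name used on the slides, everything else is invented, and the top-N truncation of [Calder-97] is omitted). Note that this straightforward counting reports 2 stride changes for the slide's example rather than the 3 the slide lists, so the tool's exact definition of a "change" evidently differs slightly.

    // Per-static-load stride profiling: stride frequencies and stride changes.
    #include <cstdio>
    #include <cstdint>
    #include <map>

    struct StrideProfile {
        bool      havePrev = false, haveStride = false;
        intptr_t  prevAddr = 0, prevStride = 0;
        long      strideChanges = 0;
        std::map<intptr_t, long> freq;            // stride -> occurrence count

        void record(intptr_t addr) {              // called by strideProf for one load site
            if (havePrev) {
                intptr_t stride = addr - prevAddr;
                ++freq[stride];
                if (haveStride && stride != prevStride) ++strideChanges;
                prevStride = stride; haveStride = true;
            }
            prevAddr = addr; havePrev = true;
        }
    };

    int main() {
        intptr_t addrs[] = {1, 3, 5, 7, 9, 11, 111, 211, 311, 411, 413, 415, 417};  // slide's example
        StrideProfile p;
        for (intptr_t a : addrs) p.record(a);
        for (const auto& [stride, n] : p.freq)
            std::printf("stride %ld seen %ld times\n", (long)stride, n);
        std::printf("stride changes: %ld\n", p.strideChanges);
        return 0;
    }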
Discover Stride Patterns
• Strong single stride (SSST)
  – one non-zero stride (e.g. 60) occurs very frequently
• Phased multi-stride (PMST)
  – multiple non-zero strides together occur frequently
  – the strides change infrequently

Prefetching for Multi-Stride Load
• Compute the stride at run time and prefetch ahead by it:

    prev_P = P
    while (P) {
      stride = P - prev_P; prev_P = P;
      prefetch(P + 2*stride)
      D = P->data        /* delinquent load */
      use(D)
      P = P->next
    }

Prefetching for Single-Stride Load
• Prefetch with a constant stride, e.g. 60 bytes (two iterations ahead):

    while (P) {
      prefetch(P + 120)
      D = P->data        /* delinquent load */
      use(D)
      P = P->next
    }

Key Observations for Efficient Profiling
• Stride patterns exist mostly for loads inside loops with high trip counts
  – targeting these loads can significantly reduce profiling overhead
• Stride profiles are statistically stable
  – sampling can be used

Sampling Methods
• Profile a small sample of the loads in high-trip-count loops; sampling takes place inside the strideProf routine
• Chunk sampling
  – after every N1 references are skipped, profile the next N2 references (N1 = 8 million, N2 = 2 million)
• Fine sampling of the references from a static load
  – inside each chunk, profile one after every F (e.g. 4) references
• Sampling ratio:

    N2/(N1 + N2) × 1/(1 + F) = 2/10 × 1/5 = 1/25 = 4%

Sampling Example
• Addresses seen by strideProf (assume the N2 references come from the same static load):
    {N1 refs skipped} 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, ... {N1 refs skipped} ...
• Addresses kept by chunk sampling: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, ...
• Addresses kept with fine sampling, F = 4: 2, _, _, _, 10, _, _, _, 18, _, ...
• The stride observed with both samplings is 8; the original stride is 8/F = 2.

Performance Gains
• Comparison against no prefetching; profiles are collected with the train input, execution uses the ref input
• [Figure: speedups (0.94-1.60) for 164.gzip through 300.twolf and the average, under the configurations edge-check, naïve-loop, naïve-all, sample-edge-check, sample-naïve-loop, and sample-naïve-all]

Other Stride Patterns
• We have also discovered
  – "self-stride": the difference between successive addresses of the same load
  – "cross-stride": the difference between the addresses of dependent but different loads, e.g. between the loads in x = p->brandInfo->value

Recent Extensions
• Other stride profiling methods
  – PMU-based sampling [Luk-04]
  – partial interpretation [Inagaki-03]
  – PMU + GC analysis [Adl-Tabatabai-04]
    • the PMU identifies delinquent loads
    • the GC identifies strides and maintains them via memory management
• Prefetching more delinquent loads [Inagaki-03] [Adl-Tabatabai-04]
  – prefetch "cross-stride" loads
  – create more stride patterns for delinquent loads

PMU-based Stride Profiling
• Sample load misses in two phases
  – skipping phases: 1 sample per 1000 misses
  – inspection phases: 1 sample per miss
• Example: A1, A2, A3, A4 are four consecutive sampled miss addresses of a load whose true stride is 48 bytes
  – A2 - A1 = 5 × 48 = 240, A3 - A2 = 7 × 48 = 336, A4 - A3 = 3 × 48 = 144
• Use GCD to recover the stride from the miss addresses
  – GCD(A2 - A1, A3 - A2) = GCD(240, 336) = 48
  – GCD(A3 - A2, A4 - A3) = GCD(336, 144) = 48
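The GCD trick on the PMU slide is short enough to write out (my sketch; the sample addresses are invented but reproduce the slide's deltas of 240, 336, and 144 bytes): each delta between consecutively sampled miss addresses is an unknown multiple of the true stride, so the running GCD of the deltas recovers it.

    // GCD-based stride recovery from sparsely sampled miss addresses.
    #include <cstdio>
    #include <numeric>
    #include <vector>

    long estimateStride(const std::vector<long>& missAddrs) {
        long g = 0;
        for (std::size_t i = 1; i < missAddrs.size(); ++i)
            g = std::gcd(g, missAddrs[i] - missAddrs[i - 1]);   // gcd(0, x) == x
        return g;
    }

    int main() {
        // A1..A4 with the slide's deltas 240, 336, 144 (true stride 48 bytes).
        std::vector<long> samples = {1000, 1240, 1576, 1720};
        std::printf("estimated stride: %ld bytes\n", estimateStride(samples));
        return 0;
    }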
Partial Interpretation in JIT
• Before translating a method
  – interpret each loop for a few iterations (e.g. 20)
  – identify both self-strides and cross-strides
• Translate the method with prefetching
  – can achieve better performance than prefetching self-strides alone
• Self-stride prefetch: in a loop containing tmp = load a[i], insert prefetch a[i+K]
• Cross-stride prefetch: for the dependent loads tmp1 = load tmp->field1 and tmp2 = load tmp1->field2, insert

    tmp = spec_load a[i+K]
    prefetch(tmp + stride1)
    prefetch(tmp + stride2)

Sampling Cache Misses
• The hardware PMU delivers miss samples with low overhead
  – the IP of the load causing the miss
  – the effective address of the miss
  – the miss latency
• Pipeline: PMU samples -> metadata -> delinquent paths -> strides -> inject prefetches

Discovering Delinquent Paths
• Delinquent paths abstract high-latency linked data structures
  – they approximate the traversals causing the latency
• Discover the edges along the paths
  – piggyback on the GC mark-phase traversal
  – characterize the edges the GC encounters
  – glean global properties about the edges
  – use the characterization to estimate the paths
• Combine the edges into paths

Finding Strides Along Path
• Process the set of delinquent objects
• If an object's type matches the base of a delinquent path
  – traverse the path
  – summarize the strides between the base and the objects along the path
• Useful strides exist even without proactive placement by the GC
• Proactive placement by the GC enables more loads to be stride-prefetched

Maintaining Strides
• The GC must be stride-aware
• Allocation-order placement
  – frontier-pointer allocation creates useful strides
  – a sliding compaction algorithm maintains allocation-order strides
  – various GC algorithms break or alter strides
• This expands the GC's role into a memory-hierarchy controller and optimizer

Inserting Prefetches
• The IPs associated with the start of a path are prefetch-point candidates
• Screening a candidate point
  – it must contribute a significant number of the base type's misses
  – use JIT analysis to locate the earliest availability of the pointer used to load the base
• Finally, recompile with prefetches, e.g. for p -> field_of_p -> value with strides stride1 and stride2:

    tmp = load [p]
    prefetch(p + stride1)
    prefetch(p + stride2)

• A speedup of 11-14% for SPEC JBB2000

Summary (Stride Model)
• Stride profiling techniques can discover stride patterns efficiently
  – the instrumentation-based approach is only 17% slower than frequency profiling alone, and is transparent to users of a compiler that already performs edge-frequency profiling
  – PMU-based and interpretation-based techniques can also discover stride patterns
• Significant performance improvement
  – average 7% for the SPECint2000 suite from self-stride prefetching: 1.59x speedup for 181.mcf, 1.14x for 254.gap, and 1.08x for 197.parser
  – the performance gain is stable across profiling data sets
• Cross-stride prefetching can contribute additional performance gains
  – demonstrated in a Java JIT environment

Summary
• Dependence
  – dependence; temporal, spatial, group, and self reuse
  – dependence analysis
  – loop permutation, tiling, prefetching
• Stream
  – hot data streams
  – off-line and run-time measurement
  – structure splitting, prefetching
• Distance
  – reuse distance and its approximate measurement
  – reuse-behavior prediction across program inputs
  – miss-rate prediction across program inputs
• Stride
  – stride patterns
  – profiling and sampling
  – prefetching