MSP 2004

Programmer Specified Pointer Independence
David Koes, Mihai Budiu, Girish Venkataramani, Seth Copen Goldstein
Memory Systems Performance Workshop 2004
© David Ryan Koes 2004

Outline
• Motivation
• #pragma independent
• Automated Annotation
• Evaluation
• Conclusion

Problem
Potentially aliasing pointers inhibit compiler optimization.
Fully determining pointer aliasing may be infeasible or expensive.
How to get the benefit without paying the cost?

Memory Dependencies
Memory dependencies inhibit optimization
• introduce edges into the dependence graph
• limit parallelization
• inhibit code motion
  – instruction scheduling
  – loop invariant code motion
  – partial redundancy elimination
  – register promotion
Breaking memory dependencies is difficult
• compile-time analysis is infeasible or expensive
• run-time analysis is limited to a local window

Examples
    while(len--) {
        *p++ = *q++;
    }
There is a real data dependence between the load and store within a single iteration. Unroll the loop to exploit parallelism.

Itanium assembly from gcc:
    .L26:
        mov r24 = r33
        mov r17 = r32
        adds r22 = 8, r33
        adds r19 = 8, r32
        adds r20 = 12, r33
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r24], 4
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r17] = r14, 4
        ld4 r23 = [r24]
        ;;
        st4 [r17] = r23
        ld4 r18 = [r22]
        ;;
        st4 [r19] = r18
        ld4 r16 = [r20]
        ;;
        st4 [r21] = r16
        br.cloop .L26
        ;;

Itanium assembly from gcc, without memory dependence:
    .L26:
        mov r18 = r33
        mov r23 = r32
        adds r25 = 8, r33
        adds r24 = 12, r33
        adds r22 = 8, r32
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r18], 4
        ld4 r19 = [r25]
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r23] = r14, 4
        ld4 r16 = [r18]
        ld4 r20 = [r24]
        ;;
        .mmb
        st4 [r23] = r16
        st4 [r22] = r19
        st4 [r21] = r20
        br.cloop .L26
        ;;

Examples
Original loop:
    for(i = 0; i < len; i++) {
        ...
        ... = *q;
        ...
        *p = ...
    }
After loop invariant code motion of the load:
    t0 = *q;            /* if loop will be executed */
    for(i = 0; i < len; i++) {
        ...
        ... = t0;
        ...
        *p = ...
    }
After also moving the store out of the loop:
    t0 = *q;
    for(i = 0; i < len; i++) {
        ...
        ... = t0;
        ...
        t1 = ...
    }
    *p = t1;            /* if loop was executed */
Hardware can’t do this.

Pointer Analysis
Memory disambiguation is important
• hardware can’t do everything
• so have the compiler figure it out...

Easy:
    int p[10];
    foo()
    {
        int q[10];
        ...
    }
Harder; needs precise dataflow analysis:
    foo()
    {
        int *p, *q;
        int a,b;
        if(...) {
            p = &a;
            q = &b;
        } else {
            p = &b;
            q = &a;
        }
        ...
    }
Requires inter-procedural information:
    foo(int *p, int *q)
    {
        ...
    }

Inter-procedural Pointer Analysis
• Just apply the same techniques as used for intraprocedural analysis
  • may not be possible – gcc -c foo.c
  • may not be feasible – n² analysis on the source code of Microsoft Office?
• Use a less precise analysis
  • still might not be possible (separate compilation, libraries)
  • still takes time (every time you compile, or at least link)
  • less precise → less optimization

Alternative: Have Programmer Do It
Programmer annotates source code
• informs compiler of pointer relationships
Previous Work
• ANSI C99 restrict keyword
  – difficult for compiler and programmer to reason about
  – non-local semantics
• MIPSpro #pragma ivdep
  – breaks loop carried dependence in inner loop

#pragma independent
Syntax
    #pragma independent ptr1 ptr2
Example
    int x[100];
    int y;

    void foo(int *a, int *b)
    {
    #pragma independent a b
        int arr[50];
        …
    }
[Diagram: memory objects malloc_site_1, malloc_site_2, x, y, arr]
The pointers are guaranteed to always point to different objects.

Examples
    void f(int len, int * p, int * q)
    {
    #pragma independent p q
        while (len--) *p++ = *q++;
    }

    void example(int *a, int *b, int *c)
    {
    #pragma independent a b
    #pragma independent a c
        (*b)++;
        *a = *b;
        *a = *a + *c;
    }
The pragmas allow the compiler to eliminate a store to *a.

#pragma independent
Advantages
• more flexible and powerful than restrict
• relationships between pointers are explicit
• easy to reason about – affects only the listed pointers
• easy to implement in a compiler – fewer than 100 lines of code
Possible Disadvantage
• could take the programmer a lot of time to annotate existing source

Automated Annotation Toolflow
[Diagram]
    *.c *.h → compiler → executable with runtime checks; candidate pointer pairs, static scores
    executable with runtime checks, inputs → execution → invalid pointer pairs, execution frequencies
    candidate pointer pairs, static scores, invalid pointer pairs, execution frequencies → script → pragma annotations ranked by score
    pragma annotations ranked by score → programmer → source code with verified pragmas
    source code with verified pragmas → pragma aware compiler → faster executable

Compiler finds interesting pointer pairs
• pairs which inhibit optimization
• pairs whose aliasing is unknown
Inserts profiling code and checks

Instrumented executable is run on input
• records pointers which conflict
• counts number of pointer uses

Script combines static and dynamic info
• eliminates conflicting pairs
• assigns a score to each pair

Programmer verifies pointer pairs
• can verify high scoring pairs only

Example Output
    void summer(int *p, int *q, int n, int *result)
    {
    #pragma independent p q /* score: 1100 */
    #pragma independent p result /* score: 15 */
    #pragma independent q result /* score: 12 */
        int i, sum = 0;
        for(i = 0; i < n; i++) {
            *p += *q;
            sum += *q;
        }
        *result = sum;
    }

Sample Score Distribution
[Figure: number of pairs (y axis, 0 to 400) against percentile of maximum score (x axis, 0% to 90%); series: Dynamic Score, Static Score]

Targets & Benchmarks
Targets
• Itanium
  • EPIC/VLIW architecture
  • instruction scheduling important for good performance
• ASH (Application Specific Hardware)
  • can take full advantage of parallelism
Benchmarks
• Mediabench
  • small, multimedia applications
  • can’t time accurately on Itanium
• Spec95, Spec2000
  • general purpose integer
  • longer running – sometimes days for ASH simulation

Compilers
• gcc
  • not very sophisticated optimizations
  • -funroll-loops -O2
• CASH
  • more sophisticated optimizations
  • memory dependencies are first class objects
    – token edge
    – pragma independent removes edge

Questions
Do we find a reasonable number of potential annotations?
• Yes!
Do the annotations result in faster code?
• Yes!
Does our scoring mechanism find the pointer pairs with the biggest impact on performance?
• Yes!
How much time does the programmer have to spend verifying pragmas?
• Not a lot!

Annotations Found
[Figure: number of pointer pairs (y axis) per benchmark (x axis; Mediabench, Spec95 and Spec2000 programs); legend: unchecked, conflict, no conflict, useful]

Do the annotations result in faster code?
Itanium Speedup
[Figure]
Of 19 Spec benchmarks, these were the only ones to demonstrate measurable speedup.

Do the annotations result in faster code?
CASH Speedup
[Figure]

Does our scoring mechanism work?
[Figure: mpeg2_e; x axis: number of highest scoring pragmas, up to all (68)]

How much time does the programmer have to spend?
[Figure]
Verified Speedup
[Figure]

Conclusions
• We’ve performed a limit study of pointer analysis
  • gcc doesn’t fully exploit the results of pointer analysis
  • CASH and ASH can fully exploit parallelism
• Programmer specified annotations are effective
  • faster and more flexible than inter-procedural analysis
• Annotations can be automatically generated
  • automatic score successfully focuses the programmer’s attention
  • manual verification does not take long

ANSI C99 restrict keyword
"An object that is accessed through a restrict-qualified pointer has a special association with that pointer. This association, defined in 6.7.3.1 below, requires that all accesses to that object use, directly or indirectly, the value of that particular pointer. The intended use of the restrict qualifier (like the register storage class) is to promote optimization, and deleting all instances of the qualifier from all preprocessing translation units composing a conforming program does not change its meaning (i.e., observable behavior)."
ISO/IEC 9899, Second edition, 1999-12-01, 6.7.3-7

restrict Example
    void f(int len, int * restrict p, int * restrict q)
    {
        while (len--) *p++ = *q++;
    }
restrict tells the compiler that p and q refer to different objects, enabling optimizations.

Problems with restrict
C99 section 6.7.3.1 (formal definition of restrict)

gcc’s restrict Implementation
• No two restricted pointers can alias
• A restricted pointer and an unrestricted pointer may alias
This definition is intuitive for both the programmer and the compiler, but it is not the C99 definition!