Programmer Specified Pointer Independence MSP 2004 David Koes

MSP 2004
Programmer Specified Pointer Independence
David Koes
Mihai Budiu
Girish Venkataramani
Seth Copen Goldstein
Memory Systems Performance Workshop 2004
© David Ryan Koes 2004
Outline
• Motivation
• #pragma independent
• Automated Annotation
• Evaluation
• Conclusion
Problem
Potentially aliasing pointers inhibit compiler optimization.
Fully determining pointer aliasing may be infeasible or expensive.
How to get the benefit without paying the cost?
Memory Dependencies
Memory dependencies inhibit optimization
• Introduce edges into dependence graph
• Limit parallelization
• Inhibit code motion
– instruction scheduling
– loop invariant code motion
– partial redundancy elimination
– register promotion
Breaking memory dependencies is difficult
• compile-time analysis infeasible or expensive
• run-time analysis limited to local window
Examples
while(len--)
{
    *p++ = *q++;
}
There is a real data dependence between the load and store within a single iteration.
Unroll loop to exploit parallelism.

Itanium assembly from gcc:
.L26:
    mov r24 = r33
    mov r17 = r32
    adds r22 = 8, r33
    adds r19 = 8, r32
    adds r20 = 12, r33
    adds r21 = 12, r32
    ;;
    ld4 r14 = [r24], 4
    adds r33 = 16, r33
    adds r32 = 16, r32
    ;;
    st4 [r17] = r14, 4
    ld4 r23 = [r24]
    ;;
    st4 [r17] = r23
    ld4 r18 = [r22]
    ;;
    st4 [r19] = r18
    ld4 r16 = [r20]
    ;;
    st4 [r21] = r16
    br.cloop .L26
    ;;

Itanium assembly from gcc without memory dependence:
.L26:
    mov r18 = r33
    mov r23 = r32
    adds r25 = 8, r33
    adds r24 = 12, r33
    adds r22 = 8, r32
    adds r21 = 12, r32
    ;;
    ld4 r14 = [r18], 4
    ld4 r19 = [r25]
    adds r33 = 16, r33
    adds r32 = 16, r32
    ;;
    st4 [r23] = r14, 4
    ld4 r16 = [r18]
    ld4 r20 = [r24]
    ;;
    .mmb
    st4 [r23] = r16
    st4 [r22] = r19
    st4 [r21] = r20
    br.cloop .L26
    ;;
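The reordering in the second listing is only safe when p and q never overlap. A minimal sketch of why, in C: the function names copy_in_order and copy_loads_first are hypothetical, and the second function is a simplified hand-written stand-in for a schedule that issues loads ahead of earlier stores, not the exact schedule gcc produced.

```c
#include <stddef.h>

/* The loop from the slide: every store completes before the next load. */
void copy_in_order(size_t len, int *p, const int *q)
{
    while (len--)
        *p++ = *q++;
}

/* Simplified stand-in for the reordered schedule (an assumption, not gcc's
   output): the loads for a block of four elements are issued before any of
   that block's stores. Equivalent to copy_in_order only when the memory
   reached through p and q does not overlap. */
void copy_loads_first(size_t len, int *p, const int *q)
{
    while (len >= 4) {
        int t0 = q[0], t1 = q[1], t2 = q[2], t3 = q[3];
        p[0] = t0;
        p[1] = t1;
        p[2] = t2;
        p[3] = t3;
        p += 4;
        q += 4;
        len -= 4;
    }
    while (len--)   /* leftover elements, copied in order */
        *p++ = *q++;
}
```

Called on disjoint arrays the two functions produce the same result. Called with p one element ahead of q in the same array, the in-order loop smears the first element across the buffer while the loads-first version shifts the buffer by one, which is why the compiler must assume the dependence unless told otherwise.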
Examples
for(i = 0; i < len; i++)
{
    ...
    ... = *q;
    ...
    *p = ...
}

loop invariant code motion
t0 = *q; /* if loop will be executed */
for(i = 0; i < len; i++)
{
    ...
    ... = t0;
    ...
    *p = ...
}

t0 = *q;
for(i = 0; i < len; i++)
{
    ...
    ... = t0;
    ...
    t1 = ...
}
*p = t1; /* if loop was executed */

Hardware can’t do this
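A small runnable sketch of the loop invariant code motion case, assuming a concrete loop body that the slide leaves elided: the names sum_reload and sum_hoisted, and the choice of summing *q and storing i through p, are illustrative assumptions only.

```c
/* Original shape: *q is reloaded on every iteration because the store
   through p might change it. */
int sum_reload(int len, int *p, const int *q)
{
    int sum = 0;
    for (int i = 0; i < len; i++) {
        sum += *q;   /* ... = *q; */
        *p = i;      /* *p = ...  */
    }
    return sum;
}

/* After hoisting the load: valid only if p and q are independent. */
int sum_hoisted(int len, int *p, const int *q)
{
    int sum = 0;
    if (len > 0) {           /* "if loop will be executed" */
        int t0 = *q;         /* single load, hoisted out of the loop */
        for (int i = 0; i < len; i++) {
            sum += t0;
            *p = i;
        }
    }
    return sum;
}
```

With distinct objects behind p and q both functions return the same value. When both pointers refer to the same int, the results diverge, so the transformation needs alias information that neither the hardware nor a purely local analysis can supply.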
Pointer Analysis
Memory Disambiguation is important
• hardware can’t do everything
• so have compiler figure it out...

int p[10];
foo()
{
    int q[10];
    ...
}
easy!

foo()
{
    int *p, *q;
    int a,b;
    if(...)
    {
        p = &a;
        q = &b;
    }
    else
    {
        p = &b;
        q = &a;
    }
    ...
}
harder.. need precise dataflow analysis

foo(int *p, int *q)
{
    ...
}
requires inter-procedural information
Inter-procedural Pointer Analysis
• Just apply same techniques as used for intraprocedural
• may not be possible
– gcc -c foo.c
• may not be feasible
– n² analysis on source code of Microsoft Office?
• Use less precise analysis
• still might not be possible (separate compilation, libraries)
• still takes time (every time you compile, or at least link)
• less precise ⇒ less optimization
Alternative: Have Programmer Do It
Programmer annotates source code
• informs compiler of pointer relationships
Previous Work
• ANSI C99 restrict keyword
– difficult for compiler and programmer to reason about
– non-local semantics
• MIPSpro #pragma ivdep
– breaks loop-carried dependences in inner loop
#pragma independent
Syntax
#pragma independent ptr1 ptr2
Example
int x[100];
int y;
void foo(int *a, int *b)
{
#pragma independent a b
    int arr[50];
    …
}
pointers guaranteed to always point to different objects
[Diagram: memory objects x, y, arr, malloc_site_1, malloc_site_2]
Examples
void f(int len, int * p, int * q)
{
#pragma independent p q
    while (len--)
        *p++ = *q++;
}
void example(int *a, int *b, int *c)
{
#pragma independent a b
#pragma independent a c
    (*b)++;
    *a = *b;
    *a = *a + *c;
}
pragmas allow compiler to eliminate a store to *a
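To make the eliminated store concrete, here is a hand-written sketch of code a compiler could plausibly emit for example once a is known to be independent of b and c. The name example_one_store is hypothetical, and the pragmas are left out of the runnable code because an ordinary compiler does not recognize them.

```c
/* The function from the slide (pragmas omitted). */
void example(int *a, int *b, int *c)
{
    (*b)++;
    *a = *b;        /* first store to *a */
    *a = *a + *c;   /* second store to *a */
}

/* Hand-written equivalent under the assumption that a is independent of
   both b and c: the first store to *a is dead and is dropped. */
void example_one_store(int *a, int *b, int *c)
{
    int tb = *b + 1;
    *b = tb;         /* b and c may still alias, so store before loading *c */
    *a = tb + *c;    /* the only store to *a */
}
```

No pragma relates b and c, so the sketch still stores to *b before loading *c. If a and c did refer to the same object, the first store to *a would be observable through *c and could not be removed.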
#pragma independent
Advantages
• more flexible and powerful than restrict
• relationships between pointers explicit
• easy to reason about
– affects only listed pointers
• easy to implement in compiler
– fewer than 100 lines of code
Possible Disadvantage
• could take programmer a lot of time to annotate existing source
Automated Annotation Toolflow
[Toolflow diagram: *.c *.h → compiler → executable with runtime checks → execution on inputs → script → pragma annotations ranked by score → programmer → source code with verified pragmas → pragma aware compiler → faster executable. The compiler also sends candidate pointer pairs and static scores to the script; execution sends it invalid pointer pairs and execution frequencies.]
Compiler finds interesting pointer pairs
• pairs which inhibit optimization
• pairs whose aliasing is unknown
Inserts profiling code and checks
Automated Annotation Toolflow
Instrumented executable run on input
• records pointers which conflict
• counts number of pointer uses
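The slides do not show what the inserted checks look like. One way such instrumentation could work, sketched under loud assumptions: every name here (ptr_range, pair_profile, record_use, pair_conflicts, f_instrumented) is hypothetical, and treating a pair as conflicting when the address ranges touched through the two pointers overlap is a conservative simplification, not necessarily the check the authors' tool performs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of the bytes [lo, hi) touched through one pointer. */
struct ptr_range {
    uintptr_t lo, hi;
    int used;
};

/* Hypothetical per-pair profile: touched ranges plus a use count. */
struct pair_profile {
    struct ptr_range first, second;
    unsigned long uses;
};

/* Grow a range so that it covers one more access. */
static void touch(struct ptr_range *r, const void *addr, size_t size)
{
    uintptr_t lo = (uintptr_t)addr, hi = lo + size;
    if (!r->used) {
        r->lo = lo;
        r->hi = hi;
        r->used = 1;
        return;
    }
    if (lo < r->lo) r->lo = lo;
    if (hi > r->hi) r->hi = hi;
}

/* Called by instrumented code each time the pair is used together. */
void record_use(struct pair_profile *pr, const void *a, size_t size_a,
                const void *b, size_t size_b)
{
    touch(&pr->first, a, size_a);
    touch(&pr->second, b, size_b);
    pr->uses++;
}

/* Returns 1 if the two touched ranges overlap, 0 otherwise. */
int pair_conflicts(const struct pair_profile *pr)
{
    return pr->first.used && pr->second.used &&
           pr->first.lo < pr->second.hi && pr->second.lo < pr->first.hi;
}

/* The copy loop with a check inserted before each access. */
void f_instrumented(int len, int *p, int *q, struct pair_profile *pr)
{
    while (len--) {
        record_use(pr, p, sizeof *p, q, sizeof *q);
        *p++ = *q++;
    }
}
```

Tracking ranges rather than comparing the two addresses at each use matters for the copy loop: with p one element ahead of q the addresses never coincide at any single check, yet the regions reached through the two pointers do overlap.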
Automated Annotation Toolflow
Script combines static and dynamic info
• eliminates conflicting pairs
• assigns score to each pair
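The scoring formula is not given on the slides. A sketch of the combining step, assuming purely for illustration that a pair's score is its static score multiplied by its execution frequency; the candidate struct, its fields, and rank_candidates are all hypothetical names, and C is used here only to stay consistent with the other examples.

```c
#include <stdlib.h>

/* Hypothetical record for one candidate pointer pair. */
struct candidate {
    const char *first, *second;  /* pointer names */
    double static_score;         /* reported by the compiler */
    unsigned long frequency;     /* execution count from the profiling run */
    int conflict;                /* nonzero if the run observed a conflict */
    double score;                /* filled in by rank_candidates */
};

/* qsort comparator: higher scores sort first. */
static int by_score_desc(const void *x, const void *y)
{
    const struct candidate *a = x, *b = y;
    return (a->score < b->score) - (a->score > b->score);
}

/* Drops conflicting pairs, scores the survivors (assumed formula:
   static score times frequency), and sorts them best first.
   Returns the number of pairs kept at the front of the array. */
size_t rank_candidates(struct candidate *c, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (c[i].conflict)
            continue;
        c[kept] = c[i];
        c[kept].score = c[kept].static_score * (double)c[kept].frequency;
        kept++;
    }
    qsort(c, kept, sizeof *c, by_score_desc);
    return kept;
}
```

The surviving entries, in order, correspond to the "pragma annotations ranked by score" that the toolflow hands to the programmer.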
Automated Annotation Toolflow
Programmer verifies pointer pairs
• can verify high scoring pairs only
Example Output
void summer(int *p, int *q, int n, int *result)
{
#pragma independent p q /* score: 1100 */
#pragma independent p result /* score: 15 */
#pragma independent q result /* score: 12 */
    int i, sum = 0;
    for(i = 0; i < n; i++)
    {
        *p += *q;
        sum += *q;
    }
    *result = sum;
}
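A hand-written sketch of what the highest scoring pragma could buy in summer, assuming the compiler uses it to hoist the load of *q and keep *p in a register. The name summer_promoted is hypothetical, and the pragmas are shown only as a comment because an ordinary compiler does not recognize them.

```c
/* The function from the slide; the generated annotations were
   "#pragma independent p q", "p result", and "q result". */
void summer(int *p, int *q, int n, int *result)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        *p += *q;
        sum += *q;
    }
    *result = sum;
}

/* Hand-written equivalent that relies only on p and q being independent:
   *q is loaded once and *p lives in a local until the loop ends. */
void summer_promoted(int *p, int *q, int n, int *result)
{
    int sum = 0;
    if (n > 0) {
        int tq = *q;   /* hoisted load */
        int tp = *p;   /* promoted to a register */
        for (int i = 0; i < n; i++) {
            tp += tq;
            sum += tq;
        }
        *p = tp;       /* single store after the loop */
    }
    *result = sum;
}
```

This particular rewrite keeps the store to *result after the store to *p, so it does not depend on the two pairs involving result, which fits with those pairs receiving much lower scores in the example output.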
Sample Score Distribution
[Figure: number of pairs (y-axis, 0 to 400) against percentile of maximum score (x-axis, 0% to 90%), with separate series for dynamic score and static score]
Targets & Benchmarks
Targets
• Itanium
• EPIC/VLIW architecture
• instruction scheduling important for good performance
• ASH (Application Specific Hardware)
• can take full advantage of parallelism
Benchmarks
• Mediabench
• small, multimedia applications
• can’t time accurately on Itanium
• Spec95, Spec2000
• general purpose integer
• longer running
– sometimes days for ASH simulation
Compilers
Compilers
• gcc
• not very sophisticated optimizations
• -funroll-loops -O2
• CASH
• more sophisticated optimizations
• memory dependencies are first class objects
– token edge
– pragma independent removes edge
Questions
Do we find a reasonable number of potential annotations?
• Yes!
Do the annotations result in faster code?
• Yes!
Does our scoring mechanism find the pointer pairs with the biggest impact on performance?
• Yes!
How much time does the programmer have to spend verifying pragmas?
• Not a lot!
Annotations Found
[Figure: bar chart of the number of pointer pairs found for each benchmark, broken down into unchecked, conflict, no conflict, and useful; y-axis: Number of pointer pairs, x-axis: Benchmark]
Do the annotations result in faster code?
Itanium Speedup
Of 19 Spec benchmarks, these were the only ones to demonstrate measurable speedup
Do the annotations result in faster code?
CASH Speedup
Does our scoring mechanism work?
[Figure: mpeg2_e; x-axis: Number of highest scoring pragmas, up to all (68)]
How much time does the programmer have to spend?
Verified Speedup
Conclusions
• We’ve performed a limit study of pointer analysis
• gcc doesn’t fully exploit the results of pointer analysis
• CASH and ASH can fully exploit parallelism
• Programmer specified annotations are effective
• faster and more flexible than inter-procedural analysis
• Annotations can be automatically generated
• automatic score successfully focuses programmer’s attention
• manual verification does not take long
ANSI C99 restrict keyword
An object that is accessed through a restrict-qualified pointer
has a special association with that pointer. This association,
defined in 6.7.3.1 below, requires that all accesses to that object
use, directly or indirectly, the value of that particular pointer.
The intended use of the restrict qualifier (like the register
storage class) is to promote optimization, and deleting all
instances of the qualifier from all preprocessing translation units
composing a conforming program does not change its meaning
(i.e., observable behavior).
ISO/IEC 9899, Second edition, 1999-12-01, 6.7.3-7
restrict Example
void f(int len, int * restrict p, int * restrict q)
{
while (len--)
*p++ = *q++;
}
restrict tells the compiler that p and q refer
to different objects, enabling optimizations
Problems with restrict
6.7.3.1
gcc’s restrict Implementation
• No two restricted pointers can alias
• A restricted pointer and an unrestricted pointer may alias
This definition is intuitive for both the programmer and compiler
But not the C99 definition!