ppt slides - Microsoft Research

Relational Verification to SIMD Loop Synthesis
Mark Marron – IMDEA & Microsoft Research
Sumit Gulwani – Microsoft Research
Gilles Barthe, Juan M. Crespo, Cesar Kunz – IMDEA
General SIMD Compilation
Compilers struggle to utilize SIMD operations in general purpose code
◦ Text processing, web browser, compiler, etc.
◦ Standard library code (C++ STL, .Net BCL) they utilize
Challenges
◦ Data structure layouts of composite types
◦ Complex data driven control flow
◦ Wide ranging code restructuring is often needed
We know most of the needed “tricks” but:
◦ Time and implementation effort too large to identify and implement all of them
◦ “An Evaluation of Vectorizing Compilers” PACT ‘11
Example (Exists Function)
typedef struct { int tag; int score; } widget;
int exists(widget* vals, int len, int t, int s)
{
  for(int i = 0; i < len; ++i)
  {
    int tagok = vals[i].tag == t;
    int scoreok = vals[i].score > s;
    int andok = tagok & scoreok;
    if(andok) return 1;
  }
  return 0;
}
SIMD Example (Exists Function)
…
for(; i < (len - 3); i += 4)
{
  m128i blck1 = load_128(vals, i);      // [t[i], s[i], t[i+1], s[i+1]]
  m128i blck2 = load_128(vals, i + 4);  // [t[i+2], s[i+2], t[i+3], s[i+3]]
  m128i tagvs = shuffle_i32(blck1, blck2, ORDER(0, 2, 0, 2));    // [t[i], t[i+1], t[i+2], t[i+3]]
  m128i scorevs = shuffle_i32(blck1, blck2, ORDER(1, 3, 1, 3));  // [s[i], s[i+1], s[i+2], s[i+3]]
  m128i cmptag = cmpeq_i32(vectv, tagvs);      // [t[i]==t ? 0xF…F : 0x0, …, t[i+3]==t ? 0xF…F : 0x0]
  m128i cmpscore = cmpgt_i32(vecsv, scorevs);  // [s[i]>s ? 0xF…F : 0x0, …, s[i+3]>s ? 0xF…F : 0x0]
  m128i cmpr = and_i128(cmptag, cmpscore);     // [cmptag0 & cmpscore0, …, cmptag3 & cmpscore3]
  int match = !allzeros(cmpr);                 // (cmpr0!=0 | cmpr1!=0 | cmpr2!=0 | cmpr3!=0)
  if (match) return 1;
}
…
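The slide code above uses pseudo-intrinsics. As a rough sketch of how it might map onto real SSE2 intrinsics, the version below uses `_mm_loadu_si128`, `_mm_shuffle_ps` (on a float view of the integer data), `_mm_cmpeq_epi32`, `_mm_cmpgt_epi32`, `_mm_and_si128` and `_mm_movemask_epi8`. The choice of these intrinsics, the function names `exists_scalar` and `exists_simd`, the scalar tail loop, and the `__SSE2__` fallback are assumptions made for illustration, not the synthesizer's actual output.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef struct { int tag; int score; } widget;

/* Scalar reference: same behavior as the original exists loop. */
int exists_scalar(const widget* vals, int len, int t, int s)
{
  for (int i = 0; i < len; ++i)
    if ((vals[i].tag == t) & (vals[i].score > s)) return 1;
  return 0;
}

/* Hypothetical SSE2 rendering of the slide's loop: 4 widgets per iteration. */
int exists_simd(const widget* vals, int len, int t, int s)
{
  int i = 0;
#if defined(__SSE2__)
  const __m128i vectv = _mm_set1_epi32(t);  /* [t, t, t, t] */
  const __m128i vecsv = _mm_set1_epi32(s);  /* [s, s, s, s] */
  for (; i < len - 3; i += 4)
  {
    /* Each 128-bit load covers two widgets: [tag, score, tag, score]. */
    __m128 blck1 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i*)(vals + i)));
    __m128 blck2 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i*)(vals + i + 2)));
    /* Pick lanes 0 and 2 (tags) or 1 and 3 (scores) from each block. */
    __m128i tagvs   = _mm_castps_si128(_mm_shuffle_ps(blck1, blck2, _MM_SHUFFLE(2, 0, 2, 0)));
    __m128i scorevs = _mm_castps_si128(_mm_shuffle_ps(blck1, blck2, _MM_SHUFFLE(3, 1, 3, 1)));
    __m128i cmptag   = _mm_cmpeq_epi32(tagvs, vectv);    /* tag == t per lane */
    __m128i cmpscore = _mm_cmpgt_epi32(scorevs, vecsv);  /* score > s per lane */
    __m128i cmpr     = _mm_and_si128(cmptag, cmpscore);
    if (_mm_movemask_epi8(cmpr) != 0) return 1;          /* some lane matched */
  }
#endif
  /* Scalar tail (or the whole loop when SSE2 is unavailable). */
  for (; i < len; ++i)
    if ((vals[i].tag == t) & (vals[i].score > s)) return 1;
  return 0;
}
```

Calling `exists_simd(vals, len, t, s)` is intended to return the same result as `exists_scalar` for any `len >= 0`; the tail loop covers the last `len % 4` elements that the vector loop skips.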
Performance Impact
[Figure: Exists Speedup. Y-axis: Speedup (0.5 to 2.5); x-axis: Array Size (4 to 2048).]
Overview of Approach
Deductive Rewriting of program source to:
◦ Identify high-level structures of interest
◦ Rewrite to expose latent parallelism (split, unroll, etc.) and straighten hot-paths
Relational Verification techniques used to:
◦ Construct the needed synthesis conditions (for code involving loops!)
◦ Produce proof for semantic equivalence of input and result code
Inductive Synthesis of SIMD program fragments to:
◦ Identify the best SIMD realizations of the synthesis conditions
◦ Produce proofs of correctness with respect to the synthesis conditions
Methodology more general than just SIMD Loops!
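To make the "split, unroll" rewriting step concrete, here is a minimal sketch of one such restructuring, applied to a counting loop. The function names `count_if_gt` and `count_if_gt_unrolled` and the choice of predicate are hypothetical; the point is only the shape of the result: a 4-wide main loop with independent accumulators, followed by a scalar remainder loop.

```c
/* Original loop: count the elements greater than s. */
int count_if_gt(const int* a, int len, int s)
{
  int n = 0;
  for (int i = 0; i < len; ++i) n += (a[i] > s);
  return n;
}

/* Restructured loop: split into a main loop unrolled by 4 and a remainder.
 * The four accumulators carry no dependency on each other, which is the
 * latent parallelism a 4-lane SIMD body could exploit. */
int count_if_gt_unrolled(const int* a, int len, int s)
{
  int n0 = 0, n1 = 0, n2 = 0, n3 = 0;
  int i = 0;
  for (; i + 3 < len; i += 4)
  {
    n0 += (a[i]     > s);
    n1 += (a[i + 1] > s);
    n2 += (a[i + 2] > s);
    n3 += (a[i + 3] > s);
  }
  int n = n0 + n1 + n2 + n3;
  for (; i < len; ++i) n += (a[i] > s);  /* remainder: fewer than 4 elements */
  return n;
}
```

Both functions should agree on every input; only the unrolled body would then be handed to a synthesis step.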
From Verification to Synthesis Condition Generation
Relational Verification:
◦ Prove two programs equivalent under equivalence relations on states
◦ y = x
◦ y = x1 + x2 + x3 + x4
◦ y = 5
◦ Only a few standard equivalence relations needed in practice
Prove results of two programs are equivalent by showing:
◦ If the programs are synchronously executed then at synchronization points the program states are always equivalent under the relations
◦ For our purposes, the synchronization points are the start and end of the loop body
Relational Verification
Left program:
int suml = 0;
for(int i = 0; i < len; i+=4)
{
  suml = suml + A[i];
  suml = suml + A[i+1];
  suml = suml + A[i+2];
  suml = suml + A[i+3];
}
Right program:
int sumr = 0;
int as0 = 0, as1 = 0, as2 = 0, as3 = 0;
for(int i = 0; i < len; i+=4)
{
  as0 = as0 + A[i];
  as1 = as1 + A[i+1];
  as2 = as2 + A[i+2];
  as3 = as3 + A[i+3];
}
sumr = as0 + as1 + as2 + as3;
Full Loop Invariant:
$suml = \sum_{j=0}^{i-1} A[j]$
Relational Invariant:
$suml = as0 + as1 + as2 + as3$
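One way to see what the relational invariant says is to run both loops in lockstep and check it at each synchronization point. The sketch below does this with runtime assertions. It is only a dynamic check on one input, not the static proof the slides describe, and the function name `sum_lockstep` is hypothetical. It assumes `len` is a non-negative multiple of 4 and that the sums do not overflow.

```c
#include <assert.h>

/* Runs the "left" and "right" sum loops in lockstep and checks the
 * relational invariant suml == as0 + as1 + as2 + as3 at the start and
 * end of every loop body. Returns the right program's result. */
int sum_lockstep(const int* A, int len)
{
  assert(len >= 0 && len % 4 == 0);   /* assumption carried over from the slide loops */

  int suml = 0;                            /* left program state  */
  int as0 = 0, as1 = 0, as2 = 0, as3 = 0;  /* right program state */

  for (int i = 0; i < len; i += 4)
  {
    assert(suml == as0 + as1 + as2 + as3);  /* sync point: body start */

    /* left program body */
    suml = suml + A[i];
    suml = suml + A[i + 1];
    suml = suml + A[i + 2];
    suml = suml + A[i + 3];

    /* right program body */
    as0 = as0 + A[i];
    as1 = as1 + A[i + 1];
    as2 = as2 + A[i + 2];
    as3 = as3 + A[i + 3];

    assert(suml == as0 + as1 + as2 + as3);  /* sync point: body end */
  }

  int sumr = as0 + as1 + as2 + as3;
  assert(suml == sumr);                     /* the two results agree */
  return sumr;
}
```

Note that the check never needs the full invariant with the summation over `A[j]`; relating the two states to each other is enough.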
From Verification to Condition Generation
We use “Product Programs” approach
◦ “Relational verification using product programs” FM ‘11
◦ Rename variables in “left” and “right” programs disjointly
◦ Interleave the programs “appropriately”
◦ Generates verification conditions on the combined program
Key Idea:
◦ Replace code in “right” program with an uninterpreted function (φ)
◦ Perform product program construction and VC generation
◦ The resulting VCs for φ are the needed synthesis pre/post-conditions
Relational Synthesis Condition
Left program:
int suml = 0;
for(int i = 0; i < len; i+=4)
{
  suml = suml + A[i];
  suml = suml + A[i+1];
  suml = suml + A[i+2];
  suml = suml + A[i+3];
}
Right program:
int sumr = 0;
m128i ac = [0, 0, 0, 0];
for(int i = 0; i < len; i+=4)
{
  φ(ac, sumr, A, i, len);
}
sumr = ac.0 + ac.1 + ac.2 + ac.3;
Relational Invariant:
$suml = ac.0 + ac.1 + ac.2 + ac.3$
Resulting Synthesis Condition
Pre-condition:
◦ ac == [v1, v2, v3, v4]
Post-condition:
◦ ac == [v1 + A[i], v2 + A[i+1], v3 + A[i+2], v4 + A[i+3]]
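The pre/post-condition pair can be read as a test that any candidate for φ must pass. The sketch below checks a candidate on one concrete pre-state. The 4-lane struct `vec4`, the reduced signature `phi_fn` (the slide's φ also takes `sumr` and `len`), and both candidate functions are assumptions for illustration. A real synthesizer would establish the condition for all inputs, not a single one.

```c
/* Hypothetical 4-lane accumulator standing in for an m128i value. */
typedef struct { int lane[4]; } vec4;

/* Reduced signature for a candidate body: updates ac from A[i..i+3]. */
typedef void (*phi_fn)(vec4* ac, const int* A, int i);

/* A candidate that should satisfy the condition: lane-wise add. */
void phi_lanewise_add(vec4* ac, const int* A, int i)
{
  for (int k = 0; k < 4; ++k) ac->lane[k] += A[i + k];
}

/* A candidate that should not: it adds A[i] to every lane. */
void phi_broadcast_add(vec4* ac, const int* A, int i)
{
  for (int k = 0; k < 4; ++k) ac->lane[k] += A[i];
}

/* Given the pre-state ac == [v1, v2, v3, v4], run phi and check the
 * post-state ac == [v1 + A[i], v2 + A[i+1], v3 + A[i+2], v4 + A[i+3]].
 * Returns 1 if the post-condition holds on this input, 0 otherwise. */
int satisfies_condition(phi_fn phi, vec4 pre, const int* A, int i)
{
  vec4 ac = pre;
  phi(&ac, A, i);
  for (int k = 0; k < 4; ++k)
    if (ac.lane[k] != pre.lane[k] + A[i + k]) return 0;
  return 1;
}
```

With a pre-state of `[10, 20, 30, 40]` and `A = {1, 2, 3, 4}`, the lane-wise candidate passes and the broadcast candidate fails in lanes 1 to 3.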
Instruction Sequence Search
Search space for SIMD instruction sequences is large
◦ Length: frequently need 8 or more instructions
◦ Branching: SSE has 200+ instructions
Concrete state space exploration
◦ Explore program states instead of instruction sequences
◦ Use concrete execution to quickly exclude many candidate instruction sequences
Query SMT solver for a counterexample input
◦ Repeat until either no counterexamples remain or the search gives up
Search for alternative sequences
◦ Can generate multiple solutions to find best performance on varying data sizes
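The idea of using concrete execution to discard candidates can be shown on a toy problem. The sketch below enumerates short sequences over a made-up three-instruction set and keeps the first one that maps a concrete input to the expected output (every lane holding the sum of all four input lanes). Everything here is an assumption for illustration: the instruction set, the names, and the plain enumeration of sequences. The approach on the slides explores program states and confirms candidates with an SMT solver, neither of which is reproduced here.

```c
#include <string.h>

#define LANES 4
#define MAX_LEN 6

/* Toy machine: an accumulator register r and a temporary register t. */
typedef struct { int r[LANES]; int t[LANES]; } state;

/* Toy instruction set (hypothetical, not SSE). */
enum { OP_ROT1, OP_ROT2, OP_ADD, NUM_OPS };

static void step(state* st, int op)
{
  if (op == OP_ADD) {                       /* r = r + t, lane-wise */
    for (int k = 0; k < LANES; ++k) st->r[k] += st->t[k];
  } else {                                  /* t = r rotated by 1 or 2 lanes */
    int shift = (op == OP_ROT1) ? 1 : 2;
    for (int k = 0; k < LANES; ++k) st->t[k] = st->r[(k + shift) % LANES];
  }
}

/* Concretely executes ops[0..n-1] on input `in`; result is register r. */
void run_sequence(const int* ops, int n, const int in[LANES], int out[LANES])
{
  state st;
  memcpy(st.r, in, sizeof st.r);
  memset(st.t, 0, sizeof st.t);
  for (int k = 0; k < n; ++k) step(&st, ops[k]);
  memcpy(out, st.r, sizeof st.r);
}

/* Does the sequence put the sum of all input lanes into every lane? */
static int matches(const int* ops, int n, const int in[LANES])
{
  int out[LANES], sum = 0;
  for (int k = 0; k < LANES; ++k) sum += in[k];
  run_sequence(ops, n, in, out);
  for (int k = 0; k < LANES; ++k) if (out[k] != sum) return 0;
  return 1;
}

/* Depth-first enumeration of all sequences of exactly n instructions. */
static int search(int* ops, int depth, int n, const int in[LANES])
{
  if (depth == n) return matches(ops, n, in);
  for (int op = 0; op < NUM_OPS; ++op) {
    ops[depth] = op;
    if (search(ops, depth + 1, n, in)) return 1;
  }
  return 0;
}

/* Iterative deepening: shortest sequence first. `ops` must have room for
 * MAX_LEN entries. Returns the sequence length, or -1 if none is found. */
int synthesize_hsum(int* ops)
{
  static const int example[LANES] = {1, 2, 4, 8};  /* concrete input */
  for (int n = 1; n <= MAX_LEN; ++n)
    if (search(ops, 0, n, example)) return n;
  return -1;
}
```

A sequence that passes on the single example input is only a candidate. Re-running it on fresh inputs with `run_sequence`, or handing it to a solver as the slides describe, is what would rule out a coincidental match.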
Optimize Search
Cost model provides upper bound on depth of search
◦ Also used to pick best operation to explore next and to pick shortest path from input to output state
Incrementally expand available instruction set
◦ Start with standard operations (and those seen in input code)
◦ Add more specialized operations if desired
Generate multiple initial input-output pairs
◦ One per path in original loop body
Stack machine construction to reduce the branching factor
Cost Model
Do not want to compute absolute costs
◦ A very hard problem
Compute relative costs
◦ Both programs run on the same data so same cache misses and branch taken/not taken
◦ Build simple machine model to encapsulate instruction costs
Cost function a polynomial in terms of loop counts and branch rates
◦ Use conservative static estimates for synthesis
◦ Can use runtime data for selection in JIT setting
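As a rough illustration of a relative cost function, the sketch below models each loop variant with a handful of per-instruction costs and evaluates a polynomial in the iteration count and a branch-taken rate. The struct `loop_model`, its fields, and every number in `SCALAR_MODEL` and `SIMD_MODEL` are invented for this example; they are not measurements or values from the slides.

```c
/* Hypothetical machine-model summary of one loop variant. */
typedef struct {
  double setup;          /* one-time cost before the loop              */
  double body;           /* cost of one main-loop iteration            */
  double taken_penalty;  /* extra cost when the body's branch is taken */
  int    stride;         /* elements consumed per main-loop iteration  */
  double tail;           /* cost per leftover element after the loop   */
} loop_model;

/* Invented example models (units are arbitrary). */
const loop_model SCALAR_MODEL = { 0.0, 3.0, 1.0, 1, 3.0 };
const loop_model SIMD_MODEL   = { 10.0, 5.0, 1.0, 4, 3.0 };

/* Cost as a polynomial in the loop count and the branch-taken rate:
 * setup + iters * (body + taken_rate * taken_penalty) + rest * tail. */
double estimated_cost(const loop_model* m, int len, double taken_rate)
{
  int iters = len / m->stride;
  int rest  = len % m->stride;
  return m->setup
       + iters * (m->body + taken_rate * m->taken_penalty)
       + rest * m->tail;
}

/* Relative comparison: returns 1 if variant b is estimated to be cheaper
 * than variant a for this length and branch rate, otherwise 0. */
int pick_cheaper(const loop_model* a, const loop_model* b,
                 int len, double taken_rate)
{
  return estimated_cost(b, len, taken_rate) < estimated_cost(a, len, taken_rate);
}
```

With these invented numbers the scalar variant is preferred for `len == 4` and the SIMD variant for `len == 1024`, mirroring the idea of choosing an implementation per input size. In a JIT setting `taken_rate` could come from profile data instead of a static estimate.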
Complete Algorithm
[Figure: pipeline diagram. Labels shown: Input Program (for(i ∈ I by c) {…}), Restructure Loop, Restructured Program (for(i ∈ I by 4c) {…}), Optimistic Vectorize, Cost Score, CPU Model, Cost Ranking Function, Merge & Cleanup, Body, Simulation Relation (Eq), Synth. Cond. Generation, Synthesis Cond., Synthesize, Final SIMD Program, Correctness Proof.]
SIMD Standard Library
Synthesize SIMD implementations of C++ STL and .Net BCL code
Consistent performance improvements
◦ Between 2x-4x on large inputs
◦ Avoid performance degradation on small inputs
Cost model accurately predicts performance
◦ Can pick best implementation based on hardware and input data
Library Function Performance
[Figure: two panels, Cyclic Hash and CountIf. Each plots Actual and Predicted series; y-axis ticks 0.5 to 4, x-axis ticks 4 to 2048.]
String Processing
Synthesize standard string functions using PCMPESTRI
◦ Packed Compare Explicit Length Strings, Return Index
Encoded semantics and provided them to synthesizer
◦ Synthesized range of common string functions with no other changes
◦ Speedup of 3.4x for String.Equals
◦ Speedup up to 9.5x for String.IndexOfAny
Impact In Practice
483.Xalan (SPEC CPU)
XML processing framework written in C++
Replaced STL calls with our SIMD implementations
Performance sensitive to input data
◦ Previous work replacing these calls with set structures was +15% to -20% on different data
Synthesized SIMD code produces consistent 2%-5% speedup
◦ Indicates a 1.15x to 1.5x speedup in the STL code, which is in line with cost model predictions
Benefits of Approach
Proof of correctness relating the original loop and the SIMD version
Separation of correctness and optimization
◦ Transform for performant code structure
◦ If a transformation is incorrect, the proof (or synthesis) will fail later
Approach consistently produces fast SIMD code
◦ Robust to details of SIMD instruction set and loop patterns
◦ 2x-4x speedups obtained from synthesized SIMD code
Future Work
Pointers and object structures
◦ Scatter-Gather support will help
◦ Compact object graphs into arrays (current work)
◦ Can we do local data structure transformations?
Apply technique to larger structures and more generally
◦ What about loops with small inner-loops (HashTable lookup)?
◦ Can we use synthesis as part of general code-gen?
Big Picture Conclusions
Big challenges and big benefits using specialized hardware
◦ Both performance and power!
Synthesis complements compilation
◦ Small step vs. big step code generation
◦ Verification structures synthesis (and eliminates compilation bugs)
◦ Can we apply ideas to other compiler actions? Target other hardware?
Idea more general than just compilers or SIMD synthesis
◦ Expert provided deductive structure
◦ Inductive synthesis driven by underlying semantics
◦ A powerful combination for approaching problems
Questions