Marwa A. Al-Shandawely PDC/KTH

Marwa A. Al-Shandawely PDC/KTH        Algorithm overview. Trivial parallelization. Problems. Sequential optimization Proposed solutions. Experimental results. Conclusions and future work. for i=1 to n-1 find pivotPos in column i if pivotPos ≠ i exchange rows(pivotPos,i) end if for j=i+1 to n A(i,j) = A(i,j)/A(i,i) end for j !$omp parallel do private ( i ,j ) for j=i+1 to n+1 for k=i+1 to n A(k,j)=A(k,j)-A(k,i)×A(i,j) end for k end for j end for i 8 7 6 5 4 3 2 1 0 2 3 N=1000 4 5 6 7 N=2000 N=3000 N=4000 N=5000 8 nThreads     Poor data locality Pivoting is done by master thread Overheads of creating and destroying threads at each iteration Sequential optimization     Replace division by the constant pivot Avoid loop invariant access in the inner most loop Eliminate the check for pivot changing position Make use of fortran array notation Do k=j+1,n A(k,j)=A(k,j)/A(j,j) End do C=1/A(j,j) A(j+1:n)=A(j+1:n)*c Pivots array    Pivots array Locks array Pivot holder ◦ ◦ ◦ ◦ ◦ ◦ P1 Eliminate (i) on column(i+1) Search (i+1) Store pivot (i+1) position Prepare colmn (i+1) Free lock (i+1) Eliminate (i) on rest of scope P2 P3 P4 Locks 12 10 8 6 4 2 0 1 2 N=1000 3 4 N=2000 N=3000 5 N=4000 6 N=5000 7 nThreads    The original algorithm requires pivot columns to be prepared in order while the whole matrix is accessed for each pivot column. For large input sizes; the cache is evicted many times for each iteration and there is no reuse of data in the cache. False sharing on pivots and locks array.  Double elimination on pivot holders. ◦ Knowledge of two pivots allow data reuse.  Each column is an accumulation of eliminations using previous columns! ◦ Make more pivots available each step and eliminate each column using several pivots while it is in the cache.      Pivots array Block of pivots Increase work/iter. Increase locality Less locks Load balancing?! P1 P2 P3 P4 Locks      Pivots array Block of pivots Increase work/iter. Increase locality Less locks Load balancing?! P1 P2 P3 P4 Locks N=2000 N=5000 9 16 8 14 7 12 Original 6 10 C=1 5 8 C=2 4 6 C=3 3 4 C=4 2 2 1 0 0 2 3 4 5 6 7 8 C=5 2 3 4 5 6 7 8 N=2000 N=5000 30 25 25 20 20 Original 15 15 double elimination 10 10 C=25 with double elimination 5 5 0 0 2 3 4 5 6 7 8 2 3 4 5 6 7 8  Scalable performance on multicores is highly dependent on application implementation, data layout and access patterns.  Cache and memory access optimization techniques is vital for performance despite the loss of readability.  Future work: ◦ Adaptive blocking scheme that changes the block size as a function of the matrix size, cache settings, and number of cores.

Marwa A. Al-Shandawely PDC/KTH

Related documents

Products

Support

Marwa A. Al-Shandawely PDC/KTH

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib