CSE-633 Parallel Algorithms, Spring 2009
Stationary Temperature Distribution in a Square Plate
Jesper Dybdahl Hede, UB#: 35964041

Problem to Solve
• An n x n plate with n^2 fields.
• Heaters (100 degrees) and coolers (-100 degrees) are placed along the edges of the plate.
[Figure: the n x n plate with heaters and coolers on its edges]

Problem to Solve
• Iterative process: in iteration i a new temperature value is calculated for each field on the plate, using the temperatures from iteration i-1:

  T_{x,y}(i) = \frac{T_{x-1,y}(i-1) + T_{x+1,y}(i-1) + T_{x,y-1}(i-1) + T_{x,y+1}(i-1)}{4}

[Figure: the plate with x and y axes; the edge fields are held at 100 and -100 degrees]

Solving the Problem
• Several processors are used to solve the problem: h in the horizontal direction, v in the vertical direction, h*v processors in total.
• Each processor gets an equal part of the plate, i.e. a smaller rectangle.
• Each processor does all calculations within its own rectangle and exchanges the values on its edges with the neighbouring processors.
• Processors on the edges of the plate use the values from the heaters/coolers, i.e. static values.

Algorithm
0 Initial calculations (find neighbours, set up arrays)
1 Loop:
2   Calculate new values for each field in the rectangle
3   Synchronize with neighbours
4 Gather and save the result in a text file
Basically, it is the loop (steps 1-3) that is being parallelized. (Code sketches of the mesh setup, the update loop and the stop condition are given after the results.)

Implementation
• h-by-v mesh of PEs.
  – Communication diameter: Θ(max(h, v))
  – Bisection width: Θ(min(h, v))
  – Using h-by-v instead of x-by-x gives more options for #PEs.
• Loop stop condition:
  – Each processor sends an INT to its neighbours along with the edge values:
      if (this PE updated some field by more than the threshold) send 0
      else send (minimum known value) + 1
  – The minimum known value is the minimum of the PE's own value and the values received in the last iteration.
  – Any PE: if the minimum known value exceeds v + h, end the loop.

Result: Temperature in plate
[3D surface chart: temperature across the plate, colour bands from -100 to 100 degrees in 20-degree steps]

Problem!
• The program turns out to be less than ideal for parallelization.
[Chart: running time (s), 0 to 1.4 s, versus #PEs (4, 6, 9)]

Problem!
• MPI communication dominates the arithmetic calculations.
• Not able to make the plate as large as wanted (got errors above 150x150 fields).
• Solution? Do more iterations of calculation between communication rounds.
  – Reduces the number of sync rounds from ~2000 to ~25.
[Chart: running time (s) versus #iterations between sync (0 to 1200), on 4 PEs]
• 100 iterations per synchronization is the sweet spot!
[Chart: running time (s) versus #PEs (4, 6, 9) for 1, 100, 500, 1000 and 1500 iterations between sync]

Results: Running time
• Runs with a varying number of PEs, with 100 iterations per sync; average over 10 runs.
[Chart: running time (s), 0 to 0.6 s, versus #PEs, 0 to 160]

Results: Running time
• Runs with a varying number of PEs, with 100 iterations per sync; average over 10 runs.
  – Minimum at 64 PEs: 0.112559 s.
  – Little gain beyond 30 PEs.
  – Smaller speedup than expected.

Results: Speedup
• Runs with a varying number of PEs, with 100 iterations per sync; average over 10 runs.
[Chart: speedup, 0 to 6, versus #PEs, 0 to 160]
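
Code sketch: finding neighbours on the h-by-v mesh
The slides only say that each PE finds its neighbours during the initial calculations. A minimal sketch of how that could be done with an MPI Cartesian communicator is shown below; the function name make_mesh, the dimension ordering and the use of MPI_PROC_NULL for plate edges are assumptions of mine, not taken from the original program.

```c
#include <mpi.h>

/* Build the h-by-v process mesh and find this PE's four neighbours.
 * Non-periodic dimensions mean that PEs on the edge of the plate get
 * MPI_PROC_NULL as the neighbour on that side, which is where the static
 * heater/cooler values are used instead of received edge values. */
static MPI_Comm make_mesh(int h, int v,
                          int *up, int *down, int *left, int *right)
{
    int dims[2]    = { v, h };   /* assumed: v rows of PEs, h columns of PEs */
    int periods[2] = { 0, 0 };   /* the plate does not wrap around */
    MPI_Comm mesh;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &mesh);
    MPI_Cart_shift(mesh, 0, 1, up, down);     /* neighbours in the row direction    */
    MPI_Cart_shift(mesh, 1, 1, left, right);  /* neighbours in the column direction */
    return mesh;
}
```

The neighbour ranks returned here are the ones a halo exchange, such as the one sketched next, would use.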
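
Code sketch: local update loop with halo exchange
The update formula, the exchange of edge values with neighbouring PEs, and the optimization of running many iterations between synchronizations could be combined roughly as follows. This is a hedged illustration, not the original implementation: the ghost-border storage layout, the helper names (jacobi_sweep, copy_border, exchange_halo, relax_round) and the constant K_SYNC = 100 are assumptions of mine. Between syncs each PE iterates on slightly stale neighbour values, which is exactly the trade-off the slides describe.

```c
#include <mpi.h>
#include <stdlib.h>

#define K_SYNC 100   /* assumed: sweeps of local computation per sync round */

/* One Jacobi sweep over the interior of the local (rows x cols) block.
 * src and dst are (rows+2) x (cols+2) arrays with a one-cell ghost border;
 * the border holds either neighbour values or the static heater/cooler
 * temperatures (+100 / -100) on the plate edge. */
static void jacobi_sweep(const double *src, double *dst, int rows, int cols)
{
    int stride = cols + 2;
    for (int x = 1; x <= rows; x++)
        for (int y = 1; y <= cols; y++)
            dst[x * stride + y] = 0.25 * (src[(x - 1) * stride + y] +
                                          src[(x + 1) * stride + y] +
                                          src[x * stride + (y - 1)] +
                                          src[x * stride + (y + 1)]);
}

/* Copy the ghost border so the boundary values survive the buffer swap. */
static void copy_border(const double *src, double *dst, int rows, int cols)
{
    int stride = cols + 2;
    for (int y = 0; y < stride; y++) {
        dst[y] = src[y];                                             /* top    */
        dst[(rows + 1) * stride + y] = src[(rows + 1) * stride + y]; /* bottom */
    }
    for (int x = 1; x <= rows; x++) {
        dst[x * stride] = src[x * stride];                           /* left   */
        dst[x * stride + cols + 1] = src[x * stride + cols + 1];     /* right  */
    }
}

/* Exchange edge values with the four mesh neighbours.  A neighbour rank of
 * MPI_PROC_NULL (a plate edge) makes the matching send/recv a no-op, so the
 * ghost cells there keep the static heater/cooler values set at start-up. */
static void exchange_halo(double *T, int rows, int cols,
                          int up, int down, int left, int right, MPI_Comm comm)
{
    int stride = cols + 2;
    double *sbuf = malloc(rows * sizeof *sbuf);
    double *rbuf = malloc(rows * sizeof *rbuf);

    /* rows are contiguous: send first/last interior row, receive into ghosts */
    MPI_Sendrecv(&T[1 * stride + 1],          cols, MPI_DOUBLE, up,   0,
                 &T[(rows + 1) * stride + 1], cols, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&T[rows * stride + 1],       cols, MPI_DOUBLE, down, 1,
                 &T[0 * stride + 1],          cols, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);

    /* columns are strided: pack into a temporary buffer first */
    for (int x = 1; x <= rows; x++) sbuf[x - 1] = T[x * stride + 1];
    MPI_Sendrecv(sbuf, rows, MPI_DOUBLE, left,  2,
                 rbuf, rows, MPI_DOUBLE, right, 2, comm, MPI_STATUS_IGNORE);
    if (right != MPI_PROC_NULL)
        for (int x = 1; x <= rows; x++) T[x * stride + cols + 1] = rbuf[x - 1];

    for (int x = 1; x <= rows; x++) sbuf[x - 1] = T[x * stride + cols];
    MPI_Sendrecv(sbuf, rows, MPI_DOUBLE, right, 3,
                 rbuf, rows, MPI_DOUBLE, left,  3, comm, MPI_STATUS_IGNORE);
    if (left != MPI_PROC_NULL)
        for (int x = 1; x <= rows; x++) T[x * stride] = rbuf[x - 1];

    free(sbuf);
    free(rbuf);
}

/* One communication round: exchange halos once, then run K_SYNC local sweeps
 * before talking to the neighbours again -- the optimization that cut the
 * number of sync rounds from ~2000 to ~25 in the slides. */
static void relax_round(double **T, double **Tnew, int rows, int cols,
                        int up, int down, int left, int right, MPI_Comm comm)
{
    exchange_halo(*T, rows, cols, up, down, left, right, comm);
    for (int k = 0; k < K_SYNC; k++) {
        jacobi_sweep(*T, *Tnew, rows, cols);
        copy_border(*T, *Tnew, rows, cols);
        double *tmp = *T; *T = *Tnew; *Tnew = tmp;   /* swap buffers */
    }
}
```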
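
Code sketch: distributed stop condition
The stop condition above (send 0 while still changing, otherwise send the minimum known value plus one, and stop once the minimum known value exceeds v + h) can be written as a pair of small helpers. The function and parameter names are illustrative assumptions; the slides give only the rule itself.

```c
/* Value a PE sends to its neighbours along with the edge values.
 * While the PE still updates some field by more than the threshold it
 * sends 0; once its own fields have stopped changing it sends
 * (minimum known value) + 1, where the minimum is taken over its own
 * value and the values received from the neighbours in the last round. */
static int counter_to_send(int updated_over_threshold, int own_value,
                           const int *received, int n_received)
{
    if (updated_over_threshold)
        return 0;

    int min_known = own_value;
    for (int i = 0; i < n_received; i++)
        if (received[i] < min_known)
            min_known = received[i];
    return min_known + 1;
}

/* A PE may end the loop once the minimum known value exceeds v + h:
 * by then the "no PE is still changing" information has had enough
 * rounds to propagate across the whole h-by-v mesh. */
static int may_stop(int min_known, int h, int v)
{
    return min_known > v + h;
}
```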