Parallelization of the Telemedicine Benchmark for the Xbox 360 Architecture

Howard Wong, SURF-IT Fellow
Professor Jean-Luc Gaudiot, EECS
August 29, 2008
PASCAL: PArallel Systems and Computer Architecture Lab.
University of California, Irvine


Outline
- Background (Benchmark, Platform)
- Current Work
- Methodology (Compiler, Data Set)
- Results
- Conclusions
- Future Work


Background Work
- Why Parallel Programming?
  - Advent of everyday multicomputers
  - Ultimate goal: Auto-parallelization
- Basic concepts
  - Problems
  - Programming primitives
- Telemedicine Benchmark
- Platform: Xbox 360
  - 3 Cores
  - Graphics Engine
  - Vector Processing
[Figure: block diagram of Core 1, Core 2, ..., Core n]


Current Work
- Goal: Identify the parallelization process
  - Efficiency measured in performance
  - Performance in relation to load
- POSIX threads (pthreads) and OpenMP
- Sorting Routines
  - 'fallbackSort': Making search 'brackets'
  - 'mainSort': Dependencies between loop iterations


Methodology
- Compilation
  - gcc or g++ version 4.2
- Data Sets
  - Monkey brain image in PPM format
  - Derived data via netpbm
- Test Platform
  - Xbox 360 with Ubuntu Linux
Images courtesy of Neuroscience Center, UC Davis, and Joerg Meyer, Center of GRAVITY, Calit2, UC Irvine.


Initial Results
[Figure: Speedup versus Number of Threads. Compression of brains.ppm, compared to bzip2. Series: pbzip2, bzip2mod, Linear. x-axis: No. of Threads (0 to 4); y-axis: Speedup (0.000 to 3.500).]


Analysis
- Possible thread contention
  - 'bitmap' of data as former optimization
  - Optimized for long runs of 0's or 1's
  - Extra mutex locks required
- Thread Creation
  - Sorting algorithm called at least 300 times for the large image
  - Thread creation efficiency
  - Thread management structures
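
Note on the speedup metric: the Initial Results plot reports speedup relative to bzip2, with a 'Linear' reference line. A minimal sketch of how such a figure is conventionally computed is given below, assuming speedup is the ratio of baseline wall-clock time to parallel wall-clock time. The function names and the timings are hypothetical and are not measurements from this work.

```python
def speedup(t_baseline: float, t_parallel: float) -> float:
    """Ratio of baseline (serial) run time to parallel run time."""
    if t_baseline <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_baseline / t_parallel


def efficiency(t_baseline: float, t_parallel: float, n_threads: int) -> float:
    """Speedup per thread; 1.0 corresponds to a linear-speedup reference."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return speedup(t_baseline, t_parallel) / n_threads


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
baseline_seconds = 12.0
parallel_seconds = 4.0
print(speedup(baseline_seconds, parallel_seconds))
print(efficiency(baseline_seconds, parallel_seconds, 3))
```

Under this definition a speedup below 1.0 means the parallel version runs slower than the serial baseline, which is what thread creation and contention overheads can produce.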


Results (Cont'd)
[Figure: Speedup versus Load (pbzip2 - 3 Threads). Compared to bzip2; 1/4, 1/2, whole image. x-axis: Fraction of Image Processed (0.000 to 1.000); y-axis: Speedup (2.800 to 3.050).]
[Figure: Speedup versus Load (bzip2mod - 2 Threads). Compared to bzip2; 1/4, 1/2, whole image. x-axis: Fraction of Image Processed (0.000 to 1.000); y-axis: Speedup (0.630 to 0.690).]


Conclusions & Discussion
- Speedup dependent on the load size
- Possible improvements
  - Use a 'threadpool'
  - Create other important compression functions
  - Examine alternative algorithms with a parallel mindset
- End result
  - Thread creation
  - Thread management overhead
  - Heavy contention


Questions for Future Work
- What is the impact of thread creation?
- Do the other TMB programs have the same features?
- Can vector instructions improve program performance?
- Are new, more efficient parallel programming primitives needed for our application?


Acknowledgments
- Professor Jean-Luc Gaudiot and the PASCAL group
- UC Davis Neuroscience Center
- Professor Joerg Meyer, Center of GRAVITY, Calit2
- Calit2 UROP
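

Note on the 'threadpool' improvement: the idea listed under Conclusions & Discussion is to create worker threads once and reuse them, rather than paying thread creation cost on every call. The sketch below illustrates that structure in Python using the standard `bz2` and `concurrent.futures` modules. It is not the pthreads code from this work. The function name `compress_blocks`, the 900,000-byte block size, and the choice of 3 workers are assumptions made for illustration, and the payload is a stand-in for the PPM image.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data: bytes, block_size: int = 900_000, workers: int = 3) -> bytes:
    """Compress independent blocks in a reusable pool and concatenate the streams."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # The pool keeps at most `workers` threads and reuses them for every block,
    # so no new thread is spawned per unit of work.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the streams stay in sequence.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


# Stand-in payload (about 2 MB); not the image used in the talk.
payload = bytes(range(256)) * 8_000
packed = compress_blocks(payload)
# bz2.decompress accepts concatenated streams, so a round trip checks correctness.
assert bz2.decompress(packed) == payload
```

Because each block is compressed independently, the workers share no mutable state and need no mutex locks. The trade-off is that block boundaries are fixed in advance, which can cost some compression ratio compared with a single stream.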