Parallelization of the Telemedicine Benchmark for the Xbox 360 Architecture
Howard Wong, SURF-IT Fellow
Professor Jean-Luc Gaudiot, EECS
August 29, 2008
PASCAL: PArallel Systems and Computer Architecture Lab.
University of California, Irvine
Outline
- Background (Benchmark, Platform)
- Current Work
- Methodology (Compiler, Data Set)
- Results
- Conclusions
- Future Work

Background Work
- Why Parallel Programming?
  - Advent of everyday multicomputers
  - Ultimate goal: Auto-parallelization
  - Basic concepts
    - Problems
    - Programming primitives
- Telemedicine Benchmark
- Platform – Xbox 360
  - 3 Cores
  - Graphics Engine
  - Vector Processing
[Figure: multicore diagram labeled Core 1, Core 2, ..., Core n]

Current Work
- Goal: Identify the parallelization process
  - Efficiency measured in performance
  - Performance in relation to load
  - POSIX threads (pthreads) and OpenMP
- Sorting Routines
  - 'fallbackSort'
    - Making search 'brackets'
  - 'mainSort'
    - Dependencies between loop iterations

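To make the 'mainSort' bullet concrete, the minimal sketch below contrasts a loop whose iterations depend on each other with one whose iterations are independent. It is written in Python for brevity rather than in the C of the benchmark, and both functions (`running_total`, `squares_parallel`) are hypothetical illustrations, not code taken from bzip2.

```python
from concurrent.futures import ThreadPoolExecutor


def running_total(values):
    """Loop-carried dependency: iteration i needs the result of iteration i-1."""
    totals = []
    acc = 0
    for v in values:
        acc += v              # reads acc written by the previous iteration
        totals.append(acc)
    return totals


def squares_parallel(values, workers=3):
    """Independent iterations: any thread may process any element."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order regardless of completion order.
        return list(pool.map(lambda v: v * v, values))


data = [1, 2, 3, 4, 5]
print(running_total(data))      # [1, 3, 6, 10, 15]
print(squares_parallel(data))   # [1, 4, 9, 16, 25]
```

The first loop cannot simply be split across threads, because each step consumes the previous step's value. The second can, because no iteration reads another iteration's output.
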
Methodology
- Compilation
  - gcc or g++ version 4.2
- Data Sets
  - Monkey brain image in PPM format
  - Derived data via netpbm
- Test Platform
  - Xbox 360 with Ubuntu Linux
Images courtesy of Neuroscience Center, UC Davis, and Joerg Meyer, Center of GRAVITY, Calit2, UC Irvine.

Initial Results
[Figure: Speedup versus Number of Threads. Compression of brains.ppm, compared to bzip2. Series: pbzip2, bzip2mod, Linear. x-axis: No. of Threads (0 to 4); y-axis: Speedup (0.000 to 3.500).]

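The speedup plotted here is taken relative to serial bzip2. The minimal sketch below shows the arithmetic, assuming the usual definition of speedup as serial time divided by parallel time. The timings are hypothetical values chosen for illustration, not measurements from this work.

```python
def speedup(t_serial, t_parallel):
    """Speedup relative to the serial baseline (here, plain bzip2)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, threads):
    """Fraction of the ideal linear speedup achieved with `threads` threads."""
    return speedup(t_serial, t_parallel) / threads


# Hypothetical wall-clock times in seconds, for illustration only.
baseline = 12.0
print(speedup(baseline, 4.0))         # 3.0, on the linear reference at 3 threads
print(speedup(baseline, 18.0) < 1.0)  # True, the parallel run is slower than serial
print(efficiency(baseline, 4.0, 3))   # 1.0
```

A speedup below 1.0 means the parallel version takes longer than the serial baseline.
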
Analysis
- Possible thread contention
  - 'bitmap' of data as former optimization
  - Optimized for long runs of 0's or 1's
  - Extra mutex locks required
- Thread Creation
  - Sorting algorithm called at least 300 times for the large image
  - Thread creation efficiency
  - Thread management structures

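One rough way to gauge the thread creation cost raised above is to time a create/start/join cycle repeated once per call. The sketch below does this in Python as an assumption-laden stand-in for pthreads. `time_thread_creation` is a hypothetical helper, the default of 300 mirrors the call count on the slide, and the absolute numbers will differ from those of a C program on the Xbox 360.

```python
import threading
import time


def time_thread_creation(calls=300):
    """Create, start, and join one short-lived thread per call.

    Returns (elapsed_seconds, completed_count).
    """
    done = []

    def tiny_task():
        # Threads run one at a time here (join before the next start),
        # so no lock is needed around the append.
        done.append(1)

    start = time.perf_counter()
    for _ in range(calls):
        t = threading.Thread(target=tiny_task)
        t.start()
        t.join()
    elapsed = time.perf_counter() - start
    return elapsed, len(done)


elapsed, completed = time_thread_creation()
print(f"{completed} threads in {elapsed:.4f} s "
      f"({elapsed / completed * 1e6:.1f} us per create/join)")
```

Because the task does almost no work, nearly all of the measured time is creation and teardown overhead.
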
Results (Cont’d)
[Figure: Speedup versus Load (pbzip2 - 3 Threads). Compared to bzip2; 1/4, 1/2, whole image. x-axis: Fraction of Image Processed (0.000 to 1.000); y-axis: Speedup (2.800 to 3.050).]
[Figure: Speedup versus Load (bzip2mod - 2 Threads). Compared to bzip2; 1/4, 1/2, whole image. x-axis: Fraction of Image Processed (0.000 to 1.000); y-axis: Speedup (0.630 to 0.690).]

Conclusions & Discussion
- Speedup dependent on the load size
- Possible improvements
  - Use a 'threadpool'
  - Create other important compression functions
  - Examine alternative algorithms with a parallel mindset
- End result
  - Thread creation
  - Thread management overhead
  - Heavy contention

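The 'threadpool' suggestion can be sketched as follows: a fixed set of worker threads is created once and reused for every block, instead of new threads being spawned on each call. The example uses Python's standard `bz2` and `concurrent.futures` modules as a stand-in for a pthreads pool. `compress_blocks`, its block size, and its worker count are assumptions for illustration. Splitting the input into independently compressed blocks is similar in spirit to pbzip2, not a description of how bzip2mod works.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data, block_size=900_000, workers=3):
    """Compress fixed-size blocks on a reusable pool of worker threads.

    Each block becomes an independent bzip2 stream; the streams are
    concatenated in their original order.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the output streams stay in sequence.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


payload = bytes(range(256)) * 8192           # 2 MiB of synthetic data
packed = compress_blocks(payload)
assert bz2.decompress(packed) == payload     # concatenated streams decompress as one
print(len(payload), "->", len(packed), "bytes")
```

The pool pays the thread creation cost once per run rather than once per block. Whether that removes the overhead seen in the results would need to be measured on the target platform.
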
Questions for Future Work
- What is the impact of thread creation?
- Do the other TMB programs have the same features?
- Can vector instructions improve program performance?
- Are new, more efficient parallel programming primitives needed for our application?

Acknowledgments
- Professor Jean-Luc Gaudiot and the PASCAL group
- UC Davis Neuroscience Center
- Professor Joerg Meyer, Center of GRAVITY, Calit2
- Calit2
- UROP
