Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 24th, 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, &
MEANS
APPLIED PARALLEL ALGORITHMS 4
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Puzzle of the Day
Duff's device: what is going on here?

void copy(char *to, char const *from, int count)
{
    int n = (count + 3) / 4;
    switch (count % 4) {              /* 'case' defines jump labels only!          */
    case 0: do { *to++ = *from++;
    case 3:      *to++ = *from++;     /* missing 'break' makes code 'fall through' */
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Time and Frequency Domain
Representation of Signals
• Two ways of looking at the same signal
Example 1: Time and frequency domain representations of a
sine wave
http://robots.freehostia.com/Radio/Image137.gif
http://www.theparticle.com/cs/bc/mcs/signalnotes.pdf
Example 2
Time and frequency domain
representations of a 4Hz +
12Hz Sine Wave
http://www.theparticle.com/cs/bc/mcs/signalnotes.pdf
Fourier Analysis
• Fourier analysis: Represent continuous functions by
potentially infinite series of sine and cosine functions
NOTE: The signal is composed of a sum of sine and cosine functions
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
http://zone.ni.com/cms/images/devzone/tut/a/8c34be30580.gif
Fourier Analysis
Nice demo: http://www.imaios.com/en/e-Courses/e-MRI/image-formation/Fourier-transform
Fourier Representation of Square
Wave
• Spectrum extends to infinity
• As we move from left to right on the frequency axis, the amplitude of the components decreases monotonically
http://www.engr.colostate.edu/~dga/mechatronics/figures/4-5.gif
Fourier Representation of Square Wave
• Synthesis of a square wave (of zero DC component) from its frequency-domain components
• Ideal square wave is represented by the thick black line
http://mathworld.wolfram.com/FourierSeriesSquareWave.html
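For reference (this formula is not spelled out on the original slide), the series being synthesized above is the standard Fourier expansion of a unit-amplitude square wave with zero DC component, where ω_0 is the fundamental angular frequency:

    f(t) = (4/π) · [ sin(ω_0·t) + (1/3)·sin(3ω_0·t) + (1/5)·sin(5ω_0·t) + … ]
         = (4/π) · Σ_{k odd} sin(k·ω_0·t) / k

Truncating the sum after a few terms gives the partial approximations shown in the MathWorld reference above.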
Fourier Representation of Square Wave
Nice demo: http://www.imaios.com/en/e-Courses/e-MRI/image-formation/Fourier-transform
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Digital Signals
• Digital signal: a signal that is both discrete (in time) and quantized (in amplitude)
• Digital signals can be obtained by sampling analog signals
• The figure represents an analog-to-digital converter that performs sampling and quantization
Digital Signal Processing
• Processing of digital signals with the help of a computer
Continuous input → A/D converter → digital signal processing → D/A converter → continuous output
http://www.ece.rochester.edu/courses/ECE446/Introduction%20to%20Digital%20Signal%20Processing.pdf
Advantages of Digital Signal
Processing
• A digital system can simply be reprogrammed for other applications, ported to different hardware, or duplicated (reconfiguring an analog system means hardware redesign, testing, verification)
• DSP provides better control of accuracy requirements (an analog system depends on strict component tolerances, and its response may drift with temperature)
• Digital signals can be easily stored without deterioration (analog signals are not easily transportable and often can't be processed off-line)
• More sophisticated signal processing algorithms can be implemented (it is difficult to perform precise mathematical operations in analog form)
Adapted from http://www-sigproc.eng.cam.ac.uk/~op205/3F3_1_Introduction_to_DSP.pdf
Why use Discrete Fourier Transform?
• Digital Signal Processing applications often require
mapping of data in the time domain to its frequency
domain counterparts
• Many applications in science, engineering
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
Example 1
• Spectrogram of Speech Signal
NOTE: Spectrogram is a 3D representation of signal amplitude vs
time and frequency
http://ccrma.stanford.edu/~jos/st/Spectrogram_Speech.html
Example 2
• Removing blemishes from a photograph
To filter an image in the frequency domain:
  1. Compute F(u,v), the DFT of the image
  2. Multiply F(u,v) by a filter function H(u,v)
  3. Compute the inverse DFT of the result
The DFT is used to convert image data from the spatial (2D) domain to the frequency domain before filtering, and to convert it back to the spatial domain afterwards.
Output of different Gaussian low-pass filters for removing blemishes
Adapted from www.comp.dit.ie/bmacnamee/materials/dip/lectures/ImageProcessing7-FrequencyFiltering.ppt
Discrete Fourier Transform (Qualitative)
• Discrete Fourier transform: maps a sequence over time to another sequence over frequency
  – signal strength as a function of time → Fourier coefficients as a function of frequency
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
DFT Example (1/4)
16 data points representing signal strength over time
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
DFT Example (2/4)
DFT yields amplitudes and frequencies of sine/cosine functions
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
DFT Example (3/4)
Plot of four constituent sine/cosine functions and their sum
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
DFT Example (4/4)
Continuous function and original 16 samples
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
Formal Definition of DFT
• The DFT of a discrete signal x[n] of N sample points is defined as

    X[k] = Σ_{n=0}^{N-1} x[n]·ω^{nk},    ω = e^{2πi/N},    for 0 ≤ k < N

• Direct implementation of this equation requires on the order of N^2 complex additions and multiplications
NOTE: The DFT of an N-point sequence gives N points in the transform domain
http://cas.ensmp.fr/~chaplais/wavetour_presentation/transformees/Fourier/FFTUS.html
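As a concrete illustration (not part of the original slides), a direct Θ(N^2) implementation of this definition might look like the following C sketch; it keeps the slide's convention ω = e^{2πi/N} and uses the C99 complex type.

#include <complex.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Direct O(N^2) DFT using the slide's convention w = e^(2*pi*i/N).
   x: N input samples, X: N output coefficients (both caller-allocated). */
void naive_dft(const double complex *x, double complex *X, int N)
{
    for (int k = 0; k < N; k++) {
        X[k] = 0.0;
        for (int n = 0; n < N; n++)
            X[k] += x[n] * cexp(2.0 * M_PI * I * n * k / N);   /* x[n] * w^(nk) */
    }
}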
Formal Definition of DFT
• Complex plane: relation of the different powers of ω for N = 8
• With ω_8 = e^{2πi/8}, the powers ω_8^0 = 1, ω_8^1, …, ω_8^7 are the eight 8th roots of unity; they lie evenly spaced around the unit circle of the complex plane (ω_8^0 = 1 on the positive real axis, ω_8^2 = i, ω_8^4 = -1, ω_8^6 = -i)

Forward transform:

    X[k] = Σ_{n=0}^{N-1} x[n]·ω^{nk},    ω = e^{2πi/N},    0 ≤ k < N

Inverse transform:

    x[n] = (1/N) Σ_{k=0}^{N-1} X[k]·ω^{-nk},    0 ≤ n < N
Computing DFT
• Writing the previous definition of the DFT in matrix form gives the matrix-vector product X = F_n x, where

    X[k] = Σ_{n=0}^{N-1} x[n]·ω^{nk},    ω = e^{2πi/N},    0 ≤ k < N

  – x is the input vector (signal samples)
  – each element of F_n is f_{i,j} = ω_n^{ij} for 0 ≤ i, j < n, where ω_n is the primitive nth root of unity
  – X is the output vector (discrete Fourier coefficients)

NOTE: ω_n is the complex number e^{2πi/n}
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
Example 1
How to compute the DFT of a vector having two elements?
• Example Vector: (2, 3)
• ω_2, the primitive square root of unity, is -1
• Forward transform, X = F_2 x, with F_2 = [ ω_2^{jk} ]:

    [ 1   1 ] [ 2 ]   [  5 ]
    [ 1  -1 ] [ 3 ] = [ -1 ]

• Inverse transform, x = (1/2) F_2^{-1} X (for ω_2 = -1 the inverse matrix has the same entries):

    (1/2) [ 1   1 ] [  5 ]   [ 2 ]
          [ 1  -1 ] [ -1 ] = [ 3 ]
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
Example 2
How to compute the DFT of a vector having four elements?
• Example vector: (1, 2, 4, 3)
• ω_4, the primitive 4th root of unity, is i

    [ ω_4^0  ω_4^0  ω_4^0  ω_4^0 ] [ x_0 ]   [ 1   1   1   1 ] [ 1 ]   [  10  ]
    [ ω_4^0  ω_4^1  ω_4^2  ω_4^3 ] [ x_1 ]   [ 1   i  -1  -i ] [ 2 ]   [ -3-i ]
    [ ω_4^0  ω_4^2  ω_4^4  ω_4^6 ] [ x_2 ] = [ 1  -1   1  -1 ] [ 4 ] = [   0  ]
    [ ω_4^0  ω_4^3  ω_4^6  ω_4^9 ] [ x_3 ]   [ 1  -i  -1   i ] [ 3 ]   [ -3+i ]
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
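As a quick numerical check (a hypothetical test driver, not from the slides), the naive_dft sketch given after the DFT definition reproduces this result:

#include <stdio.h>
#include <complex.h>

void naive_dft(const double complex *x, double complex *X, int N);  /* from the earlier sketch */

int main(void)
{
    double complex x[4] = { 1, 2, 4, 3 }, X[4];
    naive_dft(x, X, 4);
    for (int k = 0; k < 4; k++)          /* prints 10, -3-i, 0, -3+i (up to rounding) */
        printf("X[%d] = %.1f %+.1fi\n", k, creal(X[k]), cimag(X[k]));
    return 0;
}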
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Why Fast Fourier Transform(FFT)?
• Reduce the computational operations required:
  – straightforward implementation: Θ(n^2)
  – fast Fourier transform: Θ(n log n)
  – Θ(n log n) << Θ(n^2) for large values of n
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
Fast Fourier Transform
The Fourier matrix F_N can be decomposed into half-size Fourier matrices F_{N/2}:

    F_N = [ I    D_{N/2} ] [ F_{N/2}     0     ] P_N
          [ I   -D_{N/2} ] [    0     F_{N/2}  ]

    I   : identity matrix
    P_N : permutation matrix (row reordering: first the even-indexed rows, then the odd-indexed ones)
    D_N : diagonal matrix diag(1, ω, ω^2, …, ω^{N-1})

Example (N = 4, ω = ω_4 = i):

    F_4 = [ 1  1    1    1   ]   [ 1   1   1   1 ]
          [ 1  ω    ω^2  ω^3 ] = [ 1   i  -1  -i ]
          [ 1  ω^2  ω^4  ω^6 ]   [ 1  -1   1  -1 ]
          [ 1  ω^3  ω^6  ω^9 ]   [ 1  -i  -1   i ]

        = [ 1  0   1   0 ] [ 1   1   0   0 ] [ 1  0  0  0 ]
          [ 0  1   0   i ] [ 1  -1   0   0 ] [ 0  0  1  0 ]
          [ 1  0  -1   0 ] [ 0   0   1   1 ] [ 0  1  0  0 ]
          [ 0  1   0  -i ] [ 0   0   1  -1 ] [ 0  0  0  1 ]

i.e. F_4 = [ I  D_2 ; I  -D_2 ] · [ F_2  0 ; 0  F_2 ] · P_4, with D_2 = diag(1, i)
Fast Fourier Transform
• Based on divide-and-conquer strategy
• We want to evaluate f(x), a polynomial with n coefficients (n a power of 2), at the n complex nth roots of unity
• We define two new functions, f[0] and f[1], from the even- and odd-numbered coefficients:

    f(x)    = a_0 + a_1·x + a_2·x^2 + … + a_{n-1}·x^{n-1}
    f[0](x) = a_0 + a_2·x + a_4·x^2 + … + a_{n-2}·x^{n/2-1}
    f[1](x) = a_1 + a_3·x + a_5·x^2 + … + a_{n-1}·x^{n/2-1}

so that

    f(x) = f[0](x^2) + x·f[1](x^2)
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
FFT (Cont…)
• The problem of evaluating f(x) at the n values ω_n^0, ω_n^1, …, ω_n^{n-1} reduces to
  a) evaluating f[0](x) and f[1](x) at the n/2 values (ω_n^0)^2, (ω_n^1)^2, …, (ω_n^{n/2-1})^2
  b) combining the results as f(x) = f[0](x^2) + x·f[1](x^2)
• This leads to a recursive algorithm with time complexity Θ(n log n)
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
Recursive Sequential Implementation
of FFT
Recursive_FFT(a, n)

Parameters:
    n             number of elements in a
    a[0…(n-1)]    coefficients

Local:
    ω_n           primitive nth root of unity
    ω             evaluate the polynomial at this point
    a[0]          even-numbered coefficients
    a[1]          odd-numbered coefficients
    y             result of the transform
    y[0]          result of the FFT of a[0]
    y[1]          result of the FFT of a[1]
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
Recursive Sequential Implementation
of FFT (Cont…)
if n = 1 then
    return a
else
    ω_n ← e^{2πi/n}
    ω   ← 1
    a[0] ← (a[0], a[2], …, a[n-2])
    a[1] ← (a[1], a[3], …, a[n-1])
    y[0] ← Recursive_FFT(a[0], n/2)
    y[1] ← Recursive_FFT(a[1], n/2)
    for k ← 0 to n/2 - 1 do
        y[k]       ← y[0][k] + ω · y[1][k]
        y[k + n/2] ← y[0][k] - ω · y[1][k]
        ω ← ω · ω_n
    end for
    return y
endif
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
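A compact C99 translation of this pseudocode (a sketch, assuming n is a power of two and keeping the slide's convention ω_n = e^{2πi/n}; the scratch allocations are kept simple rather than efficient, and error checking is omitted):

#include <complex.h>
#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Recursive FFT of a[0..n-1], n a power of two; result written to y[0..n-1]. */
void recursive_fft(const double complex *a, double complex *y, int n)
{
    if (n == 1) { y[0] = a[0]; return; }

    double complex wn = cexp(2.0 * M_PI * I / n);   /* primitive nth root of unity */
    double complex w  = 1.0;

    double complex *a0 = malloc(n/2 * sizeof *a0);  /* even-numbered coefficients */
    double complex *a1 = malloc(n/2 * sizeof *a1);  /* odd-numbered coefficients  */
    double complex *y0 = malloc(n/2 * sizeof *y0);
    double complex *y1 = malloc(n/2 * sizeof *y1);

    for (int i = 0; i < n/2; i++) { a0[i] = a[2*i]; a1[i] = a[2*i + 1]; }

    recursive_fft(a0, y0, n/2);
    recursive_fft(a1, y1, n/2);

    for (int k = 0; k < n/2; k++) {
        y[k]       = y0[k] + w * y1[k];
        y[k + n/2] = y0[k] - w * y1[k];
        w *= wn;
    }
    free(a0); free(a1); free(y0); free(y1);
}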
Iterative Implementation Preferable
• A well-written iterative version performs fewer index computations than the recursive version
• The iterative version evaluates key common subexpressions only once
• It is easier to derive a parallel FFT algorithm when the sequential algorithm is in iterative form
Adapted from slides(and text) of Parallel Programming in C with MPI and
OpenMP by Michael Quinn
Recursive → Iterative (1/4)
We now discuss the derivation of an iterative algorithm, starting from the recursive one.
• Each node below represents an fft function call; the function keeps halving the vector until a scalar is obtained (NOTE: the DFT of a scalar is the scalar itself)
• The value returned by each call is shown after the arrow
Recursive execution of the FFT for the input sequence (1, 2, 4, 3):

    fft(1,2,4,3)        → (10, -3-i, 0, -3+i)
        fft(1,4)        → (5, -3)
            fft(1)      → (1)
            fft(4)      → (4)
        fft(2,3)        → (5, -1)
            fft(2)      → (2)
            fft(3)      → (3)

Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
Recursive → Iterative (2/4)
• Determining which computations are performed by each function invocation
• For each call, the computation has the form

      x + ω·(z)
      x - ω·(z)

  which corresponds to the following statements of the recursive algorithm:

      y[k]       ← y[0][k] + ω · y[1][k]
      y[k + n/2] ← y[0][k] - ω · y[1][k]

For the example (1, 2, 4, 3):

    fft(1,4):      1 + 1·(4) = 5,    1 - 1·(4) = -3                  → (5, -3)
    fft(2,3):      2 + 1·(3) = 5,    2 - 1·(3) = -1                  → (5, -1)
    fft(1,2,4,3):  5 + 1·(5) = 10,   -3 + i·(-1) = -3-i,
                   5 - 1·(5) = 0,    -3 - i·(-1) = -3+i              → (10, -3-i, 0, -3+i)

Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
Recursive → Iterative (3/4)
• This diagram tracks the propagation of data values (input vector at the bottom
and FFT output at the top)
• Permutation stage: Index i of the input vector is replaced by rev(i), where
rev(i) is the binary value of i read in the reverse order (00=>00, 01=>10,
10=>01, 11=>11)
Data values by row (input vector at the bottom, FFT output at the top), for the input (1, 2, 4, 3):

    Output:               10          -3-i            0           -3+i
                       = 5+1·5     = -3+i·(-1)     = 5-1·5     = -3-i·(-1)
    After iteration 1:     5           -3              5            -1
                       = 1+1·4     = 1-1·4         = 2+1·3     = 2-1·3
    After permutation:     1            4              2             3
    Input:                 1            2              4             3

Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
Recursive → Iterative (4/4)
• Initially, the scalars are simply forwarded upwards as the DFT
of a scalar is the scalar itself
• For other stages, computation of the output is performed
using two values forwarded from the previous stage
• The arrows depicting data flow form butterfly patterns
• An iterative algorithm can be deduced from the previous
diagram
• The computation represented in each row (excluding the
bottommost row) corresponds to one iteration of the algorithm
• Hence log(n) iterations should be performed (log(4)=2 in the
previous example)
• For each iteration the algorithm modifies the value of every
index (here n indices)
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
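One possible iterative C version following this scheme, i.e. a bit-reversal permutation followed by log n butterfly passes (a sketch, again assuming n is a power of two and using ω_n = e^{2πi/n}):

#include <complex.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* In-place iterative FFT of y[0..n-1], n a power of two. */
void iterative_fft(double complex *y, int n)
{
    /* Permutation stage: index i is swapped with rev(i), its bit-reversed value. */
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j |= bit;
        if (i < j) { double complex t = y[i]; y[i] = y[j]; y[j] = t; }
    }
    /* log2(n) iterations; in pass m the butterflies span blocks of length m. */
    for (int m = 2; m <= n; m <<= 1) {
        double complex wm = cexp(2.0 * M_PI * I / m);
        for (int k = 0; k < n; k += m) {
            double complex w = 1.0;
            for (int j = 0; j < m/2; j++) {
                double complex t = w * y[k + j + m/2];
                double complex u = y[k + j];
                y[k + j]       = u + t;
                y[k + j + m/2] = u - t;
                w *= wm;
            }
        }
    }
}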
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Stages of Parallel Program Design
• Partition
– Divide problem into tasks
• Communicate
– Determine amount and pattern of
communication
• Agglomerate
– Combine tasks
• Map
– Assign agglomerated tasks to
processors
• Efficiency analysis
Adapted from http://nereida.deioc.ull.es/html/openmp/minnesotatutorial/content_openMP.html
Parallel FFT Program Design
• Domain decomposition
– Associate primitive task with each element of input vector a and
corresponding element of output vector y
• Add channels to handle communications between tasks
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
FFT Task/Channel Graph (n=8)
• Long rounded rectangles represent tasks and arrows indicate communication between processes
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
FFT Task/Channel Graph (n=8) Cont…
Steps:
• Permute the vector as follows: (000=>000, 001=>100, …, 110=>011, 111=>111)
• Perform log(n) iterations (log(8) = 3):
  – stage 1 completed after iteration 1
  – stage 2 completed after iteration 2
  – stage 3 completed after iteration 3
  (vector y after stage 3 gives the output)
NOTE: Vector y will contain the intermediate results of stage 1 and stage 2
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
Diagrammatic Representation of
Profiling Results
Conventions:
    C   represents a function compute(args) that accepts the propagated values and performs the computations x + ω·(z) and x - ω·(z) (refer to slide 33)
    S   represents the MPI_Send(args) call
    R   represents the MPI_Recv(args) call
    P   represents the function permute(args), which internally performs its own MPI_Send(args) calls
    An empty slot represents the time for which the process is idle
http://www.cs.uoregon.edu/research/paracomp/tau/tauprofile/images/petsc/
Diagrammatic Representation of
Profiling Results
Permutation phase and stages 1-3 (simplified timeline for processes P0 … P7):
Each process first takes part in the permutation phase, performing a permute (P) together with the matching sends (S) and receives (R). It then executes stages 1, 2 and 3; in each stage it exchanges values with its butterfly partner (S/R) and computes (C), finally producing the outputs y[0], …, y[7].
NOTE: The diagram is oversimplified to enhance understanding of the butterfly diagram
Agglomeration and Mapping
• Agglomerate primitive tasks associated with contiguous
elements of vector to reduce communication
• Map one agglomerated task to each process
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
After Agglomeration, Mapping
In general, an n-point FFT can be implemented on a multicomputer supporting p processes. In this case, n = 16 and p = 4:
    a[0], a[1], a[2], a[3] → process 1
    a[4], a[5], a[6], a[7] → process 2
    and so on
(Figure: input and output vectors distributed across the four processes)
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
Phases of Parallel FFT Algorithm
• Phase 1: Processes permute a’s (all-to-all
communication)
• Phase 2:
– First log n – log p iterations of FFT
– No message passing is required
• Phase 3:
– Final log p iterations
– Processes organized as logical hypercube
– In each iteration every process swaps values with
partner across a hypercube dimension
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
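To make the phase-3 communication pattern concrete (an illustration only, not a complete parallel FFT: the butterfly arithmetic and twiddle-factor bookkeeping are omitted), each of the final log p iterations pairs every process with the partner whose rank differs in exactly one bit, and the two exchange their n/p values. A minimal MPI sketch under those assumptions:

#include <mpi.h>
#include <complex.h>

/* Phase 3 communication skeleton: log2(p) exchanges across hypercube dimensions.
   block holds this process's n/p values; recv is scratch of the same size.
   The combine step (butterfly computation) is left as a comment. */
void phase3_exchanges(double complex *block, double complex *recv, int blksize)
{
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    for (int dim = 1; dim < p; dim <<= 1) {          /* one iteration per hypercube dimension */
        int partner = rank ^ dim;                    /* rank differing in exactly one bit     */
        MPI_Sendrecv(block, 2 * blksize, MPI_DOUBLE, partner, 0,
                     recv,  2 * blksize, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... combine block[] and recv[] with the appropriate twiddle factors here ... */
    }
}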
Computation Complexity Analysis
• Each process performs equal share of computation
– Sequential complexity: Θ(n log n)
• Hence the complexity of parallel implementation is
Θ(n log n / p)
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
Communication Complexity Analysis
• A maximum of ceil(n / p) elements of the vector are associated with each process
• In the all-to-all communication stage, every process swaps about n/p values with its counterpart
– Time complexity: Θ(n/p log p)
• A total of log p iterations that need communication with
other processes (average n/p swaps)
– Time complexity: Θ(n/p log p)
• Hence the total communication complexity of parallel
implementation is
Θ(n/p log p)
Adapted from slides(and text) of Parallel
Programming in C with MPI and
OpenMP by Michael Quinn
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Parallel Sorting
• Finding a permutation of a sequence [a_1, a_2, …, a_{n-1}] such that a_1 <= a_2 <= … <= a_{n-1}
• Often we sort records based on a key
• A parallel sort results in:
  – partial sequences sorted on all nodes
  – the largest value on node N-1 is smaller than or equal to the smallest value on node N
• Several ways to parallelize:
  – chunk the sequence, sort locally, merge back (bubble sort)
  – project the algorithm structure onto the communication and distribution scheme (quicksort)
Bubble Sort
• The bubble sort is the oldest and simplest sort in use. Unfortunately, it's also the slowest.
• The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.
• The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).
• This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.
The bubble sort is generally considered to be the most inefficient sorting algorithm in common usage. Under best-case conditions (the list is already sorted), the bubble sort can approach a linear O(n) complexity; the general case is O(n^2).
Pros: Simplicity and ease of implementation.
Cons: Extremely inefficient.
Reference
http://math.hws.edu/TMCM/java/xSortLab/
Source
http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/sorting/bubblesort.c
http://www.sci.hkbu.edu.hk
Bubblesort
void sort(int *v, int n)
{
    int i, j, tmp;
    for (i = n-2; i >= 0; i--)
        for (j = 0; j <= i; j++)
            if (v[j] > v[j+1]) {          /* swap out-of-order neighbours */
                tmp = v[j];
                v[j] = v[j+1];
                v[j+1] = tmp;
            }
}
Discussion
• Bubble sort takes time proportional to N·N/2 for N data items
• This parallelization splits the N data items into chunks of N/P, so the time on each of the P processors is now proportional to (N/P · N/P)/2
  – i.e. the local sorting time is reduced by a factor of P·P!
• Bubble sort is much slower than quick sort!
  – Better to run quick sort on a single processor than bubble sort on many processors!
http://www.sci.hkbu.edu.hk
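A minimal MPI sketch of the chunk-and-sort-locally scheme described above (assuming N is divisible by the number of processes, and leaving the final merge of the sorted runs at the root as a comment; this is an illustration, not the code from the linked source):

#include <mpi.h>
#include <stdlib.h>

/* Local bubble sort, as on the earlier slide. */
static void bubble_sort(int *v, int n)
{
    for (int i = n - 2; i >= 0; i--)
        for (int j = 0; j <= i; j++)
            if (v[j] > v[j+1]) { int t = v[j]; v[j] = v[j+1]; v[j+1] = t; }
}

/* data (significant on root only) has n elements, n divisible by the number of processes. */
void parallel_bubble_sort(int *data, int n, int root)
{
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int chunk = n / p;
    int *local = malloc(chunk * sizeof *local);

    MPI_Scatter(data, chunk, MPI_INT, local, chunk, MPI_INT, root, MPI_COMM_WORLD);
    bubble_sort(local, chunk);            /* each process sorts its N/P items */
    MPI_Gather(local, chunk, MPI_INT, data, chunk, MPI_INT, root, MPI_COMM_WORLD);
    /* ... root then merges the p sorted runs in data[] to finish the sort ... */
    free(local);
}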
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Merge Sort
• The merge sort splits the list to be sorted into two equal halves, and places them in separate arrays.
• Each array is recursively sorted, and then merged back together to form the final sorted list.
• Like most recursive sorts, the merge sort has an algorithmic complexity of O(n log n).
• Elementary implementations of the merge sort make use of three arrays: one for each half of the data set and one to store the sorted list in. The algorithm below merges back into the original array, so only two arrays are required. There are non-recursive versions of the merge sort, but they don't yield any significant performance enhancement over the recursive algorithm on most machines.
Pros: Marginally faster than the heap sort for larger sets.
Cons: At least twice the memory requirements of the other sorts; recursive.
Reference
http://math.hws.edu/TMCM/java/xSortLab/
Source
http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/sorting/mergesort.c
Merge Sort
[cdekate@celeritas sort]$ mpiexec -np 4 ./mergesort
1000000; 4 processors; 0.250000 secs
[cdekate@celeritas sort]$
Mergesort
void msort(int *A, int min, int max)
{
    int *C;                                  /* holds the merged run returned by merge() */
    int mid = (min + max) / 2;
    int lowerCount = mid - min + 1;
    int upperCount = max - mid;

    /* If the range consists of a single element, it's already sorted */
    if (max <= min)
        return;

    /* Otherwise, sort the first half */
    msort(A, min, mid);
    /* Now sort the second half */
    msort(A, mid + 1, max);
    /* Now merge the two halves and copy the result back into A */
    C = merge(A + min, lowerCount, A + mid + 1, upperCount);
    for (int i = 0; i < lowerCount + upperCount; i++)
        A[min + i] = C[i];
    free(C);                                 /* needs <stdlib.h> */
}
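The merge() helper is defined in the linked mergesort.c; a minimal stand-in consistent with how it is called above (allocating and returning the merged run) could look like this:

#include <stdlib.h>

/* Merge two sorted runs a[0..na-1] and b[0..nb-1] into a newly allocated array. */
int *merge(const int *a, int na, const int *b, int nb)
{
    int *c = malloc((na + nb) * sizeof *c);
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        c[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) c[k++] = a[i++];
    while (j < nb) c[k++] = b[j++];
    return c;
}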
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Heap Sort
• The heap sort is the slowest of the O(n log n) sorting algorithms, but unlike the merge
and quick sorts it doesn't require massive recursion or multiple arrays to work. This
makes it the most attractive option for very large data sets of millions of items.
• The heap sort works as its name suggests:
  1. It begins by building a heap out of the data set,
  2. then removes the largest item and places it at the end of the sorted array.
  3. After removing the largest item, it reconstructs the heap, removes the largest remaining item, and places it in the next open position from the end of the sorted array.
  4. This is repeated until there are no items left in the heap and the sorted array is full.
Elementary implementations require two arrays - one to hold the heap and the other to hold
the sorted elements.
To do an in-place sort and save the space the second array would require, the
algorithm below "cheats" by using the same array to store both the heap and the
sorted array. Whenever an item is removed from the heap, it frees up a space at
the end of the array that the removed item can be placed in.
Pros: In-place and non-recursive, making it a good choice for extremely large data
sets.
Cons: Slower than the merge and quick sorts.
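An in-place, non-recursive heap sort along the lines described above (a sketch; the linked heapsort.c may differ in its details):

/* Sift the element at index i down through the heap formed by the first n elements of v. */
static void sift_down(int *v, int i, int n)
{
    for (;;) {
        int largest = i, l = 2*i + 1, r = 2*i + 2;
        if (l < n && v[l] > v[largest]) largest = l;
        if (r < n && v[r] > v[largest]) largest = r;
        if (largest == i) return;
        int t = v[i]; v[i] = v[largest]; v[largest] = t;
        i = largest;
    }
}

void heap_sort(int *v, int n)
{
    /* 1. Build a max-heap out of the data set. */
    for (int i = n/2 - 1; i >= 0; i--)
        sift_down(v, i, n);
    /* 2.-4. Repeatedly move the largest item to the end and re-heapify the rest. */
    for (int end = n - 1; end > 0; end--) {
        int t = v[0]; v[0] = v[end]; v[end] = t;
        sift_down(v, 0, end);
    }
}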
Reference
http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/heapsort.html
Source
http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/heapsort/heapsort.c
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Quick Sort
• The quick sort is an in-place, divide-and-conquer, massively recursive sort.
• Divide-and-conquer algorithms solve (conquer) problems by dividing them into smaller sub-problems until the problem is so small that it is trivially solved.
• In-place sorting algorithms don't require additional temporary space to store elements as they sort; they use the space originally occupied by the elements.
• Quicksort takes time proportional to N·N for N data items in the worst case, usually N·log N, but most of the time much faster.
• Constant communication cost of 2·N data items:
  – for 1,000,000 items, must send/receive 2·1,000,000 from/to the root
• In general, processing/communication is proportional to N·log2 N / (2·N) = (log2 N)/2:
  – for 1,000,000 items, N·log2 N ≈ 1,000,000·20, so only 20/2 = 10 times as much processing as communication
• This suggests speedup can only be obtained, with this parallelization, for very large N.
Reference
http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/qsort.html
Source
http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/qsort/qsort.c
http://www.sci.hkbu.edu.hk
Quick Sort
• The recursive algorithm consists of four steps (which closely resemble the
merge sort):
1. If there is one element or fewer in the array to be sorted, return immediately.
2. Pick an element in the array to serve as a "pivot" point. (Usually the left-most
element in the array is used.)
3. Split the array into two parts - one with elements larger than the pivot and the
other with elements smaller than the pivot.
4. Recursively repeat the algorithm for both halves of the original array.
• The efficiency of the algorithm is strongly affected by which element is chosen as the pivot point.
• The worst-case efficiency of the quick sort, O(n^2), occurs when the list is already sorted and the left-most element is chosen.
• If the data to be sorted isn't random, randomly choosing a pivot point is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected algorithmic complexity of O(n log n).
Pros: Extremely fast.
Cons: Very complex algorithm, massively recursive
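A compact in-place version of these four steps, pivoting on the left-most element as described (a sketch for illustration, not the linked qsort.c):

/* In-place quick sort of v[lo..hi], pivoting on the left-most element. */
void quick_sort(int *v, int lo, int hi)
{
    if (lo >= hi) return;                 /* step 1: one or fewer elements      */
    int pivot = v[lo];                    /* step 2: left-most element as pivot */
    int i = lo, j = hi;
    while (i < j) {                       /* step 3: split around the pivot     */
        while (i < j && v[j] >= pivot) j--;
        v[i] = v[j];
        while (i < j && v[i] <= pivot) i++;
        v[j] = v[i];
    }
    v[i] = pivot;
    quick_sort(v, lo, i - 1);             /* step 4: recurse on both parts      */
    quick_sort(v, i + 1, hi);
}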
http://www.sci.hkbu.edu.hk
Summary : Material for the Test
• Discrete Fourier Transform:
Slides 24-26
• Fast Fourier Transform (FFT): Slides 30-40
• Parallel FFT:
Slides 41-52