Slides - Applied Parallel Computing

18.337: Image Median Filter
Rafael Palacios
Visiting Professor, Department of Aeronautics and Astronautics
(IIT - Institute for Research in Technology, Universidad Pontificia Comillas, Madrid, Spain)
MEDIAN FILTER
Median Filter
Median filter algorithm
• The median filter is a nonlinear operation for noise
reduction (dust or spikes).
• It eliminates noise while preserving edges.
• It assigns to each pixel the median value of its neighborhood,
at a cost of roughly n*ns*log(ns) for n pixels and a neighborhood of ns samples.
• Matlab function:
– C=medfilt2(cn); % 3x3 neighborhood
– C=medfilt2(cn,[r c]); % rxc neighborhood
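A minimal, self-contained sketch of this per-channel use of medfilt2 (the demo image peppers.png and the variable names are illustrative assumptions, not the images used in these experiments):

c = imread('peppers.png');            % demo RGB image shipped with Matlab
cn = imnoise(c,'salt & pepper');      % add salt & pepper noise (default density 0.05)
C = cn;                               % preallocate output with the same size and type
for k = 1:size(cn,3)                  % medfilt2 works on 2D matrices, so filter each channel
    C(:,:,k) = medfilt2(cn(:,:,k));   % default 3x3 neighborhood
end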
MATRIX PREPARATION
Size adjustment
• Original image: 1024x1600x3 (5 MB)
• Resized to: 2048x3200x3 (20 MB)
Noise added
cn=imnoise(c,'salt & pepper');
EXPERIMENTAL RESULTS
Sensitivity to image size
(plot: computing time grows approximately as O(n) with image size)
Sensitivity to neighborhood size
(plot: computing time vs. neighborhood size; the trend was unexpected)
Basic experiments
• Original matrix size: 2048x3200x3 = 20 MB
• Matrix sizes: n = [20 MB, 80 MB, 320 MB, 1280 MB] → x4 steps
• Neighborhood sizes: nn = [3 5 9 17 33 65] → 2^k + 1 neighborhoods
• Partitioning strategies:
Computer systems
• Dell (Xeon 2.67 GHz, 8 MB L3, 12 GB DDR3 1066 MHz)
– Matlab single core
– Matlab parallel toolbox
– Matlab with pMatlab
• Cluster (beagle, beowulf)
– MPI
SINGLE-CORE RESULTS
Matlab Single-Core
PARALLEL COMPUTING TOOLBOX
Matlab Multi-Core
• Parallel Computing Toolbox using spmd
• Image size = 80 MB, neighborhood = 65
(plot: worker time matches the prediction)
Matlab Multi-Core
• With spmd there is an overhead of 1.5 s for the
80 MB matrix (transfer rate ≈ 200 MB/s).
• There are no memory conflicts because each
lab works on its own copy of the image.
• Parallelization by rows or by columns is
equivalent (see the sketch below).
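A minimal sketch of this kind of spmd column partitioning (it uses medfilt2 per channel rather than the myfilterP helper from the backup slides, and it ignores the neighborhood overlap at block boundaries):

spmd
    % each lab (worker) takes one block of columns from its own copy of the image
    cols = round(linspace(0, size(cn,2), numlabs+1));
    block = cn(:, cols(labindex)+1:cols(labindex+1), :);
    for k = 1:size(block,3)              % filter each color channel of the block
        block(:,:,k) = medfilt2(block(:,:,k));
    end
end
C = cat(2, block{:});   % gather the Composite blocks back into one image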
Matlab Multi-Core
• 8-core computer with slower memory
• 2x Xeon Quad 2.26 GHz, 8 GB 667 MHz
(plot: more overhead than on the previous system)
pMATLAB
pMatlab
• Allows running Matlab in parallel by launching
several Matlab processes that communicate
using MPI.
• Communications are transparent to the user,
since pMatlab uses a distributed-matrix
approach.
How it works
• Several Matlab processes are started.
• The leader process loads the image and writes
it into the distributed matrix X.
• Each subprocess receives its corresponding
section of the image in X.
• Each subprocess applies the median filter and
stores its result in Y.
• The leader process aggregates the results (see the sketch below).
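A condensed sketch of this flow, assembled from the backup-slide code (n, m, a, Np and Pid follow the backup slides; the medfilt2 call and the final agg aggregation are assumptions):

mapL = map([1 1],{},0);           % leader-only map: Pid 0 owns all data
mapM = map([1 Np],{},0:Np-1);     % distributed map: each Pid owns a block of columns
XL = zeros(n,m,mapL);             % leader-owned copy of the image
X  = zeros(n,m,mapM);             % distributed input
Y  = zeros(n,m,mapM);             % distributed output
if Pid==0, XL(:,:) = a; end       % only the leader loads the image (matrix a)
X(:,:) = XL;                      % writing X sends each section to its owner
Y = put_local(Y, medfilt2(local(X)));   % each process filters its local section
%the leader can then aggregate the distributed result, e.g. res = agg(Y);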
Results
• Computing time does not decrease significantly using double.
• It scales well using uint8 → less data to be moved.
(plots: speedup with double vs. uint8)
Testing remarks
• Initially the pMatlab algorithm was
implemented using 2D double matrices.
– Filtering was performed in three steps (R, G, B).
– The conversion to double multiplied the size of
the matrices by 8 (affecting communications).
• The final implementation used 3D uint8
matrices.
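A small sketch of the size difference behind this change (the 4096x6400x3 dimensions are an assumption matching the ~80 MB test matrix):

a8 = zeros(4096,6400,3,'uint8');   % ~78.6 MB: 1 byte per element
ad = double(a8);                   % ~629 MB: 8 bytes per element
whos a8 ad                         % compare the Bytes column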
CONCLUSION
Conclusion
• Performance may depend on the algorithm more than
on parallelization (5x5 neighborhood).
• Matlab's Parallel Computing Toolbox does not use
shared memory.
• The Parallel Computing Toolbox uses a lot of memory and
communication, because the whole matrix is
propagated to all workers.
– The algorithm was implemented with spmd.
– It is possible to use distributed matrices to improve this.
– It is possible to use sliced variables in parfor loops (see the sketch below).
• pMatlab uses memory efficiently.
• The MPI version was not developed.
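As a hedged illustration of the sliced-variable idea mentioned above (myfilter, the block count and the column split are assumptions; boundary overlap between blocks is ignored):

nBlocks = 4;
cols = round(linspace(0, size(cn,2), nBlocks+1));
blocks = cell(1, nBlocks);
for k = 1:nBlocks
    blocks{k} = cn(:, cols(k)+1:cols(k+1), :);   % pre-slice so parfor sends only one block per worker
end
out = cell(1, nBlocks);
parfor k = 1:nBlocks
    out{k} = myfilter(blocks{k});   % blocks and out are sliced variables; myfilter is hypothetical
end
C = cat(2, out{:});                 % reassemble the filtered image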
Conclusion
• Speedup comparison (chart: pMatlab using double vs. pMatlab using uint8)
pMatlab (3D uint8), 320 MB
This slide shows the effect of data transfer on a 320 MB image matrix.

          pMatlab                    Toolbox
          total time (s)  speedup    total time (s)  speedup
1 core    138.8           1.0        132             1.0
2 cores   71.6            1.9        72.1            1.8
4 cores   40.5            3.4        46.1            2.9
• For larger sizes, the impact of latencies is reduced (computing time and
transmission time are linear with size).
• Speedup is almost perfect in pMatlab, but worse in the Toolbox.
• The amount of data that needs to be sent increases asymptotically to 320 MB in
the case of pMatlab, whereas it increases linearly with the number of processors in
the case of the Parallel Computing Toolbox.
BACKUP SLIDES
Parallel computing toolbox: memory issues

%Activate parallel computing
%matlabpool(4)
…
tic
%Create workers; each one filters its part of the image
spmd
    c = myfilterP(a, labindex, numlabs);
end
toc
%Gather results from the workers (inefficient memory allocation)
result = [];
for ii = 1:length(c)
    result = [result, c{ii}];
end
toc

%Alternative: one explicit branch per worker
spmd(4)
    if labindex==1
        c = myfilterP(a1);
    end
    if labindex==2
        c = myfilterP(a2);
    end
    if labindex==3
        c = myfilterP(a3);
    end
    if labindex==4
        c = myfilterP(a4);
    end
end
%Close parallel computing
%matlabpool close

Both versions produce the same result.
pMatlab: sending initial data to clients

PARALLEL = 1;
if (PARALLEL)
    …
    %Create map for XL. The leader process owns all data.
    mapL = map([1 1],{},0);
    %Create map for the distributed matrices X and Y.
    %Each processor gets a set of columns.
    mapM = map([1 Np],{},0:Np-1);
else
    mapL = 1;
    mapM = 1;
end
%Create matrices XL, X and Y
XL = zeros(n,m,mapL);   %owned by Pid 0
X  = zeros(n,m,mapM);   %distributed input
Y  = zeros(n,m,mapM);   %distributed output
if Pid==0   %only the leader process performs the initialization
    load input_matrix
    XL(:,:) = a;        %all data stored in Pid 0
end
%Only the leader process has a non-empty XL, so only the leader writes data to X.
%Writing to X involves sending data to the subprocesses, since different chunks
%of X belong to different Pids.
X(:,:) = XL;
%Get the local part in a standard double matrix. It is faster to work with local matrices.
Xloc = local(X);
%code
%code
%After obtaining the resulting matrix res, store it in the distributed matrix Y.
Y = put_local(Y,res);
pMatlab (double)

          computing (s, %)   comm (s, %)    total time (s)  speedup
1 core    34.7 (93.8%)       2.3 (6.2%)     37              1.0
2 cores   18.2 (75.8%)       5.8 (24.2%)    24              1.5
4 cores   8.4 (52.5%)        7.6 (47.5%)    16              2.3
• More data transfer occurs with 4 cores (75% of the matrix) than with 2 cores (50% of the
matrix is copied back and forth). Results are consistent.
• The conversion from uint8 to double penalizes the pMatlab tests: the 80 MB image
matrix is in fact 630 MB in double format.
pMatlab (3D uint8)

          computing (s, %)   comm (s, %)    total time (s)  speedup
1 core    32.6 (98.8%)       0.4 (1.2%)     33              1.0
2 cores   17 (90.9%)         1.7 (9.1%)     18.7            1.8
4 cores   9 (81.8%)          2 (18.2%)      11              3.0
• Times are smaller.
• Speedup is better because communication delays do not penalize as much.