x - Boston University

advertisement
Tuning MATLAB for Better
Performance
Kadin Tseng
Scientific Computing and Visualization, IS&T
Boston University
Fall 2010
1
Topics Covered
1. Performance Issues
1.1 Memory Allocations
1.2 Vector Representations
1.3 Compiler
1.4 Other Considerations
2. Multiprocessing MATLAB
Fall 2010
2
1. Performance Issues
1.1
1.2
1.3
1.4
Fall 2010
Memory Access
Vector Representations
Compiler
Other Considerations
3
1.1 Memory Access
Memory access patterns often affect computational
performance. Some effective ways to enhance
performance in MATLAB :
 Allocate array memory before using it
 For-loops Ordering
 Compute and save array in-place where applicable.
Fall 2010
4
MATLAB Memory Allocation Issues
• MATLAB arrays are allocated in contiguous address
space ( for efficiency, as dictated by Lapack).
• Arrays allocated on-the-fly. Problematic in a large
for-loop.
• MATLAB is both pass-by-value and pass-byreference (“lazy copy”).
Fall 2010
5
How Does MATLAB Allocate Arrays ?
• MATLAB arrays are allocated in contiguous
address space.
Memory Array
Address element
Without pre-allocation
for i=1:4
x(i) = i;
end
Fall 2010
1
x(1)
…
...
2000
x(1)
2001
x(2)
2002
x(1)
2003
x(2)
2004
x(3)
...
...
10004
x(1)
10005
x(2)
10006
x(3)
10007
x(4)
6
How … Arrays ? Examples
• MATLAB arrays are allocated in contiguous
address space.
n=5000;
tic
for i=1:n
x(i) = i^2;
end
toc
Wallclock time = 0.00046 seconds
n=5000; x = zeros(n,1);
tic
for i=1:n
x(i) = i^2;
end
toc
Wallclock time = 0.00004 seconds
The timing data are recorded on Katana. The actual
times may vary depending on the processor.
Fall 2010
7
Passing Arrays Into A Function
MATLAB uses pass-by-reference if passed array is
used as is; a copy will be made if the array is
modified. MATLAB calls it “lazy copy.”
function y = mycopy(A, x, b, change)
If change, A(2,3) = 23; end
% change forces a local copy of a
y = A*x + b;
% use x and b directly from calling program
pause(2)
% keep memory longer to see it
On Windows, can use Task Manager to monitor
memory allocation history.
>> n = 5000; A = rand(n); x = rand(n,1); b = rand(n,1);
>> y = mycopy(A, x, b, 0);
>> y = mycopy(A, x, b, 1);
Fall 2010
8
For-loop Ordering
 Best if inner-most loop is for array left-most
index, etc.
 For a multi-dimensional array, x(i,j), the 1D
representation of the same array, x(k), follows
column-wise order and inherently possesses the
contiguous property
n=5000; x = zeros(n);
for i=1:n
% rows
for j=1:n % columns
x(i,j) = i+(j-1)*n;
end
end
n=5000; x = zeros(n);
for j=1:n
% columns
for i=1:n % rows
x(i,j) = i+(j-1)*n;
end
end
Wallclock time = 0.88 seconds
Wallclock time = 0.48 seconds
Fall 2010
for i=1:n*n
x(i) = i;
end
x = 1:n*n;
9
Compute In-place
 Compute and save array in-place improves
performance
x = rand(5000);
tic
y = x.^2;
toc
x = rand(5000);
tic
x = x.^2;
toc
Wallclock time = 0.30 seconds
Wallclock time = 0.11 seconds
Caveat:
May not be worthwhile if it involves data type or
size change …
Fall 2010
10
Other Considerations
 Generally, better to use function instead of script
m-file
 Script m-file is loaded into memory and evaluate one line
at a time. Subsequent uses require reloading.
 Function m-file is compiled into a pseudo-code and is
loaded on first application. Subsequent uses of the
function will be faster without reloading.
 Function is modular; self cleaning; reusable.
 Global variables are expensive; difficult to track.
 Physical memory is much faster than virtual mem.
 Avoid passing large matrices to a function and
modifying only a handful of elements. (struc and
cell are exceptions)
Fall 2010
11
Other Considerations (cont’d)
 load and save are efficient to handle whole data
file; textscan is more memory-efficient to extract
text meeting specific criteria.
 Don’t reassign array that results in change of data
type or shape.
 Limit m-files size and complexity.
 Computationally intensive jobs often require large
memory …
 Structure of array more memory-efficient than
array of structures.
Fall 2010
12
Memory Management
Maximize memory availability.
 32-bit systems < 2 or 3 GB
 64-bit systems running 32-bit MATLAB < 4GB
 64-bit systems running 64-bit MATLAB < 8TB
(16GB on some Katana nodes)
 Minimize memory usage.
(Details to follow …)

Fall 2010
13
Minimize Memory Usage


Fall 2010
Use clear, pack or other memory saving means when
possible. If double precision (default) is not required, use
of ‘single’ data type could save substantial amount of
memory. For example,
>> x=ones(10,'single'); y=x+1; % y inherits single from x
Use sparse to save memory
>> n=5000; A = zeros(n); A(3,2) = 1; B = ones(n);
>> C = A*B;
>> As = sparse(A);
>> Cs = As*B; % it can save time for low density
>> A2 = sparse(n,n); A2(3,2) = 1;
14
Minimize Memory Usage (Cont’d)
Use “matlab –nojvm …” saves lots of memory.
Memory usage query
For Unix:
>> unix('ps aux | grep $USER | grep –m 1 MATLAB | …
awk ''{print $5“k”}''')
% only “k” double quoted
Katana% top
For Windows:
>> m = feature('memstats'); % largest contiguous free block
Use MS Windows Task Manager to monitor memory allocation.
 Distribute memory among multiprocessors via MATLAB
Parallel Computing Toolbox.


Fall 2010
15
Special Functions for Real Numbers
MATLAB provides a few functions for processing real,
noncomplex, data specifically. These functions are more
efficient than their generic versions:
 realpow – power for real numbers
 realsqrt – square root for real numbers
 reallog – logarithm for real numbers
 realmin/realmax – min/max for real numbers
n = 1000; x = 1:n;
x = x.^2;
tic
x = sqrt(x);
toc
n = 1000; x = 1:n;
x = x.^2;
tic
x = realsqrt(x);
toc
Wallclock time = 0.00022 seconds
Wallclock time = 0.00004 seconds
• isreal reports whether the array is real
• single/double converts data to single-/double-precision
Fall 2010
16
Vectorization
 MATLAB is designed for vector and matrix
operations. The use of for loop, in general, can be
expensive, especially if the loop count is large or
nested.
 Without array pre-allocation, for-loops are very
costly.
 From a performance standpoint, in general, vector
representation should be used in place of forloops whenever reasonable.
i = 0;
for t = 0:.01:100
i = i + 1;
y(i) = sin(t);
end
t = 0:.01:100;
y = sin(t);
Wallclock time = 0.1069 seconds
Wallclock time = 0.0007 seconds
Fall 2010
17
Vector Manipulations of Arrays
>> A = magic(3) % define a 3x3 matrix A
A=
8
1
6
3
5
7
4
9
2
>> B = A^2;
% B = A * A;
>> C = A + B;
>> b = 1:3 % define b as a 1x3 row vector
b=
1
2
3
>> [A, b'] % Add b transpose as a 4th column to A
ans =
8
1
6
1
3
5
7
2
4
9
2
3
Fall 2010
18
Vector Manipulations of Arrays
>> [A; b] % Add b as a 4th row to A
ans =
8
1
6
3
5
7
4
9
2
1
2
3
>> A = zeros(3) % zeros generates 3*3 array of 0’s
A=
0
0
0
0
0
0
0
0
0
>> B = 2*ones(2,3) % ones generates 2 * 3 array of 1’s
B=
2
2
2
2
2
2
Alternatively,
>> B = repmat(2,2,3) % matrix replication
Fall 2010
19
Vector Manipulations of Arrays
>> y = (1:5)’;
>> n = 3;
>> B = y(:, ones(1,n))
% B = y(:, [1 1 1]) or B=[y y y]
B=
1
1
1
2
2
2
3
3
3
4
4
4
5
5
5
Again, B can be generated via repmat as
>> B = repmat(y, 1, 3);
Fall 2010
20
Vector Manipulations of Arrays
>> A = magic(3)
A=
8
1
6
3
5
7
4
9
2
>> B = A(:, [1 3 2]) % switch 2nd and third columns of A
B=
8
6
1
3
7
5
4
2
9
>> A(:, 2) = [ ] % delete second column of A
A=
8
6
3
7
4
2
Fall 2010
21
Vector Representation Example
n = 3000; x = zeros(n);
for j=1:n
for i=1:n
x(i,j) = i+(j-1)*n;
x(i,j) = x(i,j)^2;
end
end
n = 3000;
i = (1:n) ';
I = repmat(i,1,n); % replicate along j
x = I + (I'-1)*n;
x = x.^2;
Wallclock time = 0.19 seconds
Wallclock time = 0.49 seconds
Notes on the vector form of the computations :
•
To eliminate the for-loops, all values of i and j must be made available at
once. The ones or repmat utilities can be used to replicate rows or columns.
In this special case, J = I' is used to save computations and memory.
•
Often, there are trade-offs between efficiency and memory using the vector
form. Here, the creation of I adds to the memory and compute time.
However, as more works can be leveraged against I, efficiency improves.
•
Added memory could push memory usage closer to the physical ram limit.
Once virtual memory (swap space) is required, performance degrades.
Fall 2010
22
Functions Useful in Vectorizing
Fall 2010
Function
Description
all
Test to see if all elements are of a prescribed value
any
Test to see if any element is of a prescribed value
zeros
Create array of zeroes
ones
Create array of ones
repmat
Replicate and tile an array
find
Find indices and values of nonzero elements
diff
Find differences and approximate derivatives
squeeze
Remove singleton dimensions from an array
prod
Find product of array elements
sum
Find the sum of array elements
cumsum
Find cumulative sum
shiftdim
Shift array dimensions
logical
Convert numeric values to logical
sort
Sort array elements in ascending /descending order
23
Laplace Equation
Laplace Equation:
u u
 2 0
2
x y
2
2
(1)
Boundary Conditions:
u ( x,0)  sin (x)
0 x1
u ( x,1)  sin (x)e
u (0, y)  u (1, y)  0
x
0 x1
(2)
0 y1
Analytical solution:
u( x, y)  sin (x)e  xy
Fall 2010
0  x  1; 0  y  1
(3)
24
Discrete Laplace Equation
Discretize Equation (1) by centered-difference yields:
uin, 1
j 
uin1,j  uin1,j  ui,jn 1  ui,jn 1
4
i  1,2, ,m; j  1,2, ,m
(4)
where n and n+1 denote the current and the next time step,
respectively, while
ui,jn  u n(xi ,y j )
i  0,1,2, ,m  1; j  0,1,2, ,m  1
(5)
 u n(ix, jy )
For simplicity, we take
x   y 
Fall 2010
1
m1
25
Computational Domain
y, j
u(x,1)  sin(x)e x
x, i
u(1,y)  0
u(0,y)  0
u(x,0)  sin(x)
uin, j 1

Fall 2010
n
n
uin1,j  uin1,j  ui,j
 1  ui,j 1
4
i  1,2,  ,m;
j  1,2,  ,m
26
Five-point Finite-Difference Stencil
Interior cells.
x
Where solution of the Laplace
equation is sought.
x o x
x
(i, j)
Exterior cells.
Green cells denote cells where
homogeneous boundary
conditions are imposed while
non-homogeneous boundary
conditions are colored in blue.
Fall 2010
x
x o x
x
27
SOR Update Function
How do you vectorize it ?
1. Remove the for loops
2. Define i = ib:2:ie;
3. Define j = jb:2:je;
4. Use sum for del
% original code fragment
jb = 2; je = n+1; ib = 3; ie = m+1;
for i=ib:2:ie
for j=jb:2:je
up = ( u(i ,j+1) + u(i+1,j ) + ...
u(i-1,j ) + u(i ,j-1) )*0.25;
u(i,j) = (1.0 - omega)*u(i,j) +omega*up;
del = del + abs(up-u(i,j));
end
end
% vector code fragment
jb = 2; je = n+1; ib = 3; ie = m+1;
i = ib:2:ie; j = jb:2:je;
up = ( u(i ,j+1) + u(i+1,j ) + ...
u(i-1,j ) + u(i ,j-1) )*0.25;
u(i,j) = (1.0 - omega)*u(i,j) + omega*up;
del = del + sum(sum(abs(up-u(i,j))));
Fall 2010
More efficient way ?
28
Solution Contour Plot
Fall 2010
29
SOR Update Function (4)
m
Matrix size
Fall 2010
Wallclock
ssor2Dij
Wallclock
ssor2Dji
Wallclock
ssor2Dv
for loops reverse loops
vector
128
1.01
0.98
0.26
256
8.07
7.64
1.60
512
65.81
60.49
11.27
1024
594.91
495.92
189.05
30
Integration Example
• An integration of the cosine function between 0 and π/2
• Integration scheme is mid-point rule for simplicity.
• Several parallel methods will be demonstrated.
p
b
n
 cos(x)dx   
a
i 1 j 1
Worker 1
aij  h
aij
 n

h
 cos(aij  2 )h
cos(x)dx 

i 1 
 j 1
p

mid-point of increment
Worker 2
cos(x)
h
Worker 3
a = 0; b = pi/2; % range
m = 8; % # of increments
h = (b-a)/m; % increment
p = numlabs;
n = m/p; % inc. / worker
ai = a + (i-1)*n*h;
aij = ai + (j-1)*h;
Worker 4
x=a
Fall 2010
x=b
31
Integration Example — the Kernel
n
function intOut = Integral(fct, a, b, n)
%function intOut = Integral(fct, a, b, n)
% performs mid-point rule integration of "fct"
% fct -- integrand (cos, sin, etc.)
% a -- starting point of the range of integration
% b –- end point of the range of integration
% n -- number of increments
% Usage example: >> Integral(@cos, 0, pi/2, 500)

j 1
cos(aij  h2 )h
% 0 to pi/2
h = (b – a)/n;
% increment length
intOut = 0.0;
% initialize integral
for j=1:n
% sum integrals
aij = a +(j-1)*h;
% function is evaluated at mid-interval
intOut = intOut + fct(aij+h/2)*h;
end
Vector form of the function:
function intOut = Integral(fct, a, b, n)
h = (b – a)/n;
aij = a + (0:n-1)*h;
intOut = sum(fct(aij+h/2))*h;
Fall 2010
32
Integration Example — Serial Integration
% serial integration
tic
m = 10000;
a = 0;
b = pi*0.5;
intSerial = Integral(@cos, a, b, m);
toc
Fall 2010
33
Integration Example Benchmarks
increment
For loop
vector
10000
0.011
0.0002
20000
0.022
0.0004
40000
0.044
0.0008
80000
0.087
0.0016
160000
0.1734
0.0033
320000
0.3488
0.0069
 Timings (seconds) obtained on a quad-core Xeon X5570
 Computation linearly proportional to # of increments.
 FORTRAN and C timings are an order of magnitude faster.
Fall 2010
34
Compiler
A MATLAB compiler, mcc, is available.
 It compiles m-files into C codes, object libraries, or
stand-alone executables.
 A stand-alone executable generated with mcc can
run on compatible platforms without an installed
MATLAB or a MATLAB license.
 Many MATLAB general and toolbox licenses are
available. Infrequently, MATLAB access may be
denied if all licenses are checked out. Running a
stand-alone requires NO licenses and no waiting.
Fall 2010
35
Compiler (Cont’d)
 Some compiled codes may run more efficiently than
m-files because they are not run in interpretive
mode.
 A stand-alone enables you to share it without
revealing the source.
www.bu.edu/tech/research/training/tutorials/matlab/vector/miscs/compiler/
Fall 2010
36
Compiler (Cont’d)
How to build a standalone executable
>> mcc –o ssor2Dijc –m ssor2Dij
How to run ssor2Dijc on Katana
Katana% run_ssor2Dijc.sh /usr/local/apps/matlab_2009b 256 256
Input arguments
MATLAB root
Details:
• The m-file is ssor2Dij.m
• Input arguments to code are processed as strings by mcc. Convert
with str2num if need be. ssor2Dij.m requires 2 inputs; m, n
if isdeployed, m=str2num(m); end
• Output cannot be returned; either save to file or display on screen.
• The executable is ssor2Dijc
• run_ssor2Dijc.sh is the run script generated by mcc.
• None of SOR codes benefits, in runtime, from mcc.
Fall 2010
37
MATLAB Programming Tools
 profile - profiler to identify “hot spots” for
performance enhancement.
 mlint - for inconsistencies and suspicious
constructs in M-files.
 debug - MATLAB debugger.
 guide - Graphical User Interface design tool.
Fall 2010
38
MATLAB profiler
To use profile viewer, DONOT start MATLAB with –nojvm option
>> profile on –detail 'builtin' –timer 'real'
>> % run code to be
Turns on profiler. –detail 'builtin'
>> % profiled here
enables MATLAB builtin functions;
>> %
-timer 'real' reports wallclock time.
>> %
>> profile viewer % view profiling data
>> profile off
% turn off profiler
Profiling example.
>> profile on
>> ssor2Dij % profiling the SOR Laplace solver
>> profile viewer
>> profile off
Fall 2010
39
How to Save Profiling Data
Two ways to save profiling data:
1. Save into a directory of HTML files
Viewing is static, i.e., the profiling data
displayed correspond to a prescribed set of
options. View with a browser.
2. Saved as a MAT file
Viewing is dynamic; you can change the options.
Must be viewed in the MATLAB environment.
Fall 2010
40
Profiling – save as HTML files
Viewing is static, i.e., the profiling data displayed
correspond to a prescribed set of options. View with
a browser.
>> profile on
>> plot(magic(20))
>> profile viewer
>> p = profile('info');
>> profsave(p, ‘my_profile') % html files in my_profile dir
Fall 2010
41
Profiling – save as MAT file
Viewing is dynamic; you can change the options.
Must be viewed in the MATLAB environment.
>>
>>
>>
>>
>>
>>
>>
>>
Fall 2010
profile on
plot(magic(20))
profile viewer
p = profile('info');
save myprofiledata p
clear p
load myprofiledata
profview(0,p)
42
MATLAB “grammar checker”
 mlint is used to identify coding
inconsistencies and make coding
performance improvement
recommendations.
 mlint is a standalone utility; it is an option in
profile.
 MATLAB editor provides this feature.
 Debug mode can also be invoked through
editor.
Fall 2010
43
Running MATLAB in Command
Line Mode and Batch
matlab -nodisplay –nosplash –r myfile
Add –nojvm if Java is not needed to save memory
For batch jobs on Katana, use the above command in the
batch script.
Visit http://www.bu.edu/tech/research/computation/linuxcluster/katana-cluster/runningjobs/ for instructions on
running batch jobs.
Fall 2010
44
Comment Out Block Of Statements
Often in code debugging, one wants to comment out an
entire block of statements. Three convenient ways to do it:
1. %{
n = 3000;
x = rand(n);
%}
2. Select the statement block with the mouse, then press
the control key along with the key “r”, MATLAB prepends
the “%” key to each line of the selected block. Uncomment
by first select the statement block followed by Ctrl “t”.
3. if 0
n = 3000;
x = rand(n);
end
Fall 2010
45
Useful SCV Info
Please help us do better in the future by participating in a
quick survey:
http://scv.bu.edu/survey/fall10tut_survey.html
• SCV home page (http://scv.bu.edu/)
• Resource Applications (https://acct.bu.edu/SCF)
• Help
– Web-based tutorials (http://scv.bu.edu/)
(MPI, OpenMP, MATLAB, IDL, Graphics tools)
– HPC consultations by appointment
• Kadin Tseng (kadin@bu.edu)
• Doug Sondak (sondak@bu.edu)
– help@twister.bu.edu, help@cootie.bu.edu
Fall 2010
46
Download