Big Iron and Parallel Processing
USArray Data Processing Workshop
Original by:
Scott Teige, PhD, IU Information Technology Support
Modified for 2010 course by G Pavlis
June 28, 2016
Overview
• How big is “Big Iron”?
• Where is it, what is it?
• One system, the details
• Parallelism, the way forward
• Scaling and what it means to you
• Programming techniques
• Examples
• Exercises
What is the TeraGrid?
• “… a nationally distributed cyberinfrastructure that provides leading edge computational and data services for scientific discovery through research and education…”
• One of several consortia for high performance computing supported by the NSF
Some TeraGrid Systems
System      Site     Vendor   Peak (TFLOPS)   Memory (TB)
Kraken      NICS     Cray     608             128
Ranger      TACC     Sun      579             123
Abe         NCSA     Dell     89              9.4
Lonestar    TACC     Dell     62              11.6
Steele      Purdue   Dell     60              12.4
Queen Bee   LONI     Dell     50              5.3
Lincoln     NCSA     Dell     47              3.0
Big Red     IU       IBM      30              6.0
System Layout
System      Clock (GHz)   Cores
Kraken      2.30          66048
Ranger      2.66          62976
Abe         2.33          9600
Lonestar    2.66          5840
Steele      2.33          7144
Availability
System      Peak (TFLOPS)   Utilization   Idle (TFLOPS)
Kraken      608             96%           24.3
Ranger      579             91%           52.2
Abe         89              90%           8.9
Lonestar    62              92%           5.0
Steele      60              67%           19.8
Queen Bee   51              95%           2.5
Lincoln     48              4%            45.6
Big Red     31              83%           5.2
IU Research Cyberinfrastructure
The Big Picture:
• Compute
 - Big Red (IBM e1350 Blade Center JS21)
 - Quarry (IBM e1350 Blade Center HS21)
• Storage
 - HPSS
 - GPFS
 - OpenAFS
 - Lustre
 - Lustre/WAN
High Performance Systems
• Big Red
 - 30 TFLOPS IBM JS21 SuSE cluster
 - 768 blades / 3072 cores: 2.5 GHz PPC 970MP
 - 8 GB memory, 4 cores per blade
 - Myrinet 2000
 - LoadLeveler & Moab
• Quarry
 - 7 TFLOPS IBM HS21 RHEL cluster
 - 140 blades / 1120 cores: 2.0 GHz Intel Xeon 5335
 - 8 GB memory, 8 cores per blade
 - 1 Gb Ethernet (upgrading to 10 Gb)
 - PBS (Torque) & Moab
Data Capacitor (AKA Lustre)
High Performance Parallel File System
 - ca. 1.2 PB spinning disk
 - local and WAN capabilities
SC07 Bandwidth Challenge winner
 - moved 18.2 Gbps across a single 10 Gbps link
Dark side: likes large files; performs badly on large numbers of files and for simple commands like “ls” on a directory
HPSS
• High Performance Storage System
• ca. 3 PB tape storage
• 75 TB front-side disk cache
• Ability to mirror data between the IUPUI and IUB campuses
Practical points
• If you are doing serious data processing, NSF cyberinfrastructure systems have major advantages:
 - State-of-the-art compute servers
 - Large-capacity data storage
 - Archival storage for data backup
• Dark side:
 - Shared resource
 - Have to work through remote sysadmins
 - Commercial software (e.g. MATLAB) can be an issue
Parallel processing
• Why it matters
 - Single-CPU systems are reaching their limit
 - Multiple-CPU desktops are the norm already
 - All current HPC = parallel processing
• Dark side
 - Still requires manual coding changes (i.e., it is not yet common for code to be parallelized automatically)
 - Lots of added complexity
Serial vs. Parallel
Serial:
• Calculation
• Flow Control
• I/O
Parallel:
• Calculation
• Flow Control
• I/O
• Synchronization
• Communication
[Diagram: a serial program split into a serial fraction (1-F) and a parallel fraction F, the latter divided into N pieces of size F/N]
Amdahl’s Law:
S = 1/(1 - F + F/N)
where F is the fraction of the program that can be parallelized and N is the number of processors.
Special case, F = 1:
S = N, ideal scaling
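To make the scaling concrete, here is a minimal sketch (not from the original slides) that tabulates the Amdahl speedup S = 1/(1 - F + F/N) for a few parallel fractions F and processor counts N; note how even F = 0.99 saturates far below ideal scaling at large N.

/* Hypothetical helper: evaluate Amdahl's Law for several F and N. */
#include <stdio.h>

static double amdahl_speedup(double F, int N)
{
    return 1.0 / ((1.0 - F) + F / (double)N);
}

int main(void)
{
    const double fractions[] = {0.5, 0.9, 0.99, 1.0};
    const int cores[] = {4, 64, 1024};

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            printf("F=%.2f  N=%5d  S=%8.2f\n",
                   fractions[i], cores[j],
                   amdahl_speedup(fractions[i], cores[j]));
    return 0;
}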
Speed for various scaling rules
• “Paralyzable process”: S = N * exp(-(N-1)/q)
• “Superlinear scaling”: S > N
[Plot: speedup S versus number of processors N for each scaling rule]
Architectures
• Shared memory
 - These iMacs are shared-memory machines with 2 processors
 - Each CPU can address the same RAM
• Distributed memory
 - Blades (nodes) = a motherboard in a rack
 - Each blade has its own RAM
 - Clusters have a fast network to link nodes
• All modern HPC systems are both (each blade uses a multicore processor)
Current technologies
• Threads
 - Low-level functionality
 - Good for raw speed on the desktop
 - Mainly for the hard-core nerd (like me)
 - So we will say no more today
• OpenMP
• MPI
MPI vs. OpenMP
MPI:
• Code may execute across many nodes
• The entire program is replicated for each core (sections may or may not execute)
• Variables are not shared
• Typically requires structural modification to the code

OpenMP:
• Code executes only on the set of cores sharing memory
• Simplified interface to pthreads
• Sections of code may be parallel or serial
• Variables may be shared
• Incremental parallelization is easy
Let’s look first at OpenMP
• Who has heard of the old-fashioned “fork” procedure (part of Unix since the 1970s)?
• What is a “thread” then, and how is it different from a fork? (A minimal sketch follows.)
• OpenMP is a simple, clean way to spawn and manage a collection of threads
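A minimal sketch (assumed, not part of the workshop materials) contrasting the two: fork() copies the whole process into a separate address space, while a pthread shares its parent's memory. Compile with something like "cc fork_vs_thread.c -pthread".

/* Hypothetical example: fork() vs. a thread, and what each does to memory. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <pthread.h>

static int counter = 0;

static void *thread_body(void *arg)
{
    (void)arg;
    counter++;                  /* threads share memory: parent sees this */
    return NULL;
}

int main(void)
{
    pid_t pid = fork();         /* child gets a copy of the address space */
    if (pid == 0) {
        counter++;              /* modifies the child's copy only */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after fork:   counter = %d\n", counter);   /* still 0 */

    pthread_t tid;
    pthread_create(&tid, NULL, thread_body, NULL);
    pthread_join(tid, NULL);
    printf("after thread: counter = %d\n", counter);   /* now 1 */
    return 0;
}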
OpenMP Getting Started Exercise
Preliminaries:
 - In a terminal window, cd to the test directory
 - export OMP_NUM_THREADS=8
 - icc omp_hello.c -openmp -o hello
Run it:
 - ./hello
Look at the source code together and discuss (a sketch of what it might contain follows)
Run a variant:
 - export OMP_NUM_THREADS=20
 - ./hello
[Diagram: fork … join model of an OpenMP parallel region]
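For reference, a minimal sketch of what an omp_hello.c like the one used in this exercise typically looks like (an assumed reconstruction, not the workshop's actual file): each thread in the team created at the parallel region prints its ID, and the threads join again at the end of the region.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Fork: the master thread spawns a team of OMP_NUM_THREADS threads */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }   /* Join: implicit barrier at the end of the parallel region */
    return 0;
}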
You can even use this in, yes, FORTRAN:

      PROGRAM DOT_PRODUCT
      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), RESULT

!     Some initializations
      DO I = 1, N
         A(I) = I * 1.0
         B(I) = I * 2.0
      ENDDO
      RESULT = 0.0
      CHUNK = CHUNKSIZE

!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)
      DO I = 1, N
         RESULT = RESULT + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO NOWAIT

      PRINT *, 'Final Result= ', RESULT
      END

[Diagram: fork … join around the parallel DO loop]
Some basic issues in parallel codes
• Synchronization
 - Are the tasks of each thread balanced?
 - A CPU is tied up if it is waiting for other threads to exit
• Shared memory means two threads can try to alter the same data
 - Traditional threads use a mutex
 - OpenMP uses a simpler method (hang on – next slides)
OpenMP Synchronization Constructs
• MASTER: block executed only by the master thread
• CRITICAL: block executed by one thread at a time
• BARRIER: each thread waits until all threads reach the barrier
• ORDERED: block executed sequentially by threads
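A minimal sketch (assumed, not from the workshop materials) showing three of these constructs in C: each thread adds its contribution into a shared total inside a CRITICAL block, everyone waits at a BARRIER, and only the MASTER thread prints the result.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int total = 0;

    #pragma omp parallel
    {
        int mine = omp_get_thread_num() + 1;   /* some per-thread work */

        #pragma omp critical          /* one thread at a time updates total */
        total += mine;

        #pragma omp barrier           /* wait until every thread has added */

        #pragma omp master            /* only the master thread reports */
        printf("total = %d (from %d threads)\n",
               total, omp_get_num_threads());
    }
    return 0;
}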
Data Scope Attribute Clauses
• SHARED: variable is shared across all threads (the programmer must synchronize access, e.g. with CRITICAL)
• PRIVATE: variable is replicated in each thread (no synchronization needed, so faster)
• DEFAULT: change the default scoping of all variables in a region
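A minimal C sketch (assumed) of explicit scoping: DEFAULT(NONE) forces every variable in the region to be scoped by hand, which is the same habit the "OpenMP Advice" slide below recommends.

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;
    double scale = 2.0;              /* read-only, so safe to share */
    int i;

    for (i = 0; i < N; i++)
        a[i] = (double)i;

    /* default(none) makes any unscoped variable a compile-time error */
    #pragma omp parallel for default(none) \
            shared(a, scale) private(i) reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += scale * a[i];         /* each thread accumulates a private copy */

    printf("sum = %f\n", sum);
    return 0;
}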
Some Useful Library routines
• omp_set_num_threads(integer)
• omp_get_num_threads()
• omp_get_max_threads()
• omp_get_thread_num()
• Others are implementation dependent
OpenMP Advice
• Always explicitly scope variables
• Never branch into/out of a parallel region
• Never put a barrier in an if block
• Avoid I/O in a parallel loop (it nearly guarantees a load imbalance); see the sketch below
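A minimal sketch (assumed) of that last point: do the computation in the parallel loop and move the printing to a serial loop afterwards, so no thread stalls behind another one's I/O.

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
    double result[N];
    int i;

    #pragma omp parallel for private(i) shared(result)
    for (i = 0; i < N; i++)
        result[i] = (double)i * i;   /* pure computation, no I/O */

    for (i = 0; i < N; i++)          /* serial region: safe place for output */
        printf("result[%d] = %f\n", i, result[i]);
    return 0;
}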
Exercise 2: OpenMP
• The example programs are in ~/OMP_F_examples or ~/OMP_C_examples
• Go to https://computing.llnl.gov/tutorials/openMP/
• Skip to step 4; the compiler is “icc” or “ifort”
• Work on this until I call an end
Next topic: MPI
• MPI=Message Passing Interface
• Can be used on a multicore CPU, but
main application is for multiple nodes
• The next slide is the source code for the MPI hello world program we’ll run in a minute
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int myrank;
int ntasks;
/* [Diagram: the same executable is replicated on Node 1, Node 2, …] */
int main(int argc, char **argv)
{
/* Initialize MPI */
MPI_Init(&argc, &argv);
/* get number of workers */
MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
/* Find out my identity in the default communicator
each task gets a unique rank between 0 and ntasks-1 */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
…
…
MPI_Barrier(MPI_COMM_WORLD);
fprintf(stdout,"Hello from MPI_BABY=%d\n",myrank);
MPI_Finalize();
exit(0);
}
Running mpi_baby
cp -r /N/dc/scratch/usarray/MPI .
mpicc mpi_baby.c -o mpi_baby
mpirun -np 8 mpi_baby
mpirun -np 32 -machinefile my_list mpi_baby
From the man page:
  MPI_Scatter - Sends data from one task to all tasks in a group
  …
  The message is split into n equal segments; the ith segment is sent to the ith process in the group.

C     AUTHOR: Blaise Barney
      program scatter
      include 'mpif.h'

      integer SIZE
      parameter(SIZE=4)
      integer numtasks, rank, sendcount, recvcount, source, ierr
      real*4 sendbuf(SIZE,SIZE), recvbuf(SIZE)

C     Fortran stores this array in column major order, so the
C     scatter will actually scatter columns, not rows.
      data sendbuf /1.0, 2.0, 3.0, 4.0,
     &              5.0, 6.0, 7.0, 8.0,
     &              9.0, 10.0, 11.0, 12.0,
     &              13.0, 14.0, 15.0, 16.0 /

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

      if (numtasks .eq. SIZE) then
         source = 1
         sendcount = SIZE
         recvcount = SIZE
         call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf,
     &        recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
         print *, 'rank= ',rank,' Results: ',recvbuf
      else
         print *, 'Must specify',SIZE,' processors. Terminating.'
      endif

      call MPI_FINALIZE(ierr)
      end
Some Linux tricks to get more information:
man -w MPI
ls /N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/share/man/man3
MPI_Abort
MPI_Allgather
MPI_Allreduce
MPI_Alltoall
...
MPI_Wait
MPI_Waitall
MPI_Waitany
MPI_Waitsome
mpicc --showme
/N/soft/linux-rhel4-x86_64/intel/cce/10.1.022/bin/icc \
-I/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/include \
-pthread -L/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/lib \
-lmpi -lopen-rte -lopen-pal -ltorque -lnuma -ldl \
-Wl,--export-dynamic -lnsl -lutil -ldl -Wl,-rpath -Wl,/usr/lib64
MPI Advice
• Never put a barrier in an if block
• Use care with non-blocking
communication, things can pile up fast
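On the second point, a minimal sketch (assumed, not workshop code) of the usual discipline: post the non-blocking calls, then always complete them with a wait before reusing the buffers, so outstanding requests cannot pile up.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* pass my rank around a ring: send right, receive from the left */
    int right = (rank + 1) % ntasks;
    int left  = (rank - 1 + ntasks) % ntasks;
    int sendval = rank, recvval = -1;

    MPI_Request reqs[2];
    MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* complete both requests before touching sendval/recvval again */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}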
So, can I use MPI with OpenMP?
• Yes you can; extreme care is advised
• Some implementations of MPI forbid it
• You can get killed by “oversubscription” real fast; I (Scott) have seen run time increase like N^2
• But sometimes you must… some FFTW libraries are OpenMP multithreaded, for example (a minimal hybrid sketch follows)
• As things are going, this caution is likely to disappear
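A minimal hybrid sketch (assumed, not workshop code): ask MPI for thread support, then open an OpenMP region inside each rank; to avoid oversubscription, keep (ranks per node) x OMP_NUM_THREADS no larger than the cores on a node.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* MPI_THREAD_FUNNELED: only the main thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}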
Exercise: MPI
• Examples are in ~/MPI_F_examples or ~/MPI_C_examples
• Go to https://computing.llnl.gov/tutorials/mpi/
• Skip to step 6. MPI compilers are “mpif90” and “mpicc”; normal (serial) compilers are “ifort” and “icc”.
• Compile your code: “make all” (overrides section 9)
• To run an mpi code: “mpirun -np 8 <exe>” …or… “mpirun -np 16 -machinefile <ask me> <exe>”
• Skip section 12
• There is no evaluation form.
Where were those again?
• https://computing.llnl.gov/tutorials/openMP/excercise.html
• https://computing.llnl.gov/tutorials/mpi/exercise.html
Acknowledgements
• This material is based upon work supported by the National Science Foundation under Grant Numbers 0116050 and 0521433. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF).
• This work was supported in part by the Indiana Metabolomics and Cytomics Initiative (METACyt). METACyt is supported in part by Lilly Endowment, Inc.
• This work was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment, Inc.
• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.