Computing on Mio
Introduction
Timothy H. Kaiser, Ph.D.
tkaiser@mines.edu
Director - CSM High Performance Computing
Director - Golden Energy Computing Organization
http://inside.mines.edu/mio/tutorial/
Purpose
To enable people to quickly make use of the high performance computing resource Mio.mines.edu
To give people an overview of high performance computing so that they might know what is and is not possible
To get people comfortable using other high performance computing resources
Topics
• Introduction to High Performance Computing
  • Purpose and methodology
  • Hardware
  • Programming
• Mio overview
  • History
  • General and Node descriptions
  • File systems
  • Scheduling
More Topics
• Running on Mio
  • Documentation
  • Batch scripting
  • Interactive use
  • Staging data
  • Enabling Graphics sessions
  • Useful commands
Miscellaneous Topics
• Some tricks from MPI and OpenMP
• Libraries
• Nvidia node
High Performance Computing
What is Parallelism?
• Consider your favorite computational application
  • One processor can give me results in N hours
  • Why not use N processors -- and get the results in just one hour?
• The concept is simple:
  Parallelism = applying multiple processors to a single problem
Parallel computing is computing by committee
• Parallel computing: the use of multiple computers or processors working together on a common task.
• Each processor works on its section of the problem
• Processors are allowed to exchange information with other processors
[Figure: the grid of a problem to be solved is divided into regions; Process 0 through Process 3 each do the work for their own region]
Why do parallel computing?
• Limits of single CPU computing
  • Available memory
  • Performance
• Parallel computing allows:
  • Solve problems that don’t fit on a single CPU
  • Solve problems that can’t be solved in a reasonable time
Why do parallel computing?
• We can run…
  • Larger problems
  • Faster
  • More cases
  • Run simulations at finer resolutions
  • Model physical phenomena more realistically
Weather Forecasting
• The atmosphere is modeled by dividing it into three-dimensional regions or cells
• 1 mile x 1 mile x 1 mile (10 cells high)
• About 500 x 10^6 cells
• The calculations of each cell are repeated many times to model the passage of time
• About 200 floating point operations per cell per time step, or 10^11 floating point operations necessary per time step
• 10 day forecast with 10 minute resolution => 1.5 x 10^14 flop
• 100 Mflops would take about 17 days
• 1.7 Tflops would take 2 minutes
• 17 Tflops would take 12 seconds
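The arithmetic above is easy to check. A small C sketch (not part of the original slides; the cell count, flop per cell, and machine rates are taken straight from the bullets above) reproduces the estimates:

#include <stdio.h>

int main(void) {
    double cells = 500e6;                 /* about 500 x 10^6 cells          */
    double flop_per_cell = 200.0;         /* flop per cell per time step     */
    double steps = 10.0 * 24.0 * 6.0;     /* 10 days at 10 minute time steps */
    double total = cells * flop_per_cell * steps;    /* ~1.5 x 10^14 flop    */
    double rates[] = {100e6, 1.7e12, 17e12};         /* 100 Mflops, 1.7 Tflops, 17 Tflops */
    printf("total work: %.2e flop\n", total);
    for (int i = 0; i < 3; i++)
        printf("at %.1e flop/s: %.0f seconds (%.1f days)\n",
               rates[i], total / rates[i], total / rates[i] / 86400.0);
    return 0;
}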
Modeling Motion of Astronomical Bodies (brute force)
• Each body is attracted to each other body by gravitational forces.
  • For N bodies, N - 1 forces per body yields N^2 calculations each time step
• Movement of each body can be predicted by calculating the total force experienced by the body.
• A galaxy has ~10^11 stars => 10^9 years for one iteration
• Using an N log N efficient approximate algorithm => about a year
• NOTE: This is closely related to another hot topic: Protein Folding
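For concreteness, a brute-force sketch of the force calculation described above (illustration only, not from the slides; a tiny N and toy units with G = 1 are assumed):

#include <stdio.h>
#include <math.h>

#define N 4                 /* a real galaxy has ~10^11 bodies */

int main(void) {
    double x[N][3] = {{0,0,0},{1,0,0},{0,1,0},{0,0,1}};   /* positions */
    double m[N] = {1.0, 1.0, 1.0, 1.0};                   /* masses    */
    double f[N][3] = {{0.0}};                             /* forces    */
    const double G = 1.0;                                 /* gravitational constant, toy units */

    /* each body feels the pull of the other N-1 bodies: N^2 work per time step */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (i == j) continue;
            double d[3], r2 = 0.0;
            for (int k = 0; k < 3; k++) { d[k] = x[j][k] - x[i][k]; r2 += d[k]*d[k]; }
            double r = sqrt(r2);
            for (int k = 0; k < 3; k++)
                f[i][k] += G * m[i] * m[j] * d[k] / (r2 * r);   /* G m_i m_j / r^2 along d/r */
        }

    for (int i = 0; i < N; i++)
        printf("body %d net force: %g %g %g\n", i, f[i][0], f[i][1], f[i][2]);
    return 0;
}

(Compile with something like cc nbody.c -lm.)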
Types of parallelism: two extremes
• Data parallel
  • Each processor performs the same task on different data
  • Example - grid problems
  • Bag of Tasks or Embarrassingly Parallel is a special case
• Task parallel
  • Each processor performs a different task
  • Pipeline is a special case
  • Example - signal processing such as encoding multitrack data
Simple data parallel program
• Example: integrate 2-D propagation problem
  • Starting partial differential equation and its finite difference approximation (equations shown on the slide)
[Figure: the 2-D x-y grid is divided into strips, one strip per processing element, PE #0 through PE #7]
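A minimal sketch of the decomposition the figure shows (not the tutorial's code; the grid size NY and the PE count are invented for illustration) — each PE owns one strip of the grid and would exchange its edge rows with its neighbors at every step:

#include <stdio.h>

#define NY   64     /* number of grid rows (hypothetical)      */
#define NPES 8      /* number of processing elements, as shown */

int main(void) {
    int rows = NY / NPES;               /* rows per PE (assumes NY divisible by NPES) */
    for (int pe = 0; pe < NPES; pe++) {
        int first = pe * rows;          /* first row this PE updates */
        int last  = first + rows - 1;   /* last row this PE updates  */
        printf("PE #%d updates rows %d-%d and exchanges edge rows with its neighbors\n",
               pe, first, last);
    }
    return 0;
}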
Typical Task Parallel Application
[Pipeline diagram: DATA -> Normalize Task -> FFT Task -> Multiply Task -> Inverse FFT Task]
• Signal processing
  • Use one processor for each task
  • Can use more processors if one is overloaded
  • This is a pipeline
Parallel Program Structure
[Diagram: a serial start, then "start parallel"; workers 1 through N each perform work stages a, b, c, and d, with "communicate & repeat" between stages, followed by "end parallel" and End]
Parallel Problems
[Diagram: the same structure as the previous slide, illustrating two problems: subtasks don't finish together, and serial sections (no parallel work) leave processors idle, so we are not using all processors]
Theoretical upper limits
• All parallel programs contain:
  • Parallel sections
  • Serial sections
• Serial sections are when work is being duplicated or no useful work is being done (waiting for others)
• Serial sections limit the parallel effectiveness
  • If you have a lot of serial computation then you will not get good speedup
  • No serial work “allows” perfect speedup
• Amdahl’s Law states this formally
Amdahl’s Law
• Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors.
• Effect of multiple processors on run time:
  tp = (fp/N + fs) ts
• Effect of multiple processors on speedup:
  S = ts/tp = 1/(fp/N + fs)
• Where
  • fs = serial fraction of code
  • fp = parallel fraction of code
  • N = number of processors
• Perfect speedup: t = t1/N, or S(N) = N
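A quick worked example (not on the original slide), using the fp = 0.99 case plotted in the “Amdahl’s Law Vs. Reality” figure below: with N = 100 processors, S = 1/(0.99/100 + 0.01) = 1/0.0199 ≈ 50, and even as N goes to infinity the speedup can never exceed 1/fs = 100.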
Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance.
Amdahl’s Law Vs. Reality
• Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications.
• In reality, communications will result in a further degradation of performance.
[Plot: speedup versus number of processors (up to 250) for fp = 0.99, comparing Amdahl's Law with reality]
Hardware
Some Classes of machines
Distributed Memory
[Diagram: four processors, each with its own memory, connected by a network]
• Processors only have access to their local memory
• They “talk” to other processors over a network
Some Classes of machines
Uniform Shared Memory (UMA)
[Diagram: multiple processors all attached to a single shared memory]
• All processors have equal access to memory
• Can “talk” via memory
Some Classes of machines
Hybrid
• Shared memory nodes connected by a network
Some Classes of machines
More common today
• Each node has a collection of multicore chips
• Mio has 48 nodes
  • 32 have 8 cores/node (4/socket)
  • 16 have 12 cores/node (6/socket)
Some Classes of machines
Hybrid Machines
• Add special purpose processors to normal processors
• Not a new concept, but regaining traction
• Example: our Tesla Nvidia node, cuda1
[Diagram: a "Normal" CPU paired with a Special Purpose Processor - FPGA, GPU, Vector, Cell...]
Programming
• Message passing
• Threads/Shared Memory
• Manual
Message Passing
• Most common method for large scale applications
• Usually the same program running on many cores
• Processes communicate by passing messages
• Let’s look at a silly example...
Message Passing
• Message Passing Interface (MPI) library is often used
  • Works on shared and distributed memory machines
  • Each process is given a number (0 to N-1)
• Most parallel machines ship with MPI
• Can be built from source
Fortran MPI Program
program send_receive
  include "mpif.h"
  integer myid,ierr,numprocs,tag,source,destination,count
  integer buffer
  integer status(MPI_STATUS_SIZE)

  call MPI_INIT( ierr )
  call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
  call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
  tag=1234
  count=1
  if(myid .eq. 0)then
    buffer=5678
    call MPI_Send(buffer, count, MPI_INTEGER, 1, &
                  tag, MPI_COMM_WORLD, ierr)
    write(*,*)"processor ",myid," sent ",buffer
  endif
  if(myid .eq. 1)then
    call MPI_Recv(buffer, count, MPI_INTEGER, 0, &
                  tag, MPI_COMM_WORLD, status, ierr)
    write(*,*)"processor ",myid," got ",buffer
  endif
  call MPI_FINALIZE(ierr)
  stop
end
C MPI Program
#include <stdio.h>
#include "mpi.h"

int main(int argc,char *argv[])
{
    int myid, numprocs, tag, source, destination, count, buffer;
    MPI_Status status;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    tag=1234;
    count=1;
    if(myid == 0){
        buffer=5678;
        MPI_Send(&buffer,count,MPI_INT,1,tag,MPI_COMM_WORLD);
        printf("processor %d sent %d\n",myid,buffer);
    }
    if(myid == 1){
        MPI_Recv(&buffer,count,MPI_INT,0,tag,MPI_COMM_WORLD,&status);
        printf("processor %d got %d\n",myid,buffer);
    }
    MPI_Finalize();
}
Threads and Shared Memory
• Threads are sometimes called light weight processes
• There is some mapping of threads to cores
  • Often 1-1 mapping
  • GPU can have many threads for each core
• Threads share memory on a multicore machine
• Cheap to start, sleep, wake
• Threads can move between cores
• Communicate via shared memory
Threads Programming
• Fork join model
  • In theory can automatically be done by compilers
• Pthreads is a threading package for manually writing threaded programs
• Semiautomatic directive driven threading
  • Directives tell the compiler where it can multithread
  • OpenMP is the most common directive set
• It is possible to combine MPI and OpenMP in a single program
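Pthreads is mentioned above but not shown elsewhere in this tutorial; a minimal fork-join sketch (illustration only, with a made-up thread count) looks like this:

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

/* the work each thread performs; arg carries this thread's id */
void *work(void *arg) {
    int id = *(int *)arg;
    printf("thread %d doing work\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {          /* fork */
        ids[i] = i;
        pthread_create(&threads[i], NULL, work, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)            /* join */
        pthread_join(threads[i], NULL);
    return 0;
}

(Compile with something like cc -pthread threads.c; the OpenMP examples that follow do the same kind of thing with directives instead.)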
Fortran OpenMP
!$OMP PARALLEL DO
do i=1,maxthreads
   seed(i)=i+i**2
   errcode(i)=vslnewstream(stream(i), brng, seed(i) )
!  write(*,*)i,errcode(i)
enddo
...
!$OMP PARALLEL DO PRIVATE(twod)
do i=1,nrays
   twod=>tarf(:,:,i)
   j=omp_get_thread_num()+1
   errcode(j)=vdrnggaussian( method, stream(j), msize*msize, twod, a, sigma)
!  write(*,*)"sub array ",i,j,errcode(j),twod(1,1)
enddo
...
!$OMP PARALLEL DO PRIVATE(twod)
do i=1,nrays
   twod=>tarf(:,:,i)
!  write(*,*)twod
   call my_clock(cnt1(i))
   CALL DGESV( N, NRHS, twod, LDA, IPIVs(:,i), Bs(:,i), LDB, INFOs(i) )
   call my_clock(cnt2(i))
   write(*,'(i5,i5,3(f12.3))')i,infos(i),cnt2(i),cnt1(i),real(cnt2(i)-cnt1(i),b8)
enddo
C OpenMP
#pragma omp parallel sections
{
   #pragma omp section
   {
      system_clock(&t1_start);
      over(m1,n);
      system_clock(&t1_end);
      t1_start=t1_start-t0_start;
      t1_end=t1_end-t0_start;
   }
   #pragma omp section
   {
      system_clock(&t2_start);
      over(m2,n);
      system_clock(&t2_end);
      t2_start=t2_start-t0_start;
      t2_end=t2_end-t0_start;
   }
   ...
}

A different thread and core will be assigned to run each section.
“Manual” Parallel Programming
• Log on to a machine with N cores
• Launch N separate tasks, putting each in the background
  • Usually the N tasks are independent
• Each task launched could actually be a pipeline of processes or multithreaded
• Example: Madagascar
Golden Energy Computing Organization
GECO - a quick overview
GECO
• Golden Energy Computing Organization
• Center to be a “national hub for computational inquiries aimed at the discovery of new ways to meet the energy needs of our society.”
• GECO’s main computational resource is RA
GECO HPC Hardware/Software: “Ra”
• Architecture
  • Dell with Intel quad-core, dual-socket system
  • 2144 processing cores in 268 nodes
    • 256 nodes with 512 Clovertown E5355 (2.67 GHz) processors (quad core, dual socket); 184 with 16 Gbytes & 72 with 32 Gbytes
    • 12 nodes with 48 Xeon 7140M (3.4 GHz) processors (quad socket, dual core); 32 Gbytes each
• Memory
  • 5,632 Gbytes RAM (5.6 terabytes)
  • 16/32 gigabytes RAM per node
  • 300 terabyte disk
  • 300 terabyte tape backup
• Performance
  • 18 Tflop sustained performance
  • 23 Tflop peak
  • Like every human on the planet doing 2500 calculations per second
RA problems
• Only for Energy Science
  • Little student access
• Allocation is by proposal
• People wanted quick access and RA is heavily loaded
Mio.Mines.Edu
It’s All Mine
Mio.mines.edu
• New concept in HPC for CSM
  • School puts up the money for infrastructure
  • Researchers purchase individual nodes
    • They own the nodes
    • Can use others’ nodes when they are not in use
• Started with the head node and 4 compute nodes, 32 cores
• Current (10/28/10): 48 nodes, 472 cores, 5.53 Tflop
Original Mio Compute Nodes
• Penguin
  • 2 x Dual Intel Xeon X5570 Quad Core 2.93GHz 8MB, max RAM speed 1333MHz
  • up to 2 x 48GB DDR3-800 REG, ECC (24 x 4GB)
  • 2 x 160GB, SATA, 7200RPM
  • 2 x Intel Xeon Dual Socket Motherboard with Integrated Infiniband DDR/CX4 Connections
• Half the size of RA nodes
  • More efficient in power and computation
  • More memory
  • Faster clock
• Have since gone to 12 core nodes and 1 Tbyte disk
Mio @ Oct. 28, 2010
Total Installed: 472 cores 5.53 Tflop
Node Summary
Node       Cores   RAM (GB)   Disk free /scratch (GB)
0          8       48         128
1          8       48         128
2          8       48         128
3          8       48         128
4          8       192        128
5          8       48         128
6 to 31    8       24         128
32 to 47   12      24         849
Mio: Installing a 200 Tbyte File System
• Panasas system
  • Available in parallel to all nodes
• Used by BP
• Donated to CSM
• Panasas gave us a one shot “deal” on software
• Was working but we need to rebuild after a switch upgrade
File Systems on Mio
• Four partitions
  • $HOME - should be kept very small, having only start up scripts and other simple scripts
  • $DATA - should contain programs you have built for use in your research, small data sets, and run scripts
  • $SCRATCH - the main area for running applications; output from parallel runs should be done to this directory
  • $SETS - this area is for large data sets, mainly read only, that will be used over a number of months
What’s in a Name?
• The name Mio is a play on words.
• It is a Spanish translation of the word “mine” as in belongs to me, not the hole in the ground.
  • The phrase “The computer is mine.” can be translated as “El ordenador es mío.”
Mio.mines.edu
[Diagram: the head node and compute nodes connected by a network switch, with the Panasas file system and Tux also attached]
Mio Links
• http://mio.mines.edu/ganglia/ - “picture of usage”
• http://inside.mines.edu/mio - documentation
• http://mio.mines.edu/inuse/ - nodes in use
• http://mio.mines.edu/jobs/ - jobs running
Getting on to Mio
• Address: mio.mines.edu
  • Only available on campus, or using VPN, or through another on campus machine (RA)
• Must use an ssh client
  • Unix and OSX: ssh is built in
    • OSX - /HD/Applications/Utilities/Terminal
  • Windows: use putty (installed on Mines machines)
    • All Applications\putty
ssh usage
• Strongly suggest that you set up and use ssh keys
• Use a nontrivial pass phrase and a connection agent so that you only need to type the pass phrase once every 4 hours
  • Good security
  • Easy access
• See http://geco.mines.edu/ssh/
ssh usage
• For today
  • Passwords on Mio.mines.edu are the same as your Mines MultiPass password, the same as on the standard unix boxes on campus such as imagine or illuminate.
  • If you don't have a MultiPass password or need to reset it, see http://newuser.mines.edu/password
First login
You may see:
...
Creating directory '/u/is/az/joeminer'.
Creating directory '/u/is/az/joeminer/.mozilla'.
Creating directory '/u/is/az/joeminer/.mozilla/extensions'.
Creating directory '/u/is/az/joeminer/.mozilla/plugins'.
Creating directory '/u/is/az/joeminer/.kde'.
Creating directory '/u/is/az/joeminer/.kde/Autostart'.

It doesn't appear that you have set up your ssh key.
This process will make the files:
    /u/is/az/joeminer/.ssh/id_rsa.pub
    /u/is/az/joeminer/.ssh/id_rsa
    /u/is/az/joeminer/.ssh/authorized_keys
Generating public/private rsa key pair.
Enter file in which to save the key (/u/is/az/joeminer/.ssh/id_rsa):
Created directory '/u/is/az/joeminer/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /u/is/az/joeminer/.ssh/id_rsa.
Your public key has been saved in /u/is/az/joeminer/.ssh/id_rsa.pub.
The key fingerprint is:
55:bf:89:4d:03:4d:da:21:84:a9:e8:b2:b9:3a:bd:fe joeminer@mio.mines.edu
[joeminer@mio ~]$

Hit return until you see the final prompt.
Mio Environment
• When you ssh to Mio you are on the head node
  • Shared resource
  • DO NOT run parallel, compute intensive, or memory intensive jobs on the head node
• You have access to basic Unix commands
  • Editors, file commands, gcc...
• Need to set up your environment to gain access to HPC compilers
Mio HPC Environment
• You want access to parallel and high performance compilers
  • Intel, AMD, mpicc, mpif90
• Commands to run and monitor parallel programs
• See: http://inside.mines.edu/mio/page3.html
  • Gives details
  • Shortcut
Shortcut Setup
http://inside.mines.edu/mio/page3.html

Add the following to your .bashrc file and log out/in:

if [ -f /usr/local/bin/setup/setup ]; then source /usr/local/bin/setup/setup intel ; fi

• Openmpi parallel MPI environment
• MPI compilers
• MPI run commands
• Python 2.6.5 and 3.1.2
• Compilers
  • Intel 11.x
  • AMD
  • Portland Group
  • NAG

Should give you:
[tkaiser@mio ~]$ which mpicc
/opt/lib/openmpi/1.4.2/intel/11.1/bin/mpicc
[tkaiser@mio ~]$
Getting the Examples
[tkaiser@mio tests]$ cd $DATA
[tkaiser@mio tests]$ mkdir cwp
[tkaiser@mio tests]$ cd cwp
[tkaiser@mio cwp]$ wget http://inside.mines.edu/mio/tutorial/cwp.tar
...
2011-01-11 13:31:06 (48.6 MB/s) - `cwp.tar' saved [20480/20480]
[tkaiser@mio cwp]$ tar -xf *tar
A digression - Makefiles
• Makefiles are instructions for building a program
• Normal names are makefile or Makefile
• In a directory that contains a makefile type the command
  • make
• They can be very complicated
• Can have “sub commands”
• See: http://www.eng.hawaii.edu/Tutor/Make/
Our Makefile
all: c_ex01 f_ex01 pointer invertc

c_ex01 : c_ex01.c
	mpicc -o c_ex01 c_ex01.c

f_ex01: f_ex01.f90
	mpif90 -o f_ex01 f_ex01.f90

pointer: pointer.f90
	ifort -openmp pointer.f90 -mkl -o pointer

invertc: invertc.c
	icc -openmp invertc.c -o invertc

clean:
	rm -rf c_ex01 f_ex01 *mod err* out* mynodes* pointer invertc

tar:
	tar -cf introduction.tar makefile batch1 c_ex01.c f_ex01.f90 pointer.f90 invertc.c batch2
The End