Presentation slides

Introduction to HPC Workshop
October 9, 2014
Introduction
Rob Lane
HPC Support
Research Computing Services
CUIT
Introduction
• HPC Basics
• First HPC Workshop
Yeti
• 2 head nodes
• 101 execute nodes
• 200 TB storage
Yeti
• 101 execute nodes
– 38 x 64 GB
– 8 x 128 GB
– 35 x 256 GB
– 16 x 64 GB + Infiniband
– 4 x 64 GB + nVidia K20 GPU
Yeti
• CPU
– Intel E5-2650L
– 1.8 GHz
– 8 Cores
– 2 per Execute Node
Yeti
• Expansion Round
– 66 new systems
– Faster CPU
– More Infiniband
– More GPU (nVidia K40)
– ETA January 2015
Yeti
HP S6500 Chassis
HP SL230 Server
Job Scheduler
• Manages the cluster
• Decides when a job will run
• Decides where a job will run
• We use Torque/Moab
Job Queues
• Jobs are submitted to a queue
• Jobs sorted in priority order
• Not a FIFO
Access
Mac Instructions
1. Run Terminal
Access
Windows Instructions
1. Search for putty on the Columbia home page
2. Select the first result
3. Follow the link to the PuTTY download page
4. Download putty.exe
5. Run putty.exe
Access
Mac (Terminal)
$ ssh UNI@yetisubmit.cc.columbia.edu
Windows (Putty)
Host Name: yetisubmit.cc.columbia.edu
Work Directory
$ cd /vega/free/users/your UNI
• Replace “your UNI” with your UNI
$ cd /vega/free/users/hpc2108
Copy Workshop Files
• Files are in /tmp/workshop
$ cp /tmp/workshop/* .
Editing
No single obvious choice for editor
• vi – simple but difficult at first
• emacs – powerful but complex
• nano – simple but not really standard
nano
$ nano hellosubmit
“^” means “hold down Control”
^a : go to beginning of line
^e : go to end of line
^k : delete line
^o : save file
^x : exit
hellosubmit
#!/bin/sh
# Directives
#PBS -N HelloWorld
#PBS -W group_list=yetifree
#PBS -l nodes=1:ppn=1,walltime=00:01:00,mem=20mb
#PBS -M UNI@columbia.edu
#PBS -m abe
#PBS -V
# Set output and error directories
#PBS -o localhost:/vega/free/users/UNI
#PBS -e localhost:/vega/free/users/UNI
# Print "Hello World"
echo "Hello World"
# Sleep for 10 seconds
sleep 10
# Print date and time
date
hellosubmit
#!/bin/sh
# Directives
#PBS -N HelloWorld
#PBS -W group_list=yetifree
#PBS -l nodes=1:ppn=1,walltime=00:01:00,mem=20mb
#PBS -M UNI@columbia.edu
#PBS -m n
#PBS -V
• “-m n” turns off job status email
hellosubmit
$ qsub hellosubmit
298151.elk.cc.columbia.edu
$
qstat
$ qsub hellosubmit
298151.elk.cc.columbia.edu
$ qstat 298151
Job ID          Name         User       Time Use S Queue
--------------- ------------ ---------- -------- - -----
298151.elk      HelloWorld   hpc2108           0 Q batch1
hellosubmit
$ qstat 298151
Job ID          Name         User       Time Use S Queue
--------------- ------------ ---------- -------- - -----
298151.elk      HelloWorld   hpc2108           0 Q batch1
$ qstat 298151
qstat: Unknown Job Id Error 298151.elk.cc.columbia.edu
hellosubmit
$ ls -l
total 4
-rw------- 1 hpc2108 yetifree 398 Oct  8 22:13 hellosubmit
-rw------- 1 hpc2108 yetifree   0 Oct  8 22:44 HelloWorld.e298151
-rw------- 1 hpc2108 yetifree  41 Oct  8 22:44 HelloWorld.o298151
hellosubmit
$ cat HelloWorld.o298151
Hello World
Thu Oct 9 12:44:05 EDT 2014
Any Questions?
Interactive
• Most jobs run as “batch”
• Can also run interactive jobs
• Get a shell on an execute node
• Useful for development, testing, troubleshooting
Interactive
$ cat interactive
qsub -I -W group_list=yetifree -l walltime=5:00,mem=100mb
Interactive
$ qsub -I -W group_list=yetifree -l walltime=5:00,mem=100mb
qsub: waiting for job 298158.elk.cc.columbia.edu to start
Interactive
qsub: job 298158.elk.cc.columbia.edu ready

[ASCII-art yeti banner]

+--------------------------------+
|                                |
| You are in an interactive job. |
|                                |
|   Your walltime is 00:05:00    |
|                                |
+--------------------------------+
Interactive
$ hostname
charleston.cc.columbia.edu
Interactive
$ exit
logout
qsub: job 298158.elk.cc.columbia.edu completed
$
GUI
• Can run GUIs in interactive jobs
• Need an X server on your local system
• See the user documentation for more information
User Documentation
• hpc.cc.columbia.edu
• Go to “HPC Support”
• Click on “Yeti user documentation”
Job Queues
• Scheduler puts all jobs into a queue
• Queue selected automatically
• Queues have different settings
Job Queues

Queue        Time Limit  Memory Limit  Max. User Run
-----------  ----------  ------------  -------------
Batch 1      12 hours    4 GB          512
Batch 2      12 hours    16 GB         128
Batch 3      5 days      16 GB         64
Batch 4      3 days      None          8
Interactive  4 hours     None          4
qstat -q
$ qstat -q

server: elk.cc.columbia.edu

Queue            Memory CPU Time Walltime  Node  Run Que Lm  State
---------------- ------ -------- --------  ----  --- --- --  -----
batch1              4gb       -- 12:00:00    --   42  15 --   E R
batch2             16gb       -- 12:00:00    --  129  73 --   E R
batch3             16gb       -- 120:00:00   --  148 261 --   E R
batch4               --       -- 72:00:00    --   11  12 --   E R
interactive          --       -- 04:00:00    --    0   1 --   E R
interlong            --       -- 48:00:00    --    0   0 --   E R
route                --       --       --    --    0   0 --   E R
                                                 --- ---
                                                 330 362
yetifree
• Maximum processors limited
– Currently 4 maximum
• Storage quota
– 16 GB
• No email support
yetifree
$ quota -s
Disk quotas for user hpc2108 (uid 242275):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
hpc-cuit-storage-2.cc.columbia.edu:/free/
                   122M  16384M  16384M               8   4295m   4295m
email
from:    root <hpc-noreply@columbia.edu>
to:      hpc2108@columbia.edu
date:    Wed, Oct 8, 2014 at 11:41 PM
subject: PBS JOB 298161.elk.cc.columbia.edu

PBS Job Id: 298161.elk.cc.columbia.edu
Job Name:   HelloWorld
Exec host:  dublin.cc.columbia.edu/4
Execution terminated
Exit_status=0
resources_used.cput=00:00:02
resources_used.mem=8288kb
resources_used.vmem=304780kb
resources_used.walltime=00:02:02
Error_Path: localhost:/vega/free/users/hpc2108/HelloWorld.e298161
Output_Path: localhost:/vega/free/users/hpc2108/HelloWorld.o298161
Intern
• Research Computing Services (RCS) is looking for an intern
• Paid position
• ~10 hours a week
• Will be on LionShare next week
MPI
• Message Passing Interface
• Allows applications to run across multiple computers
MPI
• Edit MPI submit file
• Load MPI environment module
• Compile sample program
MPI
#!/bin/sh
# Directives
#PBS -N MpiHello
#PBS -W group_list=yetifree
#PBS -l nodes=3:ppn=1,walltime=00:01:00,mem=20mb
#PBS -M UNI@columbia.edu
#PBS -m abe
#PBS -V
# Set output and error directories
#PBS -o localhost:/vega/free/users/UNI
#PBS -e localhost:/vega/free/users/UNI
# Load mpi module.
module load openmpi
# Run mpi program.
mpirun mpihello
MPI
$ module load openmpi
$ which mpicc
/usr/local/openmpi/bin/mpicc
$ mpicc -o mpihello mpihello.c
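The slides compile mpihello.c but never show its source. A minimal MPI hello program along these lines would build with the mpicc command above; this is a sketch of what such a file typically contains, not the workshop's actual code:

```c
/* mpihello.c -- hypothetical reconstruction of the workshop sample.
 * Each MPI process prints its rank and the total process count.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);               /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                       /* shut down MPI */
    return 0;
}
```

With nodes=3:ppn=1 in the submit file, mpirun starts three processes, so three such lines would appear in the job's output file.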
MPI
$ qsub mpisubmit
298501.elk.cc.columbia.edu
Questions?
Any questions?