Additional Topics in
Research Computing
Shelley Knuth
shelley.knuth@colorado.edu
www.rc.colorado.edu
Research Computing @ CU Boulder
Peter Ruprecht
peter.ruprecht@colorado.edu
Outline
• Brief overview of Janus
• Update on replacement for Janus
• Alternate compute environment – condo
• PetaLibrary storage
• General supercomputing information
• Queues and job submission
• Keeping the system happy
• Using and installing software
• Compiling and submission examples
• Time for questions, feedback, and one-on-one
Research Computing - Contact
• rc-help@colorado.edu
• Main contact email
• Will go into our ticket system
• http://www.rc.colorado.edu - main web site
• http://data.colorado.edu - data management resource
• Mailing Lists (announcements and collaboration)
https://lists.rc.colorado.edu/mailman/listinfo/rc-announce
• Twitter @CUBoulderRC and @cu_data
• Facebook: https://www.facebook.com/CuBoulderResearchComputing
• Meetup: http://www.meetup.com/University-of-Colorado-Computational-Science-and-Engineering/
JANUS SUPERCOMPUTER AND
OTHER COMPUTE RESOURCES
What Is a Supercomputer?
• CU’s supercomputer is one large computer made up of many smaller computers and processors
• Each of these smaller computers is called a node
• Each node has processors/cores
• These carry out the instructions of the computer
• In a supercomputer, all these nodes talk to each other through a communications network
• On Janus – InfiniBand
Hardware - Janus Supercomputer
• 1368 compute nodes (Dell
C6100)
• 16,428 total cores
• No battery backup of the
compute nodes
• Fully non-blocking QDR InfiniBand network
• 960 TB of usable Lustre-based scratch storage
• 16-20 GB/s max throughput
Different Node Types
• Login nodes
• This is where you are when you log in to login.rc.colorado.edu
• No heavy computation, interactive jobs, or long-running processes
• Script or code editing, minor compiling
• Job submission
• Compile nodes
• Compiling code
• Job submission
• Compute/batch nodes
• This is where jobs that are submitted through the scheduler run
• Intended for heavy computation
Replacing Janus
• Janus goes off warranty in July, but we expect to run it (or
part of it) for another 9-12 months after that.
• Have applied for a National Science Foundation grant for a
replacement cluster.
• About 3x the FLOPS performance of Janus
• Approx 350 compute nodes, 24 cores, 128 GB RAM
• Approx 20 nodes with GPU coprocessors
• Approx 20 nodes with Xeon Phi processors
• Scratch storage suited to a range of applications
• Decision from NSF expected by early Sept. Installation
would begin early in 2016.
“Condo” Cluster
• CU provides infrastructure such as network, data center
space, and scratch storage
• Partners buy nodes that fit into the infrastructure
• Partners get priority access on their nodes
• Other condo partners can backfill onto nodes not currently in
use
• Node options will be updated with current hardware each
year
• A given generation of nodes will run 5 years
• Little or no customization of OS and hardware options
Condo Node Options - 2015
Compute node:
• 2x 12-core 2.5 GHz Intel “Haswell” processors
• 128 GB RAM
• 1x 1 TB 7200 RPM hard drive
• 10 gigabit/s Ethernet
• Potential uses: general purpose high-performance computation that
doesn’t require low-latency parallel communication between nodes
• $6,875
GPU node:
• 2x 12-core 2.5 GHz Intel “Haswell” processors
• 128 GB RAM
• 1x 1 TB 7200 RPM hard drive
• 10 gigabit/s Ethernet
• 2x NVIDIA K20M GPU coprocessors
• Potential uses: molecular dynamics, image processing
• $12,509
Node Options – cont’d
High-memory node
• 4x 12-core 3.0 GHz Intel “Ivy Bridge” processors
• 1024 GB RAM
• 10x 1 TB 7200 RPM hard drives in high-performance RAID
configuration
• 10 gigabit/s Ethernet
• Potential uses: genetics/genomics or finite-element applications
• $34,511
Notes:
• Dell is providing bulk discount pricing even for incremental
purchases
• 5-year warranty
• Network infrastructure in place to support over 100 nodes at 10
Gbps
Condo Scheduling Overview
• All jobs are run through a batch/queue system (Slurm).
• Interactive jobs on compute nodes are allowed, but they must be initiated through the scheduler.
• Each partner group has its own high-priority queue for jobs
that will run on nodes that it has purchased.
• High-priority jobs are guaranteed to start running within 24
hours (unless other high-priority jobs are already running)
and can run for a maximum wall time of 14 days.
• All partners also have access to a low-priority queue that can
run on any condo nodes that are not already in use by their
owners.
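
As a sketch of what condo submission might look like in practice (the QOS names below are placeholders for illustration only; your group's actual high- and low-priority QOS names are assigned when you join):

    # Hypothetical QOS names -- substitute the ones assigned to your group
    sbatch --qos=mygroup-high job.sh   # high priority, runs on your group's own nodes
    sbatch --qos=condo-low    job.sh   # low priority, backfills onto idle condo nodes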
RMACC conference
• Rocky Mountain Advanced Computing
Consortium
• August 11-13, 2015, Wolf Law Building, CU-Boulder campus
• https://www.rmacc.org/HPCSymposium
• Hands-on and lecture tutorials about HPC, data, and technical topics
• Educational and student focused panels
• Poster session for students
DATA STORAGE AND TRANSFER
Storage Spaces
• Home Directories
• Not high performance; not for direct computational output
• 2 GB quota; snapshotted frequently
• /home/user1234
• Project Spaces
• Not high performance; can be used to store or share programs,
input files, maybe small data files
• 250 GB quota; snapshotted regularly
• /projects/user1234
• Lustre Parallel Scratch Filesystem
• No hard quotas; no backups
• Files created more than 180 days in the past may be purged at
any time
• /lustre/janus_scratch/user1234
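
Given these characteristics, a common pattern is to keep code and small inputs in the project space, run jobs from Lustre scratch, and copy results you want to keep back afterward. A minimal sketch using the example paths above (the input and run directory names are placeholders):

    # Stage inputs onto the high-performance scratch filesystem
    cp -r /projects/user1234/my_inputs /lustre/janus_scratch/user1234/run01
    cd /lustre/janus_scratch/user1234/run01

    # ... run the job from here ...

    # Copy results worth keeping back; scratch files may be purged after 180 days
    cp -r results /projects/user1234/run01_results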
Research Data Storage: PetaLibrary
• Long-term, large-scale storage option
• “Active” or “Archive”
• Primary storage system for condo
• Modest charge to use
• Provide expertise and services around this
storage
• Data management
• Consulting
PetaLibrary: Cost to Researchers
Service offered                  Cost/TB/yr
Active                           $100
Active plus replication          $200
Active plus one copy on tape     $120
Archive (HSM)                    $60
Archive plus one copy on tape    $80
• Additional free options
• Snapshots: on-disk incremental backups
• Sharing through Globus Online
ACCESSING AND UTILIZING RC
RESOURCES
Initial Steps to Use RC Systems
• Apply for an RC account
• https://portals.rc.colorado.edu/account/request
• Read Terms and Conditions
• Choose “University of Colorado – Denver”
• Use your UCD email address
• Get a One-Time Password device
• Apply for a computing allocation
• Initially you will be added to UCD’s existing umbrella
allocation
• Subsequently, you will need a brief proposal for additional time
Logging In
• SSH to login.rc.colorado.edu
• Use your RC username and One-Time Password
• From a login node, you can SSH to a compile node (janus-compile1, 2, 3, or 4) to build your programs
• All RC nodes use RedHat Enterprise Linux 6
Job Scheduling
• On a supercomputer, jobs are scheduled rather than run instantly at the command line
• Several different scheduling software packages are in
common use
• Licensing, performance, and functionality issues have
caused us to choose Slurm
• SLURM = Simple Linux Utility for Resource Management
• Open source
• Increasingly popular at other sites
More at https://www.rc.colorado.edu/support/examples/slurmtestjob
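
For orientation, the handful of Slurm commands you will use most often (covered in more detail on the following slides):

    sbatch myjob.sh     # submit a batch job script
    squeue -u $USER     # list your pending and running jobs
    scancel <jobid>     # cancel a job
    sinfo               # show the state of nodes and partitions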
Running Jobs
• Do NOT compute on login or compile nodes
• Interactive jobs
• Work interactively at the command line of a compute node
• Commands: salloc --qos=janus-debug or sinteractive --qos=janus-debug
• Batch jobs
• Submit job that will be executed when resources are
available
• Create a text file containing information about the job
• Submit the job file to a queue
• sbatch --qos=<queue> file (see the example job script below)
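
A minimal example of such a job file, here called test_job.sh (a sketch; adjust the resource requests and replace the final command with your own program):

    #!/bin/bash
    #SBATCH --job-name=test_job     # name shown by squeue
    #SBATCH --qos=janus-debug       # queue/QOS to submit to
    #SBATCH --nodes=1               # number of nodes
    #SBATCH --ntasks=1              # number of cores/tasks
    #SBATCH --time=00:05:00         # wall time limit (hh:mm:ss)
    #SBATCH --output=test_job.out   # file for the job's output

    # Commands below run on the allocated compute node
    echo "Running on $(hostname)"

Submit it with "sbatch test_job.sh" and watch it with "squeue -u $USER".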
Quality of Service = Queues
• janus-debug
• Only debugging – no production work
• Maximum wall time 1 hour, 2 jobs per user
• (Wall time = the maximum amount of time your job is allowed to run)
• janus
• Default
• Maximum wall time of 24 hours
• janus-long
• Maximum wall time of 168 hours; limited number of nodes
• himem
• crestone
• gpu
Parallel Computation
• Special software is required to run a calculation across more
than one processor core or more than one node
• Sometimes the compiler can automatically parallelize an
algorithm, and frequently special libraries and functions can
be used to abstract the parallelism
• OpenMP
• For multiprocessing across cores in a single node
• Most compilers can auto-parallelize with OpenMP
• MPI – Message Passing Interface
• Can parallelize across multiple nodes
• Libraries/functions compiled into executable
• Several implementations (OpenMPI, mvapich, mpich, …)
• Usually requires a fast interconnect (like InfiniBand)
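
To make the OpenMP/MPI distinction concrete, here is a sketch of the relevant lines from two hypothetical job scripts (the program names are placeholders):

    # OpenMP job: one task using several cores on a single node
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=12
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_openmp_program

    # MPI job: many tasks, spread across multiple nodes
    #SBATCH --nodes=2
    #SBATCH --ntasks=24
    mpirun -np $SLURM_NTASKS ./my_mpi_program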
Software
• Common software is available to all users via the
“module” system
• To find out what software is available, you can type
module avail
• To set up your environment to use a software
package, type module load <package>/<version>
• Can install your own software
• But you are responsible for support
• We are happy to assist
EXAMPLES
Login and Modules example
• Log in:
• ssh username@login.rc.colorado.edu
• List and load modules
• module list
• module avail matlab
• module load matlab/matlab-2014b
• module load slurm
Batch Job example
• Contents of scripts
• matlab_tutorial_general.sh
• Wrapper script loads the slurm commands
• Changes to the appropriate directory
• Calls the matlab .m file
• matlab_tutorial_general_code.m
• Run matlab program as a batch job
• sbatch matlab_tutorial_general.sh
• Check job status:
• squeue -q janus-debug
• cat SERIAL.out
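
For reference, a wrapper script along these lines typically looks something like the following (a sketch, not the exact contents of matlab_tutorial_general.sh):

    #!/bin/bash
    #SBATCH --job-name=SERIAL           # job name
    #SBATCH --qos=janus-debug           # queue to run in
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --time=00:10:00
    #SBATCH --output=SERIAL.out         # output file read with "cat SERIAL.out"

    module load matlab/matlab-2014b

    cd $SLURM_SUBMIT_DIR                # run from the directory the job was submitted from
    matlab -nodesktop -nosplash -r "matlab_tutorial_general_code; exit"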
Example to work on
• Take the wrapper script matlab_tutorial_general.sh
• Alter it to do the following:
• Rename your output file to whatever you’d like
• Email yourself the output; do it at the beginning
of the job run
• Run on 12 cores
• Run on 2 nodes
• Rename your job to whatever you’d like
• Submit the batch script
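
If you get stuck, the Slurm directives corresponding to those changes look roughly like this (the names and email address are placeholders to replace with your own):

    #SBATCH --job-name=my_job_name          # rename the job
    #SBATCH --output=my_output_file.out     # rename the output file
    #SBATCH --mail-user=you@ucdenver.edu    # where to send the email
    #SBATCH --mail-type=BEGIN               # send it at the beginning of the job run
    #SBATCH --ntasks=12                     # run on 12 cores
    #SBATCH --nodes=2                       # run on 2 nodes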
Thank you!
Questions?
Slides available at
http://researchcomputing.github.io