Additional Topics in Research Computing
Shelley Knuth (shelley.knuth@colorado.edu)
Peter Ruprecht (peter.ruprecht@colorado.edu)
Research Computing @ CU Boulder
www.rc.colorado.edu

Outline
• Brief overview of Janus
• Update on replacement for Janus
• Alternate compute environment – condo
• PetaLibrary storage
• General supercomputing information
• Queues and job submission
• Keeping the system happy
• Using and installing software
• Compiling and submission examples
• Time for questions, feedback, and one-on-one

Research Computing - Contact
• rc-help@colorado.edu
  • Main contact email
  • Will go into our ticket system
• http://www.rc.colorado.edu - main web site
• http://data.colorado.edu - data management resource
• Mailing lists (announcements and collaboration): https://lists.rc.colorado.edu/mailman/listinfo/rc-announce
• Twitter: @CUBoulderRC and @cu_data
• Facebook: https://www.facebook.com/CuBoulderResearchComputing
• Meetup: http://www.meetup.com/University-of-ColoradoComputational-Science-and-Engineering/

JANUS SUPERCOMPUTER AND OTHER COMPUTE RESOURCES

What Is a Supercomputer?
• CU's supercomputer is one large computer made up of many smaller computers and processors
• Each individual computer is called a node
• Each node has processors/cores, which carry out the instructions of the computer
• In a supercomputer, all these different computers talk to each other through a communications network
  • On Janus: InfiniBand

Hardware - Janus Supercomputer
• 1368 compute nodes (Dell C6100)
• 16,428 total cores
• No battery backup of the compute nodes
• Fully non-blocking QDR InfiniBand network
• 960 TB of usable Lustre-based scratch storage
  • 16-20 GB/s maximum throughput

Different Node Types
• Login nodes
  • This is where you are when you log in to login.rc.colorado.edu
  • No heavy computation, interactive jobs, or long-running processes
  • Script or code editing, minor compiling
  • Job submission
• Compile nodes
  • Compiling code
  • Job submission
• Compute/batch nodes
  • This is where jobs submitted through the scheduler run
  • Intended for heavy computation

Replacing Janus
• Janus goes off warranty in July, but we expect to run it (or part of it) for another 9-12 months after that.
• We have applied for a National Science Foundation grant for a replacement cluster:
  • About 3x the FLOPS performance of Janus
  • Approx. 350 compute nodes, each with 24 cores and 128 GB RAM
  • Approx. 20 nodes with GPU coprocessors
  • Approx. 20 nodes with Xeon Phi processors
  • Scratch storage suited to a range of applications
• A decision from NSF is expected by early September; installation would begin early in 2016.
"Condo" Cluster
• CU provides infrastructure such as network, data center space, and scratch storage
• Partners buy nodes that fit into the infrastructure
• Partners get priority access on their nodes
• Other condo partners can backfill onto nodes not currently in use
• Node options will be updated with current hardware each year
• A given generation of nodes will run 5 years
• Little or no customization of OS and hardware options

Condo Node Options - 2015
Compute node:
• 2x 12-core 2.5 GHz Intel "Haswell" processors
• 128 GB RAM
• 1x 1 TB 7200 RPM hard drive
• 10 gigabit/s Ethernet
• Potential uses: general purpose high-performance computation that doesn't require low-latency parallel communication between nodes
• $6,875

GPU node:
• 2x 12-core 2.5 GHz Intel "Haswell" processors
• 128 GB RAM
• 1x 1 TB 7200 RPM hard drive
• 10 gigabit/s Ethernet
• 2x NVIDIA K20M GPU coprocessors
• Potential uses: molecular dynamics, image processing
• $12,509

Node Options – cont'd
High-memory node:
• 4x 12-core 3.0 GHz Intel "Ivy Bridge" processors
• 1024 GB RAM
• 10x 1 TB 7200 RPM hard drives in a high-performance RAID configuration
• 10 gigabit/s Ethernet
• Potential uses: genetics/genomics or finite-element applications
• $34,511

Notes:
• Dell is providing bulk discount pricing even for incremental purchases
• 5-year warranty
• Network infrastructure in place to support over 100 nodes at 10 Gbps

Condo Scheduling Overview
• All jobs are run through a batch/queue system (Slurm).
• Interactive jobs on compute nodes are allowed but these must be initiated through the scheduler.
• Each partner group has its own high-priority queue for jobs that will run on nodes that it has purchased.
• High-priority jobs are guaranteed to start running within 24 hours (unless other high-priority jobs are already running) and can run for a maximum wall time of 14 days.
• All partners also have access to a low-priority queue that can run on any condo nodes that are not already in use by their owners.
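To illustrate how these condo policies map onto Slurm options, here is a minimal batch-script sketch. The QOS names used below (mygroup-condo and condo-low) are hypothetical placeholders only; actual high- and low-priority queue names are assigned per partner group, so check with rc-help@colorado.edu for yours.

#!/bin/bash
# Condo batch-job sketch. The QOS names are hypothetical placeholders;
# substitute the queue names assigned to your partner group.
#SBATCH --job-name=condo-test
#SBATCH --qos=mygroup-condo       # hypothetical high-priority queue on nodes your group purchased
#SBATCH --nodes=1
#SBATCH --time=14-00:00:00        # high-priority jobs allow up to 14 days of wall time
#SBATCH --output=condo-test.out

echo "Running on nodes: $SLURM_JOB_NODELIST"

# To backfill onto idle condo nodes owned by other groups, submit the same
# script to the (hypothetical) low-priority queue instead:
#   sbatch --qos=condo-low condo_test.sh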
RMACC Conference
• Rocky Mountain Advanced Computing Consortium
• August 11-13, 2015, Wolf Law Building, CU-Boulder campus
• https://www.rmacc.org/HPCSymposium
• Hands-on and lecture tutorials about HPC, data, and technical topics
• Educational and student-focused panels
• Poster session for students

DATA STORAGE AND TRANSFER

Storage Spaces
• Home directories
  • Not high performance; not for direct computational output
  • 2 GB quota; snapshotted frequently
  • /home/user1234
• Project spaces
  • Not high performance; can be used to store or share programs, input files, and maybe small data files
  • 250 GB quota; snapshotted regularly
  • /projects/user1234
• Lustre parallel scratch filesystem
  • No hard quotas; no backups
  • Files created more than 180 days in the past may be purged at any time
  • /lustre/janus_scratch/user1234

Research Data Storage: PetaLibrary
• Long-term, large-scale storage option
  • "Active" or "Archive"
• Primary storage system for condo
• Modest charge to use
• We provide expertise and services around this storage
  • Data management
  • Consulting

PetaLibrary: Cost to Researchers

  Service offered                  Cost/TB/yr
  Active                           $100
  Active plus replication          $200
  Active plus one copy on tape     $120
  Archive (HSM)                    $60
  Archive plus one copy on tape    $80

• Additional free options
  • Snapshots: on-disk incremental backups
  • Sharing through Globus Online

ACCESSING AND UTILIZING RC RESOURCES

Initial Steps to Use RC Systems
• Apply for an RC account
  • https://portals.rc.colorado.edu/account/request
  • Read the Terms and Conditions
  • Choose "University of Colorado – Denver"
  • Use your UCD email address
• Get a One-Time Password device
• Apply for a computing allocation
  • Initially you will be added to UCD's existing umbrella allocation
  • Subsequently you will need a brief proposal for additional time

Logging In
• SSH to login.rc.colorado.edu
  • Use your RC username and One-Time Password
• From a login node, you can SSH to a compile node (janus-compile1, 2, 3, or 4) to build your programs
• All RC nodes use Red Hat Enterprise Linux 6

Job Scheduling
• On a supercomputer, jobs are scheduled rather than just run instantly at the command line
• Several different scheduling software packages are in common use
• Licensing, performance, and functionality issues have caused us to choose Slurm
  • Slurm = Simple Linux Utility for Resource Management
  • Open source
  • Increasingly popular at other sites
• More at https://www.rc.colorado.edu/support/examples/slurmtestjob

Running Jobs
• Do NOT compute on login or compile nodes
• Interactive jobs
  • Work interactively at the command line of a compute node
  • Command: salloc --qos=janus-debug
  • or: sinteractive --qos=janus-debug
• Batch jobs
  • Submit a job that will be executed when resources are available
  • Create a text file containing information about the job
  • Submit the job file to a queue: sbatch --qos=<queue> file
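For reference, a minimal sketch of such a job file follows. The script name, job name, and output file are placeholders; the #SBATCH directives shown are standard Slurm options.

#!/bin/bash
# minimal_job.sh - minimal batch job sketch (placeholder names)
#SBATCH --job-name=test-job
#SBATCH --qos=janus-debug        # debug queue, described below
#SBATCH --nodes=1
#SBATCH --time=00:05:00          # requested wall time
#SBATCH --output=test-job.out

echo "Hello from $(hostname)"

Submit it and check its status with:
  sbatch minimal_job.sh
  squeue -q janus-debug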
Quality of Service = Queues
• janus-debug
  • Only debugging – no production work
  • Maximum wall time of 1 hour (the maximum amount of time your job is allowed to run); 2 jobs per user
• janus
  • Default
  • Maximum wall time of 24 hours
• janus-long
  • Maximum wall time of 168 hours; limited number of nodes
• himem
• crestone
• gpu

Parallel Computation
• Special software is required to run a calculation across more than one processor core or more than one node
• Sometimes the compiler can automatically parallelize an algorithm, and frequently special libraries and functions can be used to abstract the parallelism
• OpenMP
  • For multiprocessing across cores in a single node
  • Most compilers can auto-parallelize with OpenMP
• MPI – Message Passing Interface
  • Can parallelize across multiple nodes
  • Libraries/functions compiled into the executable
  • Several implementations (OpenMPI, MVAPICH, MPICH, ...)
  • Usually requires a fast interconnect (like InfiniBand)

Software
• Common software is available to all users via the "module" system
• To find out what software is available, type: module avail
• To set up your environment to use a software package, type: module load <package>/<version>
• You can install your own software
  • But you are responsible for supporting it
  • We are happy to assist

EXAMPLES

Login and Modules Example
• Log in:
  • ssh username@login.rc.colorado.edu
• List and load modules:
  • module list
  • module avail matlab
  • module load matlab/matlab-2014b
  • module load slurm

Batch Job Example
• Contents of the scripts
  • matlab_tutorial_general.sh
    • Wrapper script that loads the Slurm commands
    • Changes to the appropriate directory
    • Calls the MATLAB .m file
  • matlab_tutorial_general_code.m
• Run the MATLAB program as a batch job:
  • sbatch matlab_tutorial_general.sh
• Check job status:
  • squeue -q janus-debug
  • cat SERIAL.out

Example to Work On
• Take the wrapper script matlab_tutorial_general.sh
• Alter it to do the following:
  • Rename your output file to whatever you'd like
  • Email yourself the output; do it at the beginning of the job run
  • Run on 12 cores
  • Run on 2 nodes
  • Rename your job to whatever you'd like
• Submit the batch script
• (One possible set of directives for these changes is sketched after the closing slide)

Thank you! Questions?
Slides available at http://researchcomputing.github.io
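As a sketch of one possible solution to the exercise (not the official one), the #SBATCH directives below rename the job and its output file, request an email notification at the start of the run, and ask for 12 cores across 2 nodes. The job name, output file name, and email address are placeholders, and the body of the wrapper is assumed to follow the tutorial's original matlab_tutorial_general.sh; adjust it to match the distributed version.

#!/bin/bash
# Sketch of the exercise changes to matlab_tutorial_general.sh.
# Job name, output file, and email address are placeholders.
#SBATCH --job-name=my_matlab_job        # rename the job to whatever you'd like
#SBATCH --output=my_matlab_output.out   # rename the output file
#SBATCH --mail-user=you@ucdenver.edu    # placeholder: your own address
#SBATCH --mail-type=BEGIN               # send the email at the beginning of the job run
#SBATCH --nodes=2                       # run on 2 nodes
#SBATCH --ntasks=12                     # run on 12 cores in total
#SBATCH --qos=janus-debug
#SBATCH --time=00:10:00

# Body of the wrapper, assumed to follow the original tutorial script:
# load the MATLAB module, change to the submission directory, and run the .m file.
module load matlab/matlab-2014b
cd $SLURM_SUBMIT_DIR
matlab -nodisplay -nosplash -r "matlab_tutorial_general_code; exit"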