ISTeC Cray High-Performance Computing System
Richard Casey, PhD
RMRCE CSU Center for Bioinformatics

System Architecture
[Diagram: front and back views of the cabinet showing compute blades (batch compute nodes), compute blades (interactive compute nodes), the SeaStar 2+ interconnect, and the login, boot, and Lustre file system nodes.]

XT6m Compute Node Architecture
[Diagram: four NUMA dies per compute node, each with six "Greyhound" cores and a 6 MB shared L3 cache, two DDR3 memory channels per die, and HT3 (HyperTransport 3) links between dies and to the interconnect.]
• Each compute node contains 2 processors (2 sockets)
• 64-bit AMD Opteron “Magny-Cours” 1.9 GHz processors
• 1 NUMA processor = 6 cores
• 4 NUMA processors per compute node
• 24 cores per compute node
• 4 compute nodes per compute blade
• 32 GB RAM (shared) per compute node = 1.664 TB total RAM (ECC DDR3 SDRAM)
• 1.33 GB RAM per core

Compute Node Status
• Check whether interactive and batch compute nodes are up or down:
  – xtprocadmin

  NID   (HEX)   NODENAME     TYPE      STATUS   MODE
  12    0xc     c0-0c0s3n0   compute   up       interactive
  13    0xd     c0-0c0s3n1   compute   up       interactive
  14    0xe     c0-0c0s3n2   compute   up       interactive
  15    0xf     c0-0c0s3n3   compute   up       interactive
  16    0x10    c0-0c0s4n0   compute   up       interactive
  17    0x11    c0-0c0s4n1   compute   up       interactive
  18    0x12    c0-0c0s4n2   compute   up       interactive
  42    0x2a    c0-0c1s2n2   compute   up       batch
  43    0x2b    c0-0c1s2n3   compute   up       batch
  44    0x2c    c0-0c1s3n0   compute   up       batch
  45    0x2d    c0-0c1s3n1   compute   up       batch
  61    0x3d    c0-0c1s7n1   compute   up       batch
  62    0x3e    c0-0c1s7n2   compute   up       batch
  63    0x3f    c0-0c1s7n3   compute   up       batch

• Naming convention: Cabinet X-Y, Cage X, Slot X, Node X
  – e.g. c0-0c0s3n0 = Cabinet 0-0, Cage 0, Slot 3, Node 0
• Currently:
  – 1,248 batch compute cores (fluctuates somewhat)
  – 192 interactive compute cores (fluctuates somewhat)

Compute Node Status
• Check the state of interactive and batch compute nodes and whether they are already allocated to other users' jobs:
  – xtnodestat

  Current Allocation Status at Tue Apr 19 08:15:02 2011

          C0-0
      n3  -------B
      n2  -------B
      n1  -------
    c1n0  -------
      n3  SSSaa;-
      n2     aa;-
      n1     aa;-
    c0n0  SSSaa;-
         s01234567

  (Rows are cage/node positions; columns s0-s7 are the slots, i.e. blades, in cabinet C0-0. On the slide, callouts label the allocated and free batch compute nodes and the allocated and free interactive compute nodes in this display; allocated nodes show a job letter such as "a" or "B".)

  Legend:
     nonexistent node                     S  service node (login, boot, lustrefs)
  ;  free interactive compute node        -  free batch compute node
  A  allocated, but idle compute node     ?  suspect compute node
  X  down compute node                    Y  down or admindown service node
  Z  admindown compute node

  Available compute nodes: 4 interactive, 38 batch

Batch Queues
• Current batch queue configuration
• Under re-evaluation; may change in the future to fair-share queues

  Queue name       Priority   Max runtime (wallclock)   Max # jobs per user
  small            high       1 hr.                     20
  medium           medium     24 hrs.                   2
  large            low        168 hrs. (1 week)         1
  ccm_queue        ---        ---                       ---
  priority_queue   ---        ---                       ---
  batch            ---        ---                       ---
  woodward         ---        ---                       ---
  woodward_ccm     ---        ---                       ---
  EFS              ---        ---                       ---

Batch Jobs
• PBS/Torque/Moab Batch Queue Management System
  – For submission and management of jobs in batch queues
  – Use for jobs with large resource requirements (long-running, number of cores, memory, etc.)
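Before submitting work to the batch queues covered below, it can help to confirm how many batch-mode compute nodes (and hence cores) are actually up. A minimal sketch, assuming the xtprocadmin column layout shown above (STATUS in field 5, MODE in field 6); adjust the field numbers if your xtprocadmin output differs:

  # count batch-mode compute nodes reported as "up", then scale by 24 cores per node
  xtprocadmin | awk '$5 == "up" && $6 == "batch" {n++} END {print n*24, "batch cores up"}'

The same idea works for interactive nodes by matching "interactive" instead of "batch".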
• List all available queues:
  – qstat -Q  (brief)
  – qstat -Qf (full)

  rcasey@cray2:~> qstat -Q
  Queue    Max  Tot  Ena  Str  Que  Run  Hld  Wat  Trn  Ext  T
  -------- ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  -
  batch    0    0    yes  yes  0    0    0    0    0    0    E

• Show the status of jobs in all queues:
  – qstat (all queued jobs)
  – qstat -u username (only queued jobs for "username")
  – (Note: if there are no jobs running in any of the batch queues, this command shows nothing and just returns the Linux prompt.)

  rcasey@cray2:~/lustrefs/mpi_c> qstat
  Job id           Name         User        Time Use  S  Queue
  ---------------- ------------ ----------- --------- -- -----
  1753.sdb         mpic.job     rcasey      0         R  batch

Batch Jobs
• Common job states:
  – Q: job is queued
  – R: job is running
  – E: job is exiting after having run
  – C: job is completed after having run
• Submit a job to the default batch queue:
  – qsub filename
  – "filename" is the name of a file that contains batch queue commands
  – Command-line directives override batch script directives
  – e.g. "qsub -N newname script"; "newname" overrides "-N name" in the batch script
• Delete a job from the batch queues:
  – qdel jobid
  – "jobid" is the job ID number as displayed by the "qstat" command. You must be the owner of the job in order to delete it.

Sample Batch Job Script

  #!/bin/bash
  #PBS -N jobname
  #PBS -j oe
  #PBS -l mppwidth=24
  #PBS -l mppdepth=1
  #PBS -l walltime=1:00:00
  #PBS -q small

  cd $PBS_O_WORKDIR
  date
  export OMP_NUM_THREADS=1
  aprun -n24 -d1 executable

• Batch queue directives:
  – -N             name of the job
  – -j oe          combine standard output and standard error in a single file
  – -l mppwidth    number of cores to allocate to the job (MPI tasks)
  – -l mppdepth    number of OpenMP threads per MPI task
  – -l walltime    maximum amount of wall clock time for the job to run (hh:mm:ss)
  – -q             queue to submit the job to (if none is specified, the job is sent to the small queue)

Sample Batch Job Script
• The PBS_O_WORKDIR environment variable is set by Torque/PBS. It contains the absolute path of the directory from which you submitted your job and is required for Torque/PBS to find your executable files.
• Linux commands and environment variables can be included in a batch job script.
• The value set in the aprun "-n" parameter should match the value set in the PBS "mppwidth" directive:
  – e.g. #PBS -l mppwidth=24
  – e.g. aprun -n 24 exe
• Request proper resources:
  – If "-n" or "mppwidth" > 1,248, the job will be held in the queued state for a while and then deleted
  – If "mppwidth" < "-n", you get the error message "apsched: claim exceeds reservation's nodecount"
  – If "mppwidth" > "-n", then OK
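Putting the pieces above together, a typical submission session might look like the following sketch (the script name, username, and job ID are taken from the qstat example earlier and are purely illustrative):

  qsub mpic.job        # submit the batch script; Torque prints the job ID, e.g. 1753.sdb
  qstat -u rcasey      # watch the job move through states Q -> R -> E -> C
  qdel 1753.sdb        # delete the job early if needed (you must own it)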
Sample Batch Job Script (MPI)
• For MPI code
• ALPS places MPI tasks sequentially on cores within a compute node
• If mppwidth = -n > 24, ALPS places MPI tasks on multiple compute nodes (see the multi-node sketch at the end of this section)

  #PBS -N mpicode
  #PBS -j oe
  #PBS -l mppwidth=12          # mppwidth = -n = number of cores
  #PBS -l walltime=00:10:00
  #PBS -q small

  cd $PBS_O_WORKDIR
  cc -o mpicode mpicode.c
  aprun -n12 ./mpicode

Sample Batch Job Script (OpenMP)
• For OpenMP code
• ALPS places OpenMP threads sequentially on cores within a compute node
• mppdepth = OMP_NUM_THREADS = -d <= 24
• If -d exceeds 24, you get the error message "apsched: -d value cannot exceed largest node size"

  #PBS -N openmpcode
  #PBS -j oe
  #PBS -l mppdepth=6           # mppdepth = OMP_NUM_THREADS = -d <= 24 = number of cores
  #PBS -l walltime=00:10:00
  #PBS -q small

  cd $PBS_O_WORKDIR
  export OMP_NUM_THREADS=6
  cc -o openmpcode openmpcode.c
  aprun -d6 ./openmpcode

Sample Batch Job Script (Hybrid MPI/OpenMP)
• For hybrid MPI / OpenMP code
• ALPS places MPI tasks sequentially on cores within a compute node and launches OpenMP threads per MPI task
• By default, ALPS places one OpenMP thread per MPI task; use mppdepth = OMP_NUM_THREADS = -d to change the number of threads per task

  #PBS -N hybrid
  #PBS -j oe
  #PBS -l mppwidth=6           # mppwidth = -n = number of MPI tasks
  #PBS -l mppdepth=2           # mppdepth = OMP_NUM_THREADS = -d <= 24 = OpenMP threads per MPI task
  #PBS -l walltime=00:10:00
  #PBS -q small

  cd $PBS_O_WORKDIR
  export OMP_NUM_THREADS=2
  cc -o hybridcode hybridcode.c
  aprun -n6 -d2 ./hybridcode
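As a final illustration of the multi-node case noted in the MPI example above, the sketch below requests 48 cores, which exceeds the 24 cores of a single node, so ALPS spreads the MPI tasks across two compute nodes. The job name is made up; the executable is assumed to be the mpicode binary built in the MPI example.

  #PBS -N mpicode2nodes
  #PBS -j oe
  #PBS -l mppwidth=48          # 48 MPI tasks > 24 cores per node, so ALPS uses two compute nodes
  #PBS -l walltime=00:10:00
  #PBS -q small

  cd $PBS_O_WORKDIR
  aprun -n48 ./mpicode         # -n must match mppwidth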