Streamlining Research Computing Infrastructure: A Small School's Experience

Gowtham
HPC Research Scientist, ITS; Adj. Asst. Professor, Physics/ECE
g@mtu.edu | (906) 487-3593 | http://www.mtu.edu
Houghton, MI

[Map: Houghton, MI, with distances to Twin Cities, MN (375 miles); Duluth, MN (215 miles); Green Bay, WI (215 miles); Sault Ste. Marie, MI / Canada (265 miles); Detroit, MI (550 miles); and Isle Royale National Park, MI (56 miles)]

Fall 2013
- Population, Houghton/Hancock: 15,000 (22,000)
- Michigan Tech
  - Faculty: 500
  - Students: 7,000 (5,600 + 1,400)
  - Staff: 1,000
  - General budget: $170 million
  - Endowment value: $83 million
  - Sponsored programs awards: $48 million
[Figure: Michigan Tech through the years — 1885, 1897, 1927, 1964]

An as-is snapshot (January 2011)
- 8 mini to medium-sized clusters
- Spread around campus
- Varying versions of Rocks
- Different software configurations
- Single power supply for most components
- Manual systems administration and maintenance
- Minimal end user training and documentation
These 8 clusters, purchased mostly with start-up funds, had 1,000 CPU cores spanning several hardware generations and a few low-end GPUs. Only one of them had InfiniBand (40 Gb/s).

Initial consolidation (January 2011 — March 2011)
- Move all clusters to one of two data centers
- Merge clusters when possible
- Consistent racking, cabling and labeling scheme
- Upgrade to Rocks 5.4.2
- Identical software configuration
- End user training
- Complete documentation
Labeling examples: R107B36 OB1 = Rack 107, Back side, 36th slot, On-Board NIC 1 (of a node); R107B41 P01 = Rack 107, Back side, 41st slot, Port 01 (of the switch).
Compute nodes deemed not up to the mark were set aside for building a test cluster: wigner.research.mtu.edu

Capture usage pattern (April 2011 — December 2011)
- hpcmonitor.it.mtu.edu
- Ganglia monitoring system (a configuration sketch appears at the end of this section)
Monitoring multiple clusters with Ganglia: http://central6.rocksclusters.org/roll-documentation/ganglia/6.1/x111.html

Analysis of usage pattern (January 2012)
- Low usage: 20% on most days, 45-50% on the luckiest of days
- Inability and/or unwillingness to share resources
- Lack of resources for researchers in need
- More systems administration work
- Space, power and cooling costs
- Less time for research, teaching and collaborations

The meeting (January 2012)
- VPR, Provost, CIO, CTO, Chair of the HPC Committee and yours truly
- Strongly encourage sharing of under-utilized clusters
- End of life for existing individual clusters
- Stop funding new individual clusters
- Acquire one big, centrally managed cluster
- Central administration will fully support the new policies
- One-person committees
- No exceptions for anyone

The philosophy (January 2012)
Greatest good for the greatest number.
- Warren Perger and Gifford Pinchot

Much is said of the questions of this kind, about greatest good for the greatest number. But the greatest number too often is found to be one. It is never the greatest number in the common meaning of the term that makes the greatest noise and stir on questions mixed with money …
- John Muir

It's not just a keel and a hull and a deck and sails. That's what a ship needs but not what a ship is. But what a ship is … what the Black Pearl Superior really is … is freedom.
- Captain Jack Sparrow, Pirates of the Caribbean
Adopted shamelessly from Henry Neeman's SC11 presentation, Supercomputing in Plain English.
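Before moving on to the acquisition, a note on the monitoring setup above: a single Ganglia collector (gmetad) on hpcmonitor.it.mtu.edu can aggregate several clusters by listing one data_source entry per cluster front end in gmetad.conf. The excerpt below is only a sketch — the cluster and host names are hypothetical, the file location varies with the Rocks/Ganglia version, and the Rocks-specific steps are in the roll documentation linked above.

    # gmetad.conf on hpcmonitor.it.mtu.edu (illustrative excerpt)
    # One data_source line per cluster, pointing at the gmond running on that
    # cluster's front end (default gmond port: 8649).
    data_source "cluster-a" cluster-a.research.mtu.edu:8649
    data_source "cluster-b" cluster-b.research.mtu.edu:8649
    data_source "cluster-c" cluster-c.research.mtu.edu:8649

With this in place, one Ganglia web front end shows all clusters side by side, which is what made the usage analysis above possible.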
Bidding/Acquiring process (February 2012 — May 2013)
- $750k for everything: $675k for hardware + 10% for unexpected expenses
- 5 rounds with 4 vendors (2 local; 2 brand names)
- Local vendor won the bid (February 2013)
- Staggered delivery of components (April — May 2013)
- Fly-wheel installation (April — May 2013)
- Load test with building and campus generators

wigner.research (January 2011 — December 2013)
- Built with retired nodes from other clusters
- 1 front end
- 2 login nodes
- 1 NAS node (2 TB RAID1 storage)
- 32 compute nodes
- 50+ software suites
- 150+ users
The first version of wigner had just two nodes, 1 front end and 1 compute node, built with retired lab PCs and no switch. As of Spring 2014, wigner has been retired; its nodes are being used as a testing platform for the upcoming Data Science program at Michigan Tech and to teach building and managing a research computing cluster as part of PH4395: Computer Simulations.

wigner.research (March 2011 — December 2013)
- HPC proving grounds
- OS installation and customization
- Software compilation and integration with the queueing system
- Extensive testing of policies, procedures and user experience
- PH4390, PH4395 and MA5903 students
- Small to medium-sized research groups
- Automating systems administration
- Integrating configuration files, logs, etc. with a revision control system

rocks.it.mtu.edu (April 2012 — present)
- Central Rocks server (x86_64)
- Serves 6.1, 6.0, 5.5, 5.4.3 and 5.4.2
- Saves time during installation
- Facilitates inclusion of cluster-specific rolls
Scripts and procedures were provided by Philip Papadopoulos.

Superior (June 2013)
- 1 front end
- 2 login nodes
- 1 NAS node: 33 TB usable RAID60 storage space
- 72 CPU compute nodes
- 5 GPU compute nodes
  - 4 NVIDIA Tesla M2090 GPUs (448 CUDA cores)
Compute nodes (CPU and GPU): Intel Sandy Bridge E5-2670, 2.60 GHz, 16 CPU cores and 64 GB RAM.
Housed in the newly built Great Lakes Research Center: http://www.mtu.edu/greatlakes/

Superior (June 2013)
- 56 Gb/s InfiniBand
  - Primary research network
  - Copper cables
- Gigabit Ethernet
  - Administrative and secondary research network
- Redundant power supply for every component
With 81 total nodes, there was 33% room for growth before needing to re-design the InfiniBand switch system. The final cost was $680k; the remaining $70k was used to build a test cluster: portage.research.mtu.edu

Superior (June 2013)
- Physical assembly (7 days): racking, cabling and labeling
- Rocks Cluster Distribution (5 days): OS installation, customization, compliance; software compilation, user accounts
- 3 pilot research groups (14 days): a reward for being good and productive users; they helped fix bugs, etc.
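The node counts and per-node specifications above are enough to reproduce the theoretical peak quoted in the HPL benchmark section that follows. The sketch below assumes 8 double-precision floating point operations per cycle per core (Sandy Bridge with AVX); that counting convention is an assumption, not stated in the deck.

    #!/bin/bash
    # Back-of-the-envelope peak for Superior's CPU partition:
    # nodes x cores/node x clock (GHz) x FLOPs/cycle gives GFLOPS;
    # divide by 1000 for TFLOPS.
    NODES=72; CORES_PER_NODE=16; GHZ=2.60; FLOPS_PER_CYCLE=8
    echo "scale=2; $NODES * $CORES_PER_NODE * $GHZ * $FLOPS_PER_CYCLE / 1000" | bc
    # Prints 23.96, matching the theoretical figure in the HPL table below.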
Superior (June 2013)
[Diagram: front end, login nodes, CPU compute nodes, GPU compute nodes and storage node connected to the Ethernet and InfiniBand switch systems]

Superior (June 2013)
- short.q (compute-0-N; N: 0-7): 24-hour limit on run time
- long.q (compute-0-N; N: 8-81): no limit on run time
- gpu.q (compute-0-N; N: 82-86): no limit on run time
http://superior.research.mtu.edu/available-resources

Benchmarks: HPL (June 2013)
- Theoretical: 23.96 TFLOPS
- Practical: 21.57 TFLOPS (~90% of theoretical)
- Measured: 21.38 TFLOPS (89.23% of theoretical)
Theoretical performance = # of nodes x # of cores per node x clock frequency (cycles/second) x # of floating point operations per cycle
http://netlib.org/benchmark/hpl

Benchmarks: LAMMPS (June 2013)
[Chart: total run time (hours) vs. # of nodes (CPU cores) — 2 (32), 4 (64), 6 (96), 10 (160) — comparing Michigan Tech's Superior and NASA's Pleiades]
Benjamin Jensen (advisor: Dr. Gregory Odegard), Computational Mechanics and Materials Research Laboratory, Mechanical Engineering-Engineering Mechanics. Results are from a simulation involving 1,440 atoms and 500,000 time steps.

Account request
- Résumé
- Proposal
  - Title and abstract
  - User population
  - Preliminary results
  - Nature of data sets
  - Required resources
  - List of software/compilers
  - Scalability
  - Source of funding
Submit the completed proposal to Dr. Warren Perger, Chair, HPC Committee (wfp@mtu.edu).
A LaTeX/MS Word template is available at http://superior.research.mtu.edu/account-request

Why a proposal?
- A metric for merit
- An easily accessible list of projects
- Know what the facility is being used for: intellectual scholarship and computational requirements
- For VPR, CIO, deans, department chairs and institutional directors
- A fail-safe opportunity to practice writing proposals seeking allocations in NSF's XSEDE, etc.
http://nsf.gov | http://xsede.org | http://superior.research.mtu.edu/list-of-projects

User population
- Tier A
  - New faculty
  - Established faculty with funding
- Tier B
  - Established faculty with no (immediate) funding
Group members and external collaborators inherit their PI's tier. New faculty status is valid for 2 years from the first day of work.

Job submission: qgenscript
One-stop shop for
- Array jobs
- Exclusive node access
- Wait on pending jobs
- Email/SMS notifications
- Wait time statistics
- Command to submit the script
- Job information file
(An illustrative submission script appears at the end of this section.)
http://superior.research.mtu.edu/job-submission/#batch-submission-scripts

Job scheduling policy
- Users' priorities are computed periodically
  - A weighted function of CPU time and production
  - In effect only when Superior is running at near 100% capacity
- Pre-emption and advance reservation are disabled
- Any job that starts will run to completion
http://superior.research.mtu.edu/job-submission/#scheduling-policy

Email/SMS notifications
http://superior.research.mtu.edu/job-submission/#sms-notifications

Job information file

Running programs in login nodes
- Reduces performance for all users
- First offense
  - Terminates the program
  - An email notification [cc: user's advisor]
- Subsequent offenses
  - Same as first offense
  - Logs the user out and locks down the account
A continued trend will be grounds for removal of the user's account.
http://superior.research.mtu.edu/job-submission/#running-programs-on-login-nodes
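For illustration, a hand-written Grid Engine submission script of the kind qgenscript produces might look like the sketch below. The parallel environment name (mpi), the program name and the optional directives are assumptions; the scripts qgenscript actually generates may differ.

    #!/bin/bash
    # Illustrative Grid Engine submission script for Superior (not qgenscript output).
    #$ -N sample_job                 # job name
    #$ -q long.q                     # short.q (24 h run-time limit), long.q or gpu.q
    #$ -pe mpi 32                    # request 32 slots (e.g., 2 nodes x 16 cores); PE name is an assumption
    #$ -cwd                          # run in the submission directory
    #$ -j y                          # merge stdout and stderr
    #$ -m abe                        # email on begin, end and abort
    #$ -M user@mtu.edu
    ##$ -t 1-10                      # uncomment for a 10-task array job
    ##$ -hold_jid 12345              # uncomment to wait on a pending job (replace 12345 with its job ID)

    mpirun -np $NSLOTS ./my_program  # NSLOTS is set by Grid Engine

The script is submitted with qsub and its state can be checked with qstat or the locally developed wrappers listed in the next section.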
Disk usage
- Data is not backed up
- Limits per user
  - /home/john: 25 MB
  - /research/john: decided on a per-proposal basis
- When a user exceeds the limit
  - 12 reminders at 6-hour intervals [cc: user's advisor]
  - 13th reminder logs out the user and locks down the account
(A sketch of this reminder workflow appears at the end of this section.)
http://superior.research.mtu.edu/job-submission/#disk-usage

Useful commands (developed at Michigan Tech)
- qgenscript
- qresources
- qlist
- qnodes-map
- qnodes-active | qnodes-idle
- qwaittime
- qstatus | quser | qgroup
- qnodes-in-job
- qjobs-in-node
- qjobs-in-active-nodes
- qjobinfo | qjobcount
- qusage
http://superior.research.mtu.edu/job-submission/#useful-commands

Usage reports
All PIs and the Chair of the HPC Committee receive a weekly report.
VPR, CIO, deans, department chairs and institutional directors receive quarterly and annual reports (or when necessary).

Usage reports (July 2013 — December 2013)
- 21 projects: 10 Tier A + 11 Tier B
- 100 users
- 9 publications
- 75+% busy on most days
- $325k worth of usage (~50% of the initial investment)
- Cost recovery model: $0.10 per CPU core per hour

Metrics
Cannot manage what cannot be measured.
Not everything that's (easily) measurable is (really) meaningful.
Not everything that's (really) meaningful is (easily) measurable.

Metrics
- Move towards a merit-based system
- Easily measurable quantities
  - Who the users are
  - # of CPUs and total CPU time
- Really meaningful entities
  - Publications: type (poster, conference proceeding, journal) and impact factor
  - Citations

Metrics: job priority
- The system already knows who the users are
- An in-house algorithm computes users' priorities
- Publications are reported to Dr. Warren Perger, Chair, HPC Committee (wfp@mtu.edu)

Metrics
Interactive visualizations of usage reports are built using the Highcharts framework.
http://superior.research.mtu.edu/usage-reports

Metrics: global impact
[Figure legend: Michigan Tech original; Journal Article; Conference Proceeding; Book Chapter; MS Thesis; PhD Dissertation]
http://superior.research.mtu.edu/list-of-publications

Further consolidation (August 2013 — December 2013)
- Move all clusters to the Great Lakes Research Center
- Upgrade to Rocks 6.1 and add a login node
- Retire individual clusters when possible
- 16 compute nodes and 1 NAS node added to Superior
- portage.research.mtu.edu
  - Segue to Superior
  - 1 front end, 1 login node, 1 NAS node and 6 compute nodes
  - Testing, course work projects and beginner research groups

An as-is snapshot (January 2014)
- 1 big, 1 mini (central) and 3 individual clusters
- 1 data center with the .research.mtu.edu network
- Rocks 6.1
- Identical software configurations
- Automated systems administration and maintenance
- Extensive end user training
- Complete documentation
The Immersive Visualization Studio (IVS) is powered by a Rocks 5.4.2 cluster and has 24 HD screens (46", 240 Hz LED) working in unison to create a 160 sq. ft. display wall.
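The disk usage policy above (12 reminders at 6-hour intervals, lock-down on the 13th) lends itself to a simple cron-driven check. The sketch below is purely illustrative, assuming a cron job that runs every 6 hours; the paths, strike file, mail command and lock mechanism are assumptions rather than the scripts actually used on Superior, and the cc to the user's advisor is omitted.

    #!/bin/bash
    # Illustrative over-quota reminder loop (run from cron every 6 hours).
    user="john"
    limit_kb=$((25 * 1024))                          # 25 MB /home limit
    used_kb=$(du -sk "/home/$user" | awk '{print $1}')
    strikes_file="/var/tmp/quota-strikes-$user"

    if [ "$used_kb" -le "$limit_kb" ]; then
        rm -f "$strikes_file"                        # back under quota: reset the count
        exit 0
    fi

    strikes=$(( $(cat "$strikes_file" 2>/dev/null || echo 0) + 1 ))
    echo "$strikes" > "$strikes_file"

    if [ "$strikes" -le 12 ]; then
        echo "/home/$user is over its 25 MB limit; please clean up." | \
            mail -s "Disk quota reminder $strikes of 12" "$user@mtu.edu"
    else
        usermod --lock "$user"                       # 13th strike: lock the account
    fi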
Status updates: @MTUHPCStatus

Immediate future (February 2014 and beyond)
- More tools to enhance the user experience
- Videos for self-paced learning of command line Linux
- Encourage GPU computing
- Expand storage
- Provide backup
- Re-design the InfiniBand switch system (216 nodes)
- Plan for an expanded (or new) Superior

Thanks be to
- Philip Papadopoulos and Luca Clementi (UCSD and SDSC)
- Timothy Carlson (PNNL)
- Reuti (Philipps-Universität Marburg)
- Alexander Chekholko (Stanford University)
- The Rocks, Grid Engine and Ganglia mailing lists
- Henry Neeman (University of Oklahoma)
- Steven Gordon (The Ohio State University)
- Gergana Slavova, Walter Shands and Michael Tucker (Intel)
- Gaurav Sharma and Scott Benway (MathWorks)
- Adam DeConinck (NVIDIA)