High Performance Cluster (HPC)

The Emory Life Physical Sciences cluster (ELLIPSE), a 256-node, 1024-CPU, high-performance Sun computing cluster, is the latest arrival in the Emory High Performance Compute Cluster (EHPCC), a subscription-based, shared resource for the University community that is managed by the High Performance Computing (HPC) Group. The addition of ELLIPSE to existing campus computational resources, such as those at the Biomolecular Computing Resource (BIMCORE) and the Cherry Emerson Center for Scientific Computation, moves the University into the top tier of institutions for conducting computational investigations in neural simulation, genomic comparison, biological sequence analysis, statistical research, algorithm research and development, and numerical analysis.

Computing clusters provide a reasonably inexpensive way to aggregate computing power and dramatically cut the time needed to find answers in research that requires the analysis of vast amounts of data. Eight and one-half hours of ELLIPSE operation is equivalent to an entire year of 24-hour days on a fast desktop, and four to five days is equivalent to two months on its 128-CPU predecessor, which is still in service.

Additional Information: Keven Haynes, khaynes@emory.edu

HPC FAQs

What does High Performance Computing (HPC) mean?
- Computing used for scientific research
- Performs highly calculation-intensive tasks (e.g., weather forecasting, molecular modeling, string matching)
- A large collection of computers, connected via a high-speed network or fabric
- Uses multiple CPUs to distribute computational load and aggregate input/output
- Computation runs in parallel
- Work is managed via a "job scheduler"

HPC Cluster
- 256 dual-core, dual-socket AMD Opteron-based compute nodes (1024 cores total)
- 8 GB RAM per node, 2 GB RAM per core
- 250 GB local storage per node
- ~8 TB global storage (parallel file system)
- Gigabit Ethernet, with a separate management network
- 11 additional servers

Cluster Nodes
- 256 Sun x2200s with AMD Opteron 2218 processors
- CentOS 4 Linux (whitebox Red Hat)
- 8 GB DDR2 RAM ("fat" nodes have 32 GB RAM); local 250 GB SATA drive
- Single gigabit data connection to the switch
- Global file system (IBRIX) mounted

Networking
- Separate data and management networks
- Data network: Foundry BI-RX 16
- Management network: 9 Foundry stackables; MRV console switches

Storage
- Global, parallel file system: IBRIX
- Sun StorEdge 6140, five trays of 16 15K rpm Fibre Channel drives, connected via 4 Gb fibre connections
- Five Sun x4100 file-system servers: one IBRIX Fusion Manager and four segment servers with four bonded Ethernet connections

IBRIX file system
- Looks like an ext3 file system, because it is (not NFS v4): a segmented ext3 file system
- Scales (horizontally) to thousands of servers and hundreds of petabytes
- Efficient with both small and large I/O
- Partial online operation, dynamic load balancing
- Will run on any Linux hardware

The scheduler
- Users submit work to the cluster via SGE (the 'qsub' command) and ssh; a minimal submission sketch follows this list
- SGE can manage up to 200,000 job submissions
- Distributed Resource Management (DRM)
- Policy-based resource allocation algorithms (queues)
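As a rough sketch of the submission workflow just described, the example below shows what a minimal SGE batch script and its submission might look like. The job name, file names, and program (my_analysis) are hypothetical placeholders; the directives are standard SGE options rather than site-specific settings, and the queues referenced are described under Rate Structure below.

    #!/bin/bash
    # example_job.sh -- minimal SGE batch script (illustrative sketch only;
    # the job name, file names, and program below are hypothetical)
    #$ -N example_job        # job name reported by qstat
    #$ -q all.q              # target the default/general queue
    #$ -cwd                  # run from the submission directory
    #$ -o example_job.out    # standard output file
    #$ -e example_job.err    # standard error file
    #$ -l h_cpu=3600         # request one hour of CPU time, within the all.q limit

    ./my_analysis input.dat  # the actual computation, run on a compute node

The script would be submitted from a login node with 'qsub example_job.sh'; 'qstat' reports its state, and 'qdel <job_id>' removes it from the queue.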
Applications
- MATLAB
- Geant4
- Genesis (Neuroscience)
- Supported soon:
  - iNquiry (BioInformatics)
  - GCC compilers (soon: PGI compilers)

Performance
- Estimated ~3 teraflops at 80% efficiency (theoretical)
- Achieved 2 GB/sec writes over the network
- 10 minutes of cluster operation = ~7 days on a fast desktop
- 8.5 hours = an entire year of 24-hour days

Rate Structure

The "long.q" cluster queue ($0.025/CPU hour)
- Currently spans all compute nodes
- All users can access this queue
- No hard or soft job/time limits
- 1 job slot per compute node at this time
- All jobs are run at a Unix nice level of +15
- Queue suspend threshold of NP_LOAD_AVG=1.75
- Queue instance will close/subordinate if 1 or more express jobs are running
- Queue instance will close/subordinate if 2 or more general jobs are running

The "all.q" cluster queue ($0.03/CPU hour)
- This is the default, or "general", queue
- Spans all compute nodes
- No access control list
- Queue suspend threshold of NP_LOAD_AVG=1.75
- 4 job slots per compute node at this time
- 12-hour hard limit (h_cpu = 43200) on all jobs
- Queue instance will close/subordinate if 1 or more express jobs are running

The "express.q" cluster queue ($0.05/CPU hour)
- Spans all compute nodes
- Has an access control list (currently only the expressUsers user list has access)
- No queue suspend threshold
- 2 job slots per compute node at this time
- 2-hour hard limit (h_cpu = 7200) on all jobs
- Not subordinate to any other queue

(A submission sketch using these queue parameters follows the Additional Resources section.)

Additional Resources

BIMCORE
http://www.bimcore.emory.edu/
Contact: Steve Pittard, wsp@emory.edu

Cherry Emerson Center for Scientific Computing
http://www.emerson.emory.edu/
Contact:
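As a companion to the queue descriptions above, here is a hedged sketch of directing a job at a specific queue and monitoring it with standard SGE utilities. 'short_job.sh' is a hypothetical script, and access to express.q assumes membership in the expressUsers list.

    # List the cluster queues and their current slot usage (standard SGE utility):
    qstat -g c

    # Submit to the express queue, staying within its 2-hour (h_cpu = 7200) limit;
    # 'short_job.sh' is a hypothetical script.
    qsub -q express.q -l h_cpu=7200 short_job.sh

    # Monitor the job, or remove it if necessary:
    qstat
    qdel <job_id>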