Fyzikální ústav AV ČR, v. v. i.
Na Slovance 2, 182 21 Praha 8
eli-cz@fzu.cz
www.eli-beams.eu

HPC - Computing cluster for the ELI project

1. General specifications

Specification solution 1 – Compute nodes for the HPC cluster (minimum requirements)

Goal: Compute nodes for the High Performance Computing cluster
Rack: 19" rack with rack mount kits
Number of nodes: min. 56
Architecture: x86-64
Number of cores per node: at least 16 physical CPU cores (hyperthreading not taken into account)
RAM: 8 GB per core, total cluster RAM min. 7.2 TB, ECC DDR3 1866 MHz (or faster)
HDD: at least 128 GB 2.5" SSD, SATA III or SAS, for OS and swap
CPU: minimum SPECfp2006 rate baseline for one node: 530
GPU: without GPU
MIC: without MIC
InfiniBand connection: 1 x InfiniBand QDR or FDR port per node
LAN connection: 1 x 1 Gbit RJ45 port per node (with PXE booting support)
OS: open source Linux compatible with CentOS, Scientific Linux or Debian

Specification solution 2 – Login and master nodes for the HPC cluster (minimum requirements)

Goal: Login and master nodes for High Performance Computing
Rack: 19" rack with rack mount kits
Number of nodes: at least 3 (physical servers)
Architecture: x86-64
CPU: same as compute nodes
RAM: same as compute nodes
InfiniBand connection: same as compute nodes
LAN connection: 1 x 10 Gbit RJ45 port per node and 1 x 1 Gbit RJ45 port per node (PXE booting support)
HDD: each node at least 2 local drives with capacity 500 GB, 15k rpm, RAID1
Redundancy: redundant, hot-swap version (power supply, RAID, etc.)
OS: open source Linux compatible with CentOS, Scientific Linux or Debian

Specification solution 3 – Storage systems for the HPC cluster (minimum requirements)

Goal: Storage for High Performance Computing
Rack: 19" rack with rack mount kits
Storage system: the storage system consists of HOME storage for user data and SCRATCH storage for temporary data and intermediate results
HOME capacity: at least 192 TB net usable capacity – actually usable by a user
HOME speed: actually achievable sustainable aggregate speed of sequential operations with a 256 KB block: 800 MB/s for reading and 500 MB/s for writing
HOME system: NFSv4 with Kerberos support
SCRATCH capacity: at least 192 TB net usable capacity – actually usable by a user
SCRATCH speed: actually achievable sustainable aggregate speed of sequential operations with a 256 KB block: 1400 MB/s for reading and 800 MB/s for writing
SCRATCH system: parallel file system (e.g. Lustre, GPFS or similar)

Specification solution 4 – Front-end servers for the storage system (minimum requirements)

Goal: Front-end servers for the HOME and SCRATCH storage systems
Rack: 19" rack with rack mount kits
Number of nodes: at least 3 (physical servers), 1 active front-end for each storage system (HOME, SCRATCH) and at least one passive fail-over
Architecture: x86-64
CPU: minimum SPECint2006 rate baseline for one node: 420
RAM: 128 GB ECC RAM for each node
InfiniBand connection to the HPC cluster: min. 2 x InfiniBand QDR or FDR links (same as compute nodes)
LAN connection: 1 x 10 Gbit RJ45 port per node and 1 x 1 Gbit RJ45 port per node (PXE booting support)
HDD: each node at least 2 local drives with capacity 300 GB, 10k rpm, RAID1
Redundancy: redundant, hot-swap version (power supply, RAID, etc.)
OS: open source Linux compatible with CentOS, Scientific Linux or Debian

Specification solution 5 – Infrastructure for the HPC cluster (minimum requirements)

Goal: Infrastructure for High Performance Computing
Rack: the whole system must fit within 2 racks, which must be included in the offer together with rack mount kits
Dimensions: 42-48U, 600 or 800 x 1200 mm, with a cooling backdoor compatible with the FzÚ (IoP) water cooling system
Connections – InfiniBand:
- InfiniBand switches for connecting all nodes (compute, admin, login and storage system front-end servers), InfiniBand connection between core switches
- QDR or FDR InfiniBand technology
- Cables must be included
Connections – LAN network:
- LAN switch for connecting all nodes and data storage (management)
- Internal network connection 1 Gbit (metallic or fiber)
- Outside connection through the login, admin and storage front-end (HOME, SCRATCH) nodes: min. 4 x 1 Gbit RJ-45 (metallic)
- Outside connection through the login, admin and storage front-end (HOME, SCRATCH) nodes: min. 4 x 10 Gbit SFP+ (fiber)
- Fully compatible with the FzÚ (IoP) network (LAN management and scripting)
- Cables for the internal connection must be included
Power supply: The maximum power consumption of all HPC cluster parts at full operation (including compute nodes, the whole storage system with front-end servers, switches, login and admin nodes, fans and all other electrical components) must be less than 40 kW. The maximum power consumption must be explicitly stated, including the calculation behind it, which should be done as follows: add up the nameplate power (or the maximum power consumption stated by the manufacturer) of all anticipated components. If the manufacturer of a component does not state its wattage, it can be determined by multiplying the current (in amps) by the voltage (in volts) of the device to get the VA, which approximates the number of watts the device will consume. (An illustrative sketch of this calculation is given at the end of this document.)

Additional parameters of the HPC cluster:

1. Data speed/capacity are stated using the following units: 1 TB = 1 000 000 000 000 bytes, 1 GB = 1 000 000 000 bytes, 1 MB = 1 000 000 bytes, 1 Gbit = 1 000 000 000 bits.
2. The hardware components must be identical in all compute nodes (including memory modules).
3. Redundancy is required if hardware components are shared by several compute nodes; specifically, no more than 2 nodes may fail as a result of the failure of a single hardware component. In the case of blade servers, it must be possible to replace individual components (switches, servers, power supplies, etc.) during operation.
4. All memory channels of all processors must be populated, with the same number of DIMM modules in each channel. All DIMMs in all nodes must be identical.
5. MLC technology is acceptable for internal node SSD drives. The linear read and write speed must be at least 500 MB/s. Each SSD drive must provide at least 50 000 IOPS for random read and write.
6. Network booting of the operating system must be supported, as well as local booting from an external drive. It must be possible to set the boot device order.
7. All compute nodes must be connected through InfiniBand in a non-blocking fat-tree configuration. (A sizing sketch is given at the end of this document.)
8. Access to the console of each node must be provided through one central location (a single monitor and keyboard).
9. The mainboard must contain a management controller (BMC) compatible with IPMI 2.0 or higher, supporting remote power management and monitoring of the fans and of the CPU and mainboard temperatures.
10. All HW components must be supported by the kernel or by an external driver supplied with source code.
11. The software installed on the HPC cluster must include the gcc and Intel compilers and the OpenMPI, MVAPICH2 and Intel MPI libraries. A module environment is required.
12. Open source or proprietary software must be provided for system management and administration: scalable distributed computing management and provisioning tools that provide a unified interface for hardware control, discovery, and diskful/diskless OS deployment. In the case of proprietary software, the price of the software, its license, support (including the resolution of software conflicts and usability problems) and the supply of updates and patches for bugs and security holes for at least 3 years must be included. The documentation for all the software must be included and must be in English.
13. The results of performance tests must be supplied. The performance can be demonstrated by providing official results from www.spec.org for an equivalent system or by running the benchmark on one of the supplied compute nodes.
14. The supplier must verifiably and reproducibly demonstrate that the cluster meets the specified performance parameters during the acceptance tests.
15. If the specified performance is not achieved, the supplier will have the option to optimize the HW or SW so that the system reaches the stated performance, but the acceptance protocol will not be signed until the stated performance is achieved.

Additional specification of the storage system:

1. Each part of the storage system (HOME and SCRATCH) must be connected to the cluster through its own front-end server. An additional front-end server is required as a fail-over.
2. The front-end servers must have identical hardware and must be integrated into the InfiniBand infrastructure.
3. The front-end servers' InfiniBand and 10 Gbit interfaces may be on the same card, but it must be possible to use both of them at the same time.
4. Access to the console of each front-end server must be provided through one central location (a single monitor and keyboard).
5. The mainboards of the front-end servers must contain a management controller (BMC) compatible with IPMI 2.0 or higher, supporting remote power management and monitoring of the fans and of the CPU and mainboard temperatures.
6. All HW components of all front-end servers must be supported by the kernel or by an external driver supplied with source code.
7. In the case of the HOME storage, the front-end server must export NFSv4 and support Kerberos authentication. The NFSv4 export may re-export another filesystem.
8. Data speed/capacity are stated using the following units: 1 TB = 1 000 000 000 000 bytes, 1 GB = 1 000 000 000 bytes, 1 MB = 1 000 000 bytes, 1 Gbit = 1 000 000 000 bits.
9. The net usable capacity of a data storage solution must be stated for the proposed/delivered configuration intended for standard operation. It must not be based on assumptions that cannot be guaranteed, that restrict the use of the data storage or of other data storage solutions, or that do not comply with the requirements or possible interests of the Client.
10. The determination of the net usable capacity must not count system features or components as potential additional storage space on the basis of assumptions that cannot be guaranteed (compression, deduplication, etc.), nor allocate more space than is physically possible or actually available without further action (oversubscription).
11. The tools and solutions used to determine the capacity must provide credible information and must work with a known data block size or a known and accurate unit.
12. The data storage speed must be stated for the proposed/delivered configuration intended for standard operation (with full storage capacity). It must not be based on assumptions that cannot be guaranteed, that restrict the use of the data storage or of other data storage solutions, or that do not comply with the requirements or possible interests of the Client. For example, the performance of the HOME storage must not be influenced in any manner by the use of the SCRATCH storage.
13. The determination of speed must not be based on the assumption of specific favourable conditions or a specific favourable measurement mode (e.g. cache operations), unless such conditions or mode are explicitly required or stated.
14. Data storage solution components (disks, power supplies, RAIDs, switches, servers) must be replaceable during operation without causing any failure of the data storage operation.
15. The storage system must have a high density: an average density of at least 5 Tbit per 1U, including all components of the storage system.
16. The HOME storage solution's disk arrays must ensure data protection: RAID6 in a 16+2 configuration (or better), or an equivalent technology with the same level of protection (number of parity drives). (An illustrative capacity calculation is given at the end of this document.)
17. The SCRATCH storage solution's disk arrays must ensure data protection: RAID5 in a 16+1 configuration (or better), or an equivalent technology with the same level of protection (number of parity drives).
18. The HOME storage RAID6 array may consist of groups connected together on the front-end server, but all groups must have the same configuration and each RAID group must be implemented using an external controller.
19. The SCRATCH storage RAID5 array may consist of groups connected together on the front-end server, but all groups must have the same configuration and each RAID group must be implemented using an external controller.
20. At least 4 GB of write-back cache is required for all hardware RAID controllers.
21. The configuration of the HOME storage must allow a rebuild within 48 hours during standard operation (a performance decrease is acceptable).
22. At least 6 hot spare drives must be included.
23. All hard drives in the HOME and SCRATCH storage must be of the same type and size.
24. Components of the SCRATCH and HOME networks may be shared, but the required performance parameters must be maintained while both parts of the storage system are tested at once.
25. The speed of the HOME and SCRATCH storage systems will be measured by writing a large data file from 8 clients on a single compute node. It will be determined using:
iozone -t 8 -Mce -s1000g -r256k -i0 -i1 -F file1 file2 file3 file4 file5 file6 file7 file8
The relevant results are "Children see throughput for 8 initial writers" (for writing) and "Children see throughput for 8 readers" (for reading). The results of all tests must be supplied for iozone version 3.347 (http://www.iozone.org). (A sketch for extracting these figures from the iozone report is given at the end of this document.)
26. Open source or proprietary software must be provided for storage system management and administration, including filesystems. In the case of proprietary software, the price of the software, its license, support (including the resolution of software conflicts and usability problems) and the supply of updates and patches for bugs and security holes for at least 3 years must be included. The documentation for all the software must be included and must be in English.
27. The supplier must verifiably and reproducibly demonstrate that the cluster meets the specified performance parameters during the acceptance tests.
28. If the specified performance is not achieved, the supplier will have the option to optimize the HW or SW so that the system reaches the stated performance, but the acceptance protocol will not be signed until the stated performance is achieved.
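Illustrative calculation sketches (referenced above):

Power budget – relates to the "Power supply" requirement of specification solution 5. The sketch below follows the prescribed method (sum of nameplate wattages, with watts approximated as amps x volts where only current and voltage are stated); all component counts and wattages in it are hypothetical placeholders, not values taken from this specification.

    # Hypothetical bill of components: (name, quantity, nameplate watts or None, (amps, volts) or None)
    components = [
        ("compute node",        56, 450,  None),
        ("login/master node",    3, 500,  None),
        ("storage front-end",    3, 500,  None),
        ("storage enclosure",    4, None, (5.0, 230.0)),  # wattage not stated, only current and voltage
        ("InfiniBand switch",    4, 250,  None),
        ("1 Gbit LAN switch",    2, 100,  None),
    ]

    def component_watts(nameplate_w, amps_volts):
        """Use the nameplate wattage if stated; otherwise approximate W ~= A x V (VA)."""
        if nameplate_w is not None:
            return nameplate_w
        amps, volts = amps_volts
        return amps * volts

    total_w = sum(qty * component_watts(watts, av) for _, qty, watts, av in components)
    print(f"Estimated maximum power draw: {total_w / 1000:.1f} kW (required to stay below 40 kW)")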
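Fat-tree sizing – relates to additional parameter 7 of the HPC cluster. A minimal sizing sketch under stated assumptions: 36-port InfiniBand switches (a typical QDR/FDR edge-switch radix, not fixed by this specification) and the minimum link counts from specification solutions 1, 2 and 4 (56 compute nodes and 3 login/master nodes with one link each, 3 storage front-end servers with two links each).

    import math

    SWITCH_PORTS = 36                         # assumed switch radix, not part of the specification
    endpoint_links = 56 * 1 + 3 * 1 + 3 * 2   # compute + login/master + storage front-end links

    # Two-level non-blocking fat tree: each leaf switch dedicates half of its
    # ports to node links and half to uplinks, so uplink bandwidth equals
    # downlink bandwidth.
    down_per_leaf = SWITCH_PORTS // 2
    leaf_switches = math.ceil(endpoint_links / down_per_leaf)
    uplinks = leaf_switches * (SWITCH_PORTS - down_per_leaf)
    spine_switches = math.ceil(uplinks / SWITCH_PORTS)

    print(f"{endpoint_links} end-point links -> {leaf_switches} leaf + {spine_switches} spine switches "
          f"({SWITCH_PORTS}-port, two-level non-blocking fat tree)")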
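Net usable capacity – relates to additional storage specifications 8-10 and 16. A worked sketch of how the 192 TB net usable HOME capacity can be accounted for with RAID6 groups of 16+2 drives; the individual drive size and the number of RAID groups are assumptions for illustration only, and hot spare drives are intentionally left out of both figures.

    TB = 10**12                          # decimal units as required by the specification

    drive_tb = 6                         # assumed individual drive size (hypothetical)
    data_drives, parity_drives = 16, 2   # RAID6 group of 16+2 as required for HOME
    raid_groups = 2                      # assumed number of identical RAID groups

    net_usable_tb = raid_groups * data_drives * drive_tb
    raw_tb = raid_groups * (data_drives + parity_drives) * drive_tb   # hot spares not counted here

    print(f"Raw capacity: {raw_tb} TB ({raw_tb * TB} bytes), "
          f"net usable: {net_usable_tb} TB (requirement: at least 192 TB)")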
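iozone results – relates to additional storage specification 25. A minimal sketch for pulling the two named throughput figures out of a saved iozone report. It assumes the usual layout of iozone throughput-mode output ("Children see throughput for  8 initial writers = <value> kB/sec"); the exact spacing may differ between iozone builds, and the values are printed exactly as iozone reports them.

    import re
    import sys

    # The two result lines named in the specification; the numeric throughput
    # value is captured as reported by iozone.
    PATTERNS = {
        "write (8 initial writers)": re.compile(r"Children see throughput for\s+8 initial writers\s*=\s*([\d.]+)"),
        "read (8 readers)":          re.compile(r"Children see throughput for\s+8 readers\s*=\s*([\d.]+)"),
    }

    def parse_iozone_report(path):
        """Return the aggregate write/read throughput figures found in an iozone report file."""
        text = open(path, encoding="utf-8", errors="replace").read()
        return {name: float(match.group(1))
                for name, pattern in PATTERNS.items()
                if (match := pattern.search(text))}

    if __name__ == "__main__":
        for name, value in parse_iozone_report(sys.argv[1]).items():
            print(f"{name}: {value} (kB/sec as reported by iozone)")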