Computing Environment

Joint High Performance Computing Exchange (JHPCE)
Johns Hopkins School of Public Health & School of Medicine
Updated: May 28, 2015
The Joint High Performance Computing Exchange (JHPCE) is a fee-for-service high-performance computing (HPC) facility housed in the Department of Biostatistics and optimized for life-science and statistical computing. It is jointly managed by the Department of Biostatistics and the Department of Molecular Microbiology & Immunology.
Computing and Storage resources
The primary computing resource consists of a heterogeneous cluster with 2640 64-bit cores and 19.8 TB
of DDR-SDRAM. “Fat nodes” have 64 cores and 512GB DDR-SDRAM. “Blade nodes” have 20 cores
and 128GB DDR-SDRAM. User access is via two login nodes. The cluster network fabric is 10Gbps
Ethernet.
Major statistical and mathematical computing packages are available, including R, SAS, Stata, MATLAB, and Mathematica. A community support model is used to maintain tools for a range of disciplines, including genomics, proteomics, epidemiology, medical imaging, and biostatistics.
There is 2PB of networked storage. NFS storage devices provide nearly 1PB of mass storage, including approximately 230TB (formatted) on (mostly) ZFS storage appliances and an additional 670TB on a custom-designed ZFS storage system. A Lustre-on-ZFS parallel file system provides an additional 1.1PB of high-speed storage. The Lustre network fabric provides a theoretical bandwidth of 80Gbps to the 10Gbps compute cluster fabric. The custom-designed Lustre system was developed in collaboration with BioTeam and Intel. To our knowledge, and on a per-TB basis, it is the lowest-cost, lowest-power parallel file system ever constructed.
Data transfer to and from the cluster is via scp, sftp, Globus, or Aspera. Very high-speed data transfer is supported by a dedicated transfer node that sits directly on three networks: 1) the 40Gbps Lustre network, 2) the 10Gbps compute cluster fabric, and 3) the 40Gbps uplink to the University's 100Gbps research network.
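As an illustration only, the sketch below shows one way a user might push data to the cluster by invoking scp from Python; the transfer-node address and paths are hypothetical placeholders, not documented endpoints.

    #!/usr/bin/env python3
    """Minimal sketch: pushing data to the cluster with scp via subprocess.

    The hostname and paths are hypothetical placeholders; substitute the
    transfer-node address and storage paths provided by the JHPCE staff.
    """
    import subprocess

    REMOTE = "username@transfer-node.example.edu"  # hypothetical transfer-node address

    def push(local_path: str, remote_dir: str) -> None:
        """Copy a local file or directory to the cluster (-r allows directories)."""
        subprocess.run(["scp", "-r", local_path, f"{REMOTE}:{remote_dir}"], check=True)

    if __name__ == "__main__":
        push("results.tar.gz", "/home/username/data/")  # example invocation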
Backup
ZFS snapshots are taken on the ZFS systems at midday, Monday through Friday. Incremental tape backups of home directories and selected file systems are performed nightly to IGM's IBM Tivoli Storage Manager (TSM) system. Approximately one fifth of the total storage (200TB) is backed up to TSM.
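As a rough sketch only, the snippet below illustrates the kind of scheduled, date-stamped ZFS snapshot involved; the dataset name is a hypothetical placeholder, and the actual schedule is managed by the JHPCE staff on the storage appliances.

    #!/usr/bin/env python3
    """Minimal sketch: taking a date-stamped ZFS snapshot, as a cron job might."""
    import datetime
    import subprocess

    DATASET = "tank/home"  # hypothetical ZFS dataset name

    stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H%M")
    # Creates a read-only, point-in-time snapshot, e.g. tank/home@2015-05-28_1200
    subprocess.run(["zfs", "snapshot", f"{DATASET}@{stamp}"], check=True)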
Server Rooms
The JHPCE facility is located in two server rooms in the School of Public Health building. All rooms are accessed via key card plus PIN. The basement server room is monitored with video. Most racks have been upgraded to 30-amp, 3-phase service. The 3rd-floor server room has two 5-ton computer room air conditioners (CRACs); the basement server room has two 10-ton CRACs.
Management
The facility is run by a multi-departmental team led by Dr. Fernando Pineda (MMI), who provides
technical, financial and strategic leadership. Mr. Mark Miller (MMI) is the computing systems manager.
Mr. Jiong Yang (Biostat) is the systems engineer. Ms. Debra Moffit (Biostat) provides financial oversight.
The Biostat Information Technology committee, chaired by Dr. Pineda, provides advice and oversight.
Security and HIPAA compliance
The JHPCE system can only be accessed via the secure shell (ssh) protocol. Either two-factor
authentication or public/private key-pair authentication is required. Access logs of user logins are
maintained and can be examined retrospectively. Multiple failed login attempts result in the offending IP address being blocked. We enforce a strict one-person, one-account policy: no sharing of user accounts is permitted, and violations can result in a user being banned from the system. Depending on the storage system, either ACLs or unix user and group permissions are used to enforce data access policies. Access permissions are determined by project PIs in consultation with the JHPCE staff.
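As a rough illustration only (not the facility's actual procedure), the sketch below shows how standard unix group ownership and permissions can restrict a project directory to a single group; the directory path and group name are hypothetical.

    #!/usr/bin/env python3
    """Minimal sketch: restricting a project directory to one unix group.

    The path and group name are hypothetical placeholders; actual groups and
    permissions are set by the JHPCE staff in consultation with the project PI.
    """
    import grp
    import os
    import stat

    PROJECT_DIR = "/project/example_lab"   # hypothetical project directory
    PROJECT_GROUP = "example_lab"          # hypothetical unix group

    gid = grp.getgrnam(PROJECT_GROUP).gr_gid
    os.chown(PROJECT_DIR, -1, gid)         # keep the owner, assign the project group
    os.chmod(PROJECT_DIR, stat.S_IRWXU | stat.S_IRWXG)  # rwx for owner and group only (0770)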
Policies and Cost recovery
Computing and storage resources are owned by stakeholders (e.g., departments, institutes, or individual labs). Job scheduling policies guarantee that stakeholders receive priority access to their nodes. Stakeholders are required to make their excess capacity available to other users via a low-priority queue. Stakeholders receive a reduction in their charges in proportion to the capacity that they share with other users. This system provides several advantages: 1) stakeholders see a stable upper limit on the operating cost of their own resources, 2) stakeholders can buy surge capacity on resources owned by other stakeholders, and 3) non-stakeholders obtain access to high-performance computing on a low-priority basis with a pay-as-you-go model.
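To make the mechanism concrete, the sketch below shows one way such a charge reduction could be computed; the per-core rate and the exact formula are illustrative assumptions, not the facility's actual accounting.

    #!/usr/bin/env python3
    """Minimal sketch of a chargeback calculation of the kind described above."""

    def monthly_charge(owned_cores: int, shared_fraction: float,
                       rate_per_core: float = 10.0) -> float:
        """Charge for a stakeholder's nodes, reduced in proportion to the
        capacity made available to other users via the low-priority queue.

        shared_fraction -- fraction of the stakeholder's capacity used by others (0 to 1)
        rate_per_core   -- hypothetical monthly rate per owned core
        """
        return owned_cores * rate_per_core * (1.0 - shared_fraction)

    if __name__ == "__main__":
        # Example: a stakeholder owns 64 cores, 25% of which were used by other groups.
        print(f"Monthly charge: {monthly_charge(64, 0.25):.2f}")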
The SPH Dean and the Department of Biostatistics both provide institutional support to the JHPCE. Unsupported costs are recovered from users and stakeholders in the form of management fees. Since 2007 we have employed a systematic resource- and cost-sharing methodology that has largely eliminated the administrative and political complexity associated with sharing complex and dynamic resources. Custom software implements dynamic cost-accounting and chargeback algorithms. Charges are calculated monthly but billed quarterly. Eight years of operation have demonstrated that this is a powerful approach to the fair and efficient allocation of human, computational, and storage resources. In addition, the methodology provides strong financial incentives for stakeholders to refresh their hardware.