DSCC 201/401 Tools and Infrastructure for Data Science
September 11, 2023

Office Hours
• Office Hours are in Wegmans 1219
• Ethan Leung: Tuesday 5:00 - 6:00 pm
• Carol Li: Wednesday 3:30 - 4:30 pm
• Ziyu Zhao: Friday 5:00 - 6:00 pm

Hardware Resources for Data Science
• Supercomputers
• Cluster Computing
• Virtualization and Cloud Computing

CPU
• CPU = Central Processing Unit
[Diagram: CPU block diagram showing the Arithmetic Logic Unit, Control Unit, Instruction Memory, Data Memory, and Input/Output]

Intel Phi
• Many integrated core (MIC) architecture
• Introduced in 2013 and x86 compatible
• Goal is to provide many cores at a slower clock speed (opposite of the initial driver for standard CPUs)
• X100 Series - introduced as a PCIe card (e.g. Phi 5110P - 60 cores, 1.0 GHz, 1.0 TF (DP))
• Evolved to exist as standalone chips - Knights Landing (72 cores, 1.5 GHz, 3.5 TF), Knights Hill (canceled)
• Much of the development of Intel Phi has been integrated into the latest server-class CPUs (starting with Skylake)

[Photo: Trinity at Los Alamos National Laboratory - Cray XC30 with Intel Haswell + Knights Landing]

GPU
• GPU = Graphics Processing Unit
• "A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device"
• Originally developed for graphics and high-end video gaming systems (and continues to be developed)
• Eventually extended for general purpose computing using programming models to access and control a GPU - starting around 2007, when CUDA (Compute Unified Device Architecture) was introduced

GPU
• The GPU is a coprocessor to the CPU and is tied to at least one core
• A GPU has many more cores than a CPU (e.g. 2,688)
• GPU RAM is smaller than CPU RAM (e.g. 8 GB)
[Diagram: CPU with cache and memory (RAM) connected over the PCIe bus to a GPU with its own GPU memory (RAM)]

GPU
• What does a GPU look like?
• Expansion card or integrated onto the board

GPU
• What does a GPU look like?
• PCIe (Peripheral Component Interconnect Express) card
• Each device has a grid of blocks, and each block has shared memory and threads
[Diagram: GPU device containing a grid of blocks plus global memory]

GPU Specifications
• Nvidia GPUs designed for high-performance computing are referred to as Tesla
• Nvidia Tesla has had many major generations of GPUs for computing: Fermi, Kepler, Pascal, Volta, Ampere, and Hopper

  GPU     Generation   CUDA Cores   GPU RAM   TF (DP)
  C2050   Fermi        448          6 GB      0.5
  K20     Kepler       2,496        5 GB      1.2
  K20X    Kepler       2,688        6 GB      1.3
  K80     Kepler       4,992        24 GB     2.9
  P100    Pascal       3,584        16 GB     4.7
  V100    Volta        5,120        16 GB     7.0
  A100    Ampere       6,912        40 GB     9.7
  H100    Hopper       14,592       80 GB     24

GPU - Programming Model and PCIe vs. NVLink
• Multiple GPU cards can be placed in a host computer
• Communication through the PCIe bus (P2P)

GPU - Programming Model and PCIe vs. NVLink
• Multiple GPU cards can be placed in a host computer
• NVLink is available to Pascal, Volta, Ampere, and Hopper class GPUs in systems that support it - much better performance (but expensive!)

[Photo: Nvidia DGX-1]

[Photo: Nvidia DGX A100]

[Photo: Summit at Oak Ridge National Laboratory - 149 PF, IBM Power 9 + Nvidia V100 GPU (2 CPUs + 6 GPUs per node)]

Calculating Performance
• Theoretical performance is calculated from the chip architecture and clock speed (a rough calculation is sketched below)
• The most common metric is based on calculation of double-precision floating-point numbers (i.e. double in C++) - 64 bits
• FLOPS = FLoating point OPerations per Second
• We need to consider what type of floating point operation per second!
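As a rough illustration of how theoretical peak performance follows from architecture and clock speed, the sketch below multiplies cores, clock rate, and floating-point operations per cycle. This is a minimal sketch: the 32 DP FLOPs per cycle per core used for the Xeon Gold 6230 is an assumption based on two AVX-512 FMA units (8 doubles x 2 ops x 2 units), and the function and variable names are illustrative only.

```python
# Minimal sketch: theoretical peak FLOPS = cores x clock (Hz) x FLOPs per cycle per core.
# The flops_per_cycle value is architecture-dependent and assumed here, not measured.

def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Return theoretical peak performance in teraFLOPS."""
    return cores * clock_ghz * 1e9 * flops_per_cycle / 1e12

# Example: Intel Xeon Gold 6230 (Cascade Lake), 20 cores at 2.1 GHz.
# Assuming 32 DP FLOPs/cycle/core, this lands near the ~1.3 TF (DP)
# figure quoted for this CPU in the acceleration tables later on.
print(f"DP: {peak_tflops(20, 2.1, 32):.2f} TF")   # ~1.34 TF
print(f"SP: {peak_tflops(20, 2.1, 64):.2f} TF")   # ~2.69 TF (twice as many floats per vector)
```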
  Name               Abbreviation   Memory (Bytes)   Bits   Format
  Double Precision   DP             8                64     FP64
  Single Precision   SP             4                32     FP32
  Half Precision     HP             2                16     FP16

Floating Point Precision
  Format   Sign (bits)   Exponent (bits)   Fraction (bits)   Example
  FP64     1             11                52                3.14159265359
  FP32     1             8                 23                3.14159
  FP16     1             5                 10                3.14
Note: Values of pi are only shown for illustration purposes.
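To make the formats above concrete, here is a small sketch (assuming NumPy is available) showing the storage size of FP64, FP32, and FP16 and how aggressively each rounds the value of pi, mirroring the example values in the table.

```python
import numpy as np

# Each dtype stores pi with the number of fraction bits listed above,
# so the value is rounded more aggressively as precision drops.
for dtype in (np.float64, np.float32, np.float16):
    value = dtype(np.pi)
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: {info.bits} bits ({info.bits // 8} bytes), pi ~= {value}")

# Typical output (exact formatting may vary by NumPy version):
# float64: 64 bits (8 bytes), pi ~= 3.141592653589793
# float32: 32 bits (4 bytes), pi ~= 3.1415927
# float16: 16 bits (2 bytes), pi ~= 3.14
```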
GPU Acceleration for Machine Learning
• Machine learning algorithms generally do not need double precision
• GPUs provide additional acceleration when high precision is not required

  GPU/CPU                          Generation     Cores    TF (DP)    TF (SP)   TF (HP)
  C2050                            Fermi          448      0.5        1.0       -
  K20X                             Kepler         2,688    1.3        3.9       -
  K80                              Kepler         4,992    2.9        8.7       -
  P100                             Pascal         3,584    4.7        9.3       18.7
  V100                             Volta          5,120    7.0        14.0      125
  A100                             Ampere         6,912    9.7/19.5   19.5      312 (624)
  H100                             Hopper         14,592   26/51      51        800 (1513)
  Intel Xeon Gold 6230 (2.1 GHz)   Cascade Lake   20       1.3        2.6       5.2

TPU Acceleration for Machine Learning
• Google's Tensor Processing Unit (TPU) - 90 TF (HP)
• Available on Google Cloud Platform
• TPU vs. ASIC (Application Specific Integrated Circuit)

GPU Acceleration for Machine Learning
• AMD is working on competitors based on Radeon Instinct
• AMD is focusing its programming efforts on OpenCL (Open Computing Language), in contrast to Nvidia's CUDA

  GPU or CPU                       Cores    TF (DP)   TF (SP)   TF (HP)
  MI250X                           14,080   47.9      47.9      383
  MI100                            7,680    11.5      23.1      185
  MI60                             4,096    7.4       14.7      29.5
  Intel Xeon Gold 6230 (2.1 GHz)   20       1.3       2.6       5.2

[Photo: Selene at Nvidia Corporation (USA) - 63 PF, Nvidia DGX A100 SuperPOD with Nvidia A100 GPUs]

GPUs - More Than Just For Supercomputers
• The Nvidia Tesla line is designed for supercomputers and server-class architectures
• The Nvidia Tegra line is designed for mobile devices and embedded systems
• Google's Edge TPU is another device designed for embedded and mobile systems
• AI applications - computer vision for cars, robotics, etc.

  GPU      Generation   CUDA Cores   GPU RAM   TF (SP)   TF (HP)
  K1       Kepler       192          8 GB      0.4       -
  X1       Maxwell      256          8 GB      0.5       1.0
  X2       Pascal       256          8 GB      0.8       1.5
  Xavier   Volta        512          16 GB     1.4       2.8
  Orin     Ampere       2,048        64 GB     5.3       10.6

Cluster Computing

What is a Linux Cluster?
• A group of computers linked by a high-speed interconnect that can act as a large system for big computations and data processing
• The group works closely together and has the appearance of a single computer
• Runs an operating system that uses the Linux kernel
• Uses software to control computational tasks

How is a Linux Cluster Different from Other Supercomputers?
• A Linux cluster is a type of supercomputer
• Typically constructed from "commodity" server hardware
• Linux clusters have a more customizable system architecture (processor, memory, interconnect)
• Usually deployed with "less proprietary" designs and configurations and can be constructed with smaller discrete units (e.g. node vs. rack)
• We will examine Linux clusters in the context of our own here at the University of Rochester, known as BlueHive

Linux Clusters - Major Components
• Computing
• Storage
• Network
• Software

Some Hardware Definitions
• Storage - permanent data storage (hard drive or solid state drive); does not go away when the system is powered off
• Memory - usually refers to RAM (random access memory); data goes away when the system is powered off (volatile)
• Processor - a computing chip that processes data (usually refers to a CPU)
• CPU - central processing unit, the chip that does the computing, also known as a processor
• Socket - a physical location on a computer board that can house a CPU
• Core - a computing element of a CPU that can process data independently; most CPUs today have multiple cores
• Node - a physical computing unit (e.g. server) with sockets that house one or more processors, banks of memory, and a network interface

Linux Cluster Hardware - Compute
• Compute nodes are typically dense servers that are responsible for processing in a Linux cluster
• Compute nodes are stacked and placed in standard 19-inch wide, 42U high racks
• Each server often has 2 sockets with RAM and local disk
• Each socket has 1 CPU, and that CPU has many cores
• At least 1 node in the cluster is a login node and 1 node in the cluster is a service node
• The login node is where users log in to the system and submit jobs
• The service node controls the system (not user accessible)

Linux Cluster Hardware - Compute
[Photo: e.g. Dell C6300 - 2 sockets, RAM, hard drive]

Linux Cluster Hardware - Compute
[Diagram: e.g. Dell C6300 - 4 compute nodes (CN) and power supplies (PS) in a 2U chassis (~3.5 in)]

Linux Cluster Hardware - Compute
[Diagram: compute node chassis stacked in a 42U rack (~6.5 ft) with 19 inch width rails]

Linux Cluster Hardware - Storage
• Individual hard drives on compute nodes - BUT typically not enough!
• How do we make sure all files are accessible by any one of the compute nodes?
• A clustered file system allows a file system to be mounted on several nodes
• Network attached storage (NAS) can be mounted on nodes using NFS (network file system) - similar to a "network drive"
• Better performance and redundancy are achieved through parallel file systems, which are a special type of clustered file system
• Parallel file systems provide concurrent high-speed file access to applications executing on multiple nodes of a cluster
• Efficiency and redundancy are improved by allowing nodes to access storage at the block level (which is a lower level than the file level)

Parallel File Systems
• Lustre
  • Developed by Cluster File Systems, Inc. (but now open source)
  • Uses metadata servers, object storage servers, and client servers for the file system
• Spectrum Scale (i.e. GPFS - General Parallel File System)
  • Developed by IBM (proprietary)
  • Parallel file system that provides access to block level storage on multiple nodes
  • Blocks are distributed across multiple disk arrays with redundancy (declustered RAID - Redundant Array of Independent Disks), as sketched below
  • Metadata (i.e. information about the files and layout) are distributed across the multiple disk arrays

Spectrum Scale (GPFS)
• JBOD (Just a Bunch of Disks) enclosures provide the actual storage - typically 60 disks each
• NSD (Network Shared Disk) servers share out the storage to the clients through a network connection
• GPFS server and client software
[Diagram: 42U rack with JBOD enclosures connected to NSD servers]
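The actual placement logic inside GPFS or Lustre is much more sophisticated, but the toy sketch below illustrates the block-striping idea described above: a file is split into fixed-size blocks and the blocks are spread round-robin across several disk arrays so different nodes can read different blocks concurrently. All names, sizes, and the round-robin rule are illustrative assumptions, not the real GPFS algorithm.

```python
# Toy illustration of block-level striping across disk arrays.
# This is NOT the GPFS placement algorithm - just a round-robin sketch of the concept.

BLOCK_SIZE = 4 * 1024 * 1024                         # assume 4 MiB blocks (illustrative)
ARRAYS = ["array0", "array1", "array2", "array3"]    # hypothetical disk arrays

def stripe(file_size: int, block_size: int = BLOCK_SIZE, arrays=ARRAYS):
    """Map each block index of a file to a disk array, round-robin."""
    n_blocks = -(-file_size // block_size)            # ceiling division
    return {block: arrays[block % len(arrays)] for block in range(n_blocks)}

# A 30 MiB file becomes 8 blocks spread over 4 arrays, so reads of
# different blocks can proceed from different arrays at the same time.
for block, array in stripe(30 * 1024 * 1024).items():
    print(f"block {block} -> {array}")
```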
Linux Clusters - Networking
• Ethernet can be used but has high latency (e.g. 50-125 µs) due to the TCP/IP protocol
• Small packets of information take a long time to reach their destination
• InfiniBand has been designed for low latency (and high bandwidth) - typically less than 5 µs
• FDR10 (10 Gb/s), FDR (14 Gb/s), EDR (25 Gb/s), and HDR (50 Gb/s) links are commonly used today
• Links can be aggregated for extra bandwidth (e.g. 4X, 8X, etc.)
• BlueHive uses 4X aggregation of EDR (i.e. 100 Gb/s) and 4X aggregation of FDR10 (i.e. 40 Gb/s bandwidth)
• Copper cables can be used for InfiniBand lengths less than 10 meters (otherwise optical fiber cables are used)
• Network switches can be used to link all components together

[Photo: BlueHive]

Linux Cluster Hardware
• Necessary hardware components: compute nodes, storage, networking
• Also need a login node to provide a system for users to log in and interact with the system
• Also need a service node to provide a place to run and manage the control and monitoring software for the system

Linux Clusters - Software
• Operating system
• Management and monitoring software
• Job scheduling and launching software
• User software

Operating System
• >99% of supercomputers are based on Linux
• Beneficial for multi-user management in a shared environment
• Great for security and permissions (file, executable, etc.)
• Support for open source codes
• Excellent community support

Management and Monitoring Software
• xCAT (Extreme Cluster Administration Toolkit)
  • Provisions OS and node images
  • Remotely manages systems - reboot and distributed shell
• Ganglia
  • Node use and monitoring system
  • Shows CPU, RAM, and I/O usage of nodes
  • Can be user accessible
• Zabbix
  • Server monitoring software
  • Shows node stability and resource usage (CPU, RAM, and I/O)
  • Typically not user accessible

Job Scheduling and Managing Software
• Torque/Maui
• Slurm

Job Scheduling and Managing
• Torque/Maui
  • Torque manages the resources (PBS server)
  • Maui is the scheduler
[Diagram: users 1-3 submit run scripts from the login node; the PBS server (Torque) and Maui scheduler on the service node dispatch the jobs to compute nodes 0 through N]

Job Scheduling and Managing
• Slurm (Simple Linux Utility for Resource Management)
  • Very similar to Torque/Maui
  • Provides the necessary software for starting, stopping, and monitoring compute jobs and resources
  • Allocates and deallocates computing resources on nodes and partitions
  • Manages queues of jobs waiting for resources

Partitions
• Most multi-user clusters are divided into partitions, or subunits, that allow jobs with similar characteristics to run in a common environment
• BlueHive is divided into partitions
• Each partition has similar hardware attributes and limits on running jobs
• The user selects which partition to run a job in (a sketch of a job submission is shown below)
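To tie the scheduling and partition ideas together, here is a hedged Python sketch that writes a simple Slurm batch script and submits it with sbatch. The partition name, resource values, and file names are hypothetical placeholders rather than BlueHive's actual configuration; the #SBATCH options shown (--partition, --ntasks, --time, --mem, --output) are standard Slurm directives, and the script only works on a system where the sbatch command is available.

```python
import subprocess
from pathlib import Path

# Hypothetical job parameters - replace with values valid for your cluster and partition.
script = """\
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --partition=standard      # hypothetical partition name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=example_job_%j.log

echo "Running on $(hostname)"
python my_analysis.py             # hypothetical user program
"""

path = Path("example_job.sbatch")
path.write_text(script)

# Submit the job to the scheduler; Slurm replies with the assigned job ID.
result = subprocess.run(["sbatch", str(path)], capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())
```

In practice most users write the batch script by hand and run sbatch directly from the login node; the point of the sketch is simply that a job asks a specific partition for a bounded set of resources, and the scheduler queues it until those resources are free.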