
DSCC 201/401 Fall 2023 Lecture 3

DSCC 201/401
Tools and Infrastructure for
Data Science
September 11, 2023
Office Hours
• Office Hours are in Wegmans 1219
• Ethan Leung: Tuesday 5:00 - 6:00 pm
• Carol Li: Wednesday 3:30 - 4:30 pm
• Ziyu Zhao: Friday 5:00 - 6:00 pm
2
Hardware Resources for Data Science
• Supercomputers
• Cluster Computing
• Virtualization and Cloud Computing
3
CPU
• CPU = Central Processing Unit
[Diagram: CPU block diagram - Control Unit, Arithmetic Logic Unit, Instruction Memory, Data Memory, Input/Output]
4
Intel Phi
• Many integrated core (MIC) architecture
• Introduced in 2013 and x86 compatible
• Goal is to provide many cores at a slower clock speed (opposite of
initial driver for standard CPUs)
• X100 Series - Introduced as PCIe card (e.g. Phi 5110P - 60 cores, 1.0
GHz, 1.0 TF (DP))
• Evolved to exist as standalone chips - Knights Landing (72 cores, 1.5
GHz, 3.5 TF), Knights Hill (canceled)
• Much of the development of Intel Phi has been integrated into the latest
server-class CPUs (starting with Skylake)
5
Trinity: Los Alamos National Laboratory
Cray XC40: Intel Haswell + Knights Landing
GPU
• GPU = Graphics Processing Unit
• "A specialized electronic circuit designed to rapidly manipulate and alter
memory to accelerate the creation of images in a frame buffer intended
for output to a display device"
• Originally developed for graphics and high-end video gaming systems
(and continues to be developed)
• Eventually extended for general purpose computing using programming
models to access and control a GPU - starting around 2007 - CUDA
introduced (Compute Unified Device Architecture)
7
GPU
• GPU is a coprocessor to the CPU and is tied to at least one core
• A GPU has many more cores than a CPU (e.g. 2,688)
• GPU RAM is smaller than CPU RAM (e.g. 8 GB) - a minimal host-to-device
transfer sketch follows the diagram below
[Diagram: CPU with cache and memory (RAM) connected over the PCIe bus to a GPU with its own GPU memory (RAM)]
8
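To make the CPU/GPU split above concrete, here is a minimal CUDA sketch of staging data from host (CPU) RAM to device (GPU) RAM over the PCIe bus and back. The array name and size are illustrative assumptions, not from the slides.

```cuda
// Sketch only (illustrative names/sizes): copy data from CPU RAM to GPU RAM
// across the PCIe bus, leave room for kernel work, and copy it back.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;                  // ~1M floats (about 4 MB), arbitrary
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes); // host (CPU) RAM
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);             // device (GPU) RAM
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // over PCIe

    // ... launch kernels that operate on d_data here ...

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // back to CPU
    cudaFree(d_data);
    free(h_data);
    printf("copied %zu bytes to the GPU and back\n", bytes);
    return 0;
}
```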
GPU
• What does a GPU look like?
• Expansion card or integrated onto board
9
GPU
• What does a GPU look like?
• PCIe (Peripheral Component Interconnect Express) card
• Each device has a grid of blocks and each block has shared memory
and threads (see the kernel sketch below)
[Diagram: A device contains a grid of blocks; all blocks access global memory]
10
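A rough CUDA kernel sketch of the hierarchy named above - a grid of blocks, per-block shared memory, and threads reading and writing global memory. The kernel (a per-block sum) and the block size of 256 are assumptions for illustration, not taken from the slides.

```cuda
// Sketch only: each block stages 256 values from global memory into its
// shared memory, reduces them, and writes one partial sum back.
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];               // per-block shared memory
    int tid = threadIdx.x;                    // thread index within the block
    int gid = blockIdx.x * blockDim.x + tid;  // global index across the grid

    tile[tid] = (gid < n) ? in[gid] : 0.0f;   // load from global memory
    __syncthreads();

    // tree reduction within one block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}

// Example launch over a grid of blocks, 256 threads per block:
// blockSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```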
GPU Specifications
• Nvidia GPUs designed for high-performance computing are referred to
as Tesla
• Nvidia Tesla has many major generations of GPUs for computing:
Fermi, Kepler, Pascal, Volta, Ampere, and Hopper
GPU     Generation   CUDA Cores   GPU RAM   TF (DP)
C2050   Fermi        448          6 GB      0.5
K20     Kepler       2496         5 GB      1.2
K20X    Kepler       2688         6 GB      1.3
K80     Kepler       4992         24 GB     2.9
P100    Pascal       3584         16 GB     4.7
V100    Volta        5120         16 GB     7.0
A100    Ampere       6912         40 GB     9.7
H100    Hopper       14592        80 GB     24
11
GPU - Programming Model and PCIe vs. NVLink
• Multiple GPU cards can be placed in a host computer
• Communication through PCIe bus (P2P)
12
GPU - Programming Model and PCIe vs. NVLink
• Multiple GPU cards can be placed in a host computer
• NVLink is available on Pascal, Volta, Ampere, and Hopper class GPUs in
systems that support it - much better performance (but expensive!) - a
peer-access sketch follows below
13
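A hedged sketch of what peer-to-peer (P2P) access looks like in code: checking whether two GPUs in the same host can address each other directly and enabling it. Whether the resulting copies travel over PCIe or NVLink depends on how the particular system is wired; device numbers 0 and 1 are assumptions.

```cuda
// Sketch only: enable direct GPU-to-GPU (peer-to-peer) access if supported.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) { printf("need at least two GPUs\n"); return 0; }

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 reach GPU 1 directly?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // and the reverse?

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // GPU 0 may now address GPU 1
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);    // and vice versa
        printf("P2P enabled between GPU 0 and GPU 1\n");
        // cudaMemcpyPeer(dst, 1, src, 0, bytes) then moves data GPU-to-GPU
    } else {
        printf("P2P not supported; copies are staged through host RAM\n");
    }
    return 0;
}
```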
GPU - Nvidia DGX-1
14
GPU - Nvidia DGX A100
15
Summit: Oak Ridge National Laboratory - 149 PF
IBM Power 9 + Nvidia V100 GPU
2 CPUs + 6 GPUs
Calculating Performance
• Theoretical performance is calculated from chip architecture and clock
speed
• Most common metric is based on calculation of double-precision
floating-point numbers (i.e. double in C++) - 64 bits
• FLOPS = FLoating point OPerations per Second
• We need to consider what type of floating-point operation per second! (a worked example follows the table)
Name               Abbreviation   Memory (Bytes)   Bits
Double Precision   DP (FP64)      8                64
Single Precision   SP (FP32)      4                32
Half Precision     HP (FP16)      2                16
17
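As a worked example of "theoretical performance from chip architecture and clock speed" (assuming the Intel Xeon Gold 6230 in the later tables can issue 32 double-precision FLOPs per core per cycle via two AVX-512 FMA units - a microarchitecture assumption, not stated on the slide):

\[
20\ \text{cores} \times 2.1\ \text{GHz} \times 32\ \tfrac{\text{FLOPs}}{\text{core}\cdot\text{cycle}} \approx 1.3\ \text{TF (DP)}
\]

which matches the 1.3 TF (DP) listed for that CPU later in these slides.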
[Diagram: Floating-point bit layouts - FP64: sign (1), exponent (11), fraction (52), e.g. 3.14159265359; FP32: sign (1), exponent (8), fraction (23), e.g. 3.14159; FP16: sign (1), exponent (5), fraction (10), e.g. 3.14]
Note: Values of pi only shown for illustration purposes.
18
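A small hedged sketch (CUDA C++, compiled with nvcc; names illustrative) that stores the same value of pi at the three precisions above and prints it, showing how many digits survive at each width:

```cuda
// Sketch only: pi at FP64, FP32, and FP16 (the half type is from cuda_fp16.h).
#include <cstdio>
#include <cuda_fp16.h>

int main() {
    double pi64 = 3.14159265358979323846;   // FP64: 52 fraction bits
    float  pi32 = (float)pi64;              // FP32: 23 fraction bits
    __half pi16 = __float2half(pi32);       // FP16: 10 fraction bits

    printf("FP64: %.12f\n", pi64);
    printf("FP32: %.12f\n", (double)pi32);
    printf("FP16: %.12f\n", (double)__half2float(pi16));
    return 0;
}
```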
GPU Acceleration for Machine Learning
• Machine learning algorithms generally do not need double precision
• GPUs provide additional acceleration when high precision is not required
GPU/CPU                         Generation     Cores    TF (DP)    TF (SP)   TF (HP)
C2050                           Fermi          448      0.5        1.0       -
K20X                            Kepler         2,688    1.3        3.9       -
K80                             Kepler         4,992    2.9        8.7       -
P100                            Pascal         3,584    4.7        9.3       18.7
V100                            Volta          5,120    7.0        14.0      125
A100                            Ampere         6,912    9.7/19.5   19.5      312 (624)
H100                            Hopper         14,592   26/51      51        800 (1513)
Intel Xeon Gold 6230, 2.1 GHz   Cascade Lake   20       1.3        2.6       5.2
19
TPU Acceleration for Machine Learning
• Google’s Tensor Processing Unit (TPU) - 90 TF (HP)
• Available on Google Cloud Platform
• TPU vs. ASIC (Application Specific Integrated Circuit)
20
GPU Acceleration for Machine Learning
• AMD is working on competitors based on Radeon Instinct
• AMD focusing programming efforts on OpenCL (Open Computing
Language) in contrast to Nvidia’s CUDA
GPU or CPU                      Cores    TF (DP)   TF (SP)   TF (HP)
MI250X                          14,080   47.9      47.9      383
MI100                           7,680    11.5      23.1      185
MI60                            4,096    7.4       14.7      29.5
Intel Xeon Gold 6230, 2.1 GHz   20       1.3       2.6       5.2
21
22
23
Selene: Nvidia Corporation (USA) - 63 PF
Nvidia DGX A100 Superpod with Nvidia A100 GPU
GPUs - More Than Just For Supercomputers
• Nvidia Tesla line is designed for supercomputers and server-class
architectures
• Nvidia Tegra line is designed for mobile devices and embedded systems
• Google's Edge TPU is another device designed for embedded and mobile
systems
• AI application - computer vision for cars, robotics, etc.
GPU      Generation   CUDA Cores   GPU RAM   TF (SP)   TF (HP)
K1       Kepler       192          8 GB      0.4       -
X1       Maxwell      256          8 GB      0.5       1.0
X2       Pascal       256          8 GB      0.8       1.5
Xavier   Volta        512          16 GB     1.4       2.8
Orin     Ampere       2048         64 GB     5.3       10.6
Cluster Computing
26
What is a Linux Cluster?
• A group of computers linked by a high-speed interconnect that can act
as a large system for big computations and data processing
• Group works closely together and has the appearance of a single
computer
• Runs an operating system that uses the Linux kernel
• Uses software to control computational tasks
27
28
How is a Linux Cluster Different from Other
Supercomputers?
• A Linux cluster is a type of supercomputer
• Typically constructed from "commodity" server hardware
• Linux clusters have a more customizable system architecture
(processor, memory, interconnect)
• Usually deployed with "less proprietary" designs and configurations and
can be constructed with smaller discrete units (e.g. node vs. rack)
• We will examine Linux clusters in the context of our own here at the
University of Rochester, known as BlueHive
29
Linux Clusters - Major Components
• Computing
• Storage
• Network
• Software
30
Some Hardware Definitions
• Storage - permanent data storage (hard drive or solid state drive),
does not go away when system is powered off
• Memory - usually refers to RAM (random access memory), data goes
away when system is powered off (volatile)
• Processor - a computing chip that processes data (usually refers to a
CPU)
• CPU - central processing unit, the chip that does the computing, also
known as a processor
• Socket - a physical location on a computer board that can house a CPU
• Core - a computing element of a CPU that can process data
independently, most CPUs today have multiple cores
• Node - a physical computing unit (e.g. server) with sockets that have
one or more processors, banks of memory, and a network interface
31
Linux Cluster Hardware - Compute
• Compute nodes are typically dense servers that are responsible for
processing in a Linux cluster
• Compute nodes are stacked and placed in standard 19-inch wide 42U
high racks
• Each server often has 2 sockets with RAM and local disk
• Each socket has 1 CPU and that CPU has many cores
• At least 1 node in the cluster is a login node and 1 node in the cluster
is a service node
• Login node is where users log in to system and submit jobs
• Service node controls the system (not user accessible)
32
Linux Cluster Hardware - Compute
33
Linux Cluster Hardware - Compute
e.g. Dell C6300: 2 sockets, RAM, hard drive
34
Linux Cluster Hardware - Compute
[Diagram: 2U chassis (~3.5 in high) holding 4 compute nodes (CN) and 2 power supplies (PS)]
e.g. Dell C6300: 4 compute nodes in 2U
35
Linux Cluster Hardware - Compute
[Diagram: 42U rack (~6.5 ft tall, 19-inch width rails) filled with 2U chassis of compute nodes (CN) and power supplies (PS)]
36
Linux Cluster Hardware - Storage
• Individual hard drives on compute nodes - BUT typically not enough!
• How do we make sure all files are accessible by any one of the compute
nodes?
• Clustered file system allows a file system to be mounted on several
nodes
• Network attached storage (NAS) can be mounted on nodes using NFS
(network file system) - similar to "network drive"
• Better performance and redundancy are achieved through parallel file
systems, which are a special type of clustered file system
• Parallel file systems provide concurrent high-speed file access to
applications executing on multiple nodes of clusters
• Efficiency and redundancy are improved by allowing nodes to access the
block level (which is a lower level than the file level)
37
Parallel File Systems
• Lustre
• Developed by Cluster File Systems, Inc. (but now open source)
• Uses metadata servers, object storage servers, and client servers for
the file system
• Spectrum Scale (i.e. GPFS - General Parallel File System)
• Developed by IBM (proprietary)
• Parallel file system that provides access to block level storage on
multiple nodes
• Blocks are distributed across multiple disk arrays with redundancy
(declustered RAID - Redundant Array of Independent Disks)
• Metadata (i.e. information about the files and layout) are distributed
across the multiple disk arrays
38
Spectrum Scale (GPFS)
• JBOD (Just a Bunch of Disks) provides the actual storage - typically 60 disks
• NSD (Network Shared Disk) Servers share out the storage to the clients
through a network connection
• GPFS server and client software
[Diagram: 42U rack of JBOD enclosures connected to NSD servers]
39
Linux Clusters - Networking
• Ethernet can be used but has high latency (e.g. 50-125 µs) due to the
TCP/IP protocol
• Small packets of information take a long time to reach destination
• InfiniBand has been designed for low latency (and high bandwidth) - typically less than 5 µs
• FDR10 (10 Gb/s), FDR (14 Gb/s), EDR (25 Gb/s), and HDR (50 Gb/s)
are commonly used today
• Links can be aggregated for extra bandwidth (e.g. 4X, 8X, etc.)
• BlueHive uses 4X aggregation of EDR (i.e. 100 Gb/s) and 4X
aggregation of FDR10 (i.e. 40 Gb/s bandwidth)
• Copper cables can be used for InfiniBand lengths less than 10 meters
(otherwise optical fiber cables are used)
• Network switches can be used to link all components together
40
BlueHive
41
42
Linux Cluster Hardware
• Necessary hardware components: Compute Nodes, Storage,
Networking
• Also need a login node to provide a system for users to log in and
interact with the system
• Also need a service node to provide a place to run and manage the
control and monitoring software for the system
43
Linux Clusters - Software
• Operating system
• Management and monitoring software
• Job scheduling and launching software
• User software
44
Operating System
• >99% are based on Linux
• Beneficial for multi-user management in a shared environment
• Great for security and permissions (file, executable, etc.)
• Support for open source codes
• Excellent community support
45
Management and Monitoring Software
• xCAT (Extreme Cluster Administration Toolkit)
• Provision OS and node images
• Remotely manage systems - reboot and distributed shell
• Ganglia
• Node use and monitoring system
• Shows CPU, RAM, and I/O usage of nodes
• Can be user accessible
• Zabbix
• Server monitoring software
• Shows node stability and resource usage (CPU, RAM, and I/O)
• Typically not user accessible
46
Job Scheduling and Managing Software
• Torque/Maui
• SLURM
47
Job Scheduling and Managing
• Torque/Maui
• Torque manages the resources
• Maui is the scheduler
[Diagram: Users 1-3 submit run scripts on the login node; the PBS server (Torque) and Maui scheduler on the service node dispatch jobs to compute nodes 0 through N]
48
Job Scheduling and Managing
• Slurm (Simple Linux Utility for Resource Management)
• Very similar to Torque/Maui
• Provides necessary software for starting, stopping, and monitoring
compute jobs and resources
• Allocates and deallocates computing resources on nodes and partitions
• Manages queues of jobs waiting for resources
49
Partitions
• Most multi-user clusters are divided into partitions or subunits that
allow jobs with similar characteristics to run in a common environment
• BlueHive is divided into partitions
• Similar hardware attributes and limitations of running jobs
• User selects which partition to run job
50