SMT/GPU for HPC at CAPPLab! - Wichita State University

Bogazici University
Istanbul, Turkey
“SMT/GPU Provides High Performance;
at WSU CAPPLab, we can help you!”
Presented by:
Dr. Abu Asaduzzaman
Assistant Professor in Computer Architecture and Director of CAPPLab
Department of Electrical Engineering and Computer Science (EECS)
Wichita State University (WSU), USA
June 2, 2014
“SMT/GPU Provides High Performance;
at WSU CAPPLab, we can help you!”
Outline
■ Introduction
 Single-Core to Multicore Architectures
■ Performance Improvement
 Simultaneous Multithreading (SMT)
 (SMT enabled) Multicore CPU with GPUs
■ Energy-Efficient Computing
 Dynamic GPU Selection
■ CAPPLab
 “People First”
 Resources
 Research Grants/Activities
■ Discussion
QUESTIONS? Any time, please!
Thank you!
■ Prof. Dr. Can Ozturan
 Chair, ComE Department
 Bogazici University, Istanbul, Turkey
■ Prof. Dr. Bayram Yildirim
 Alumnus, Bogazici University
 IME Department
 Wichita State University
■ Many more…
Introduction
Some Important “Laws”
■ Moore’s law
■ Amdahl’s law vs. Gustafson’s law
■ Law of diminishing returns
■ Koomey's law
■ (Juggling)
 http://www.youtube.com/watch?v=PqBlA9kU8ZE
 http://www.youtube.com/watch?v=S0d3fK9ZHUI
Introduction
Moore’s Law
■ The number of transistors on integrated circuits doubles approximately every 18 months.
Introduction
Amdahl’s law Vs. Gustafson’s law
■ Amdahl’s law: the speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.
■ Gustafson’s law: computations involving arbitrarily large data sets can be efficiently parallelized.
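In the standard formulation, with p the parallelizable fraction of the work and N the number of processors, the two laws read:

\[
S_{\text{Amdahl}}(N) = \frac{1}{(1-p) + p/N},
\qquad
S_{\text{Gustafson}}(N) = (1-p) + pN
\]

Amdahl fixes the problem size, so speedup is capped at 1/(1-p); Gustafson grows the problem with N, so the serial fraction matters less and less.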
Introduction
Law of diminishing returns
■ In all productive processes, adding more of one factor of production, while holding all others constant, will at some point yield lower per-unit returns.
Introduction
Koomey's law
■ The number of computations per joule of energy dissipated has been doubling approximately every 1.57 years; this trend has been remarkably stable since the 1950s.
Introduction
Single-Core to Multicore Architecture
■ History of Computing
 The word “computer” appeared in 1613 (computing itself is much older)
 Von Neumann architecture (1945) – single memory for both data and instructions
 Harvard architecture (1944) – separate data memory and instruction memory
■ Single-Core Processors
 In most modern processors: split CL1 cache (I1, D1), unified CL2 cache, …
 Intel Pentium 4, AMD Athlon Classic, …
■ Popular Programming Languages
 C, …
Introduction
(Single-Core to) Multicore Architecture
[Figure: single-core system block diagram (cache not shown) – input, process/store, output; multi-tasking via time sharing (juggling!)]
Courtesy: Jernej Barbič, Carnegie Mellon University
Introduction
Single-Core → “Core”
A thread is a running “process” on a single core
Courtesy: Jernej Barbič, Carnegie Mellon University
Introduction
Major Steps to Execute an Instruction
[Figure: Motorola 68000 CPU and memory block diagram (data registers D0–D7, address registers A0–A7/A7’, PC, SR, IR, ALU, decoder/control unit; 16-bit data bus, 24-bit address bus) annotated with the five major steps:
1: Instruction Fetch (I.F.)
2: Instruction Decode (I.D.)
3: Operand Fetch (O.F.)
4: Instruction Execution (I.E.)
5: Result Write Back (W.B.)]
Introduction
Thread 1: Integer (INT) Operation (Pipelining Technique)
[Figure: five-stage pipeline running Thread 1 – (1) Instruction Fetch, (2) Instruction Decode, (3) Operand(s) Fetch, (4) Integer Operation in the Arithmetic Logic Unit (the Floating Point unit is idle), (5) Result Write Back]
Introduction
Thread 2: Floating Point (FP) Operation (Pipelining Technique)
[Figure: the same five-stage pipeline running Thread 2 – Instruction Fetch, Instruction Decode, Operand(s) Fetch, Floating Point Operation (the Integer/ALU path is idle), Result Write Back]
Introduction
Threads 1 and 2: INT and FP Operations (Pipelining Technique)
[Figure: one front end (Instruction Fetch, Instruction Decode, Operand(s) Fetch) feeding the Integer/ALU path for Thread 1 and the Floating Point path for Thread 2, followed by Result Write Back. POSSIBLE?]
Performance
Threads 1 and 2: INT and FP Operations (Pipelining Technique)
[Figure: the same diagram repeated as the transition into this section – shared front end, Integer path (Thread 1), Floating Point path (Thread 2), Result Write Back. POSSIBLE?]
Performance Improvement
Threads 1 and 3: Integer Operations
[Figure: Threads 1 and 3 both need the Integer Operation path through the one Arithmetic Logic Unit of a single core (Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation, Result Write Back). POSSIBLE?]
Performance Improvement
Threads 1 and 3: Integer Operations (Multicore)
[Figure: Core 1 runs Thread 1 and Core 2 runs Thread 3, each through its own pipeline (Instruction Fetch, Instruction Decode, Operand(s) Fetch, Integer Operation in its own Arithmetic Logic Unit, Result Write Back). POSSIBLE?]
Performance Improvement
Threads 1, 2, 3, and 4: INT & FP Operations (Multicore)
[Figure: Core 1 runs Thread 1 (Integer) and Thread 2 (Floating Point); Core 2 runs Thread 3 (Integer) and Thread 4 (Floating Point). Each core’s front end (Instruction Fetch, Instruction Decode, Operand(s) Fetch) feeds its Integer/ALU and Floating Point units, followed by Result Write Back. POSSIBLE?]
More Performance?
Threads 1, 2, 3, and 4: INT & FP Operations
[Figure: the same two-core, four-thread diagram repeated, asking whether even more performance is POSSIBLE – the lead-in to combining an SMT-enabled multicore CPU with GPUs.]
“SMT/GPU Provides High Performance;
at WSU CAPPLab, we can help you!”
Outline
■ Introduction
 Single-Core to Multicore Architectures
■ Performance Improvement
 Simultaneous Multithreading (SMT)
 (SMT enabled) Multicore CPU with GPUs
■ Energy-Efficient Computing
 Dynamic GPU Selection
■ CAPPLab
 “People First”
 Resources
 Research Grants/Activities
■ Discussion
SMT enabled Multicore CPU with Manycore GPU
for Ultimate Performance!
Parallel/Concurrent Computing
Parallel Processing – It is not fun!
 Let’s play a game: Paying the lunch bill together
  Friend   Before Eating   After Paying   Total Spent
  A        $10             $1             $9
  B        $10             $1             $9
  C        $10             $1             $9
  Total    $30             $3             $27

  Total Bill: $25    Return: $5    Tip: $2

 Started with $30; spent $29 ($27 + $2)
 Where did $1 go?
Performance Improvement
Simultaneous Multithreading (SMT)
■ Thread
 A running program (or code segment) is a process
 Process → processes / threads
■ Simultaneous Multithreading (SMT)
 Multiple threads running in a single processor at the same time
 Multiple threads running in multiple processors at the same time
■ Multicore programming language support
 OpenMP, Open MPI, CUDA, … (C/C++)
Performance Improvement
Simultaneous Multithreading (SMT)
■ Example: generating/managing multiple threads
 OpenMP, Open MPI (C/C++) – a minimal OpenMP sketch follows
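The example on the original slide is an image; as a stand-in, here is a minimal OpenMP sketch in C of generating and managing multiple threads (the array and variable names are illustrative, not from the slide):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      int results[8];

      /* Fork a team of threads; iterations are divided among the cores. */
      #pragma omp parallel for
      for (int i = 0; i < 8; i++) {
          results[i] = i * i;                      /* independent work per iteration */
          printf("thread %d computed results[%d]\n",
                 omp_get_thread_num(), i);
      }
      return 0;   /* implicit join: all threads finish before main returns */
  }

Compiled with OpenMP enabled (e.g., gcc -fopenmp), the loop iterations run concurrently on the available cores.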
Performance Improvement
Identify Challenges
■ Sequential data-independent problems
 C[] ← A[] + B[]
♦ e.g., C[5] ← A[5] + B[5]
 A’[] ← A[]
♦ e.g., A’[5] ← A[5]
 Independent elements can be split across cores (Core 1, Core 2)
 SMT-capable multicore processor; CUDA/GPU technology (see the kernel sketch below)
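The data-independent update C[i] ← A[i] + B[i] maps directly onto one GPU thread per element; a minimal CUDA kernel sketch (the kernel and parameter names are mine, not from the slides):

  /* Each thread computes one independent element, e.g. thread 5 computes C[5]. */
  __global__ void vecAdd(const float *A, const float *B, float *C, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          C[i] = A[i] + B[i];
  }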
Performance Improvement
■ CUDA/GPU Programming
■ GP-GPU Card
 A GPU card with 16 streaming multiprocessors (SMs)
 Inside each SM:
• 32 cores
• 64 KB shared memory
• 32K 32-bit registers
• 2 schedulers
• 4 special function units
■ CUDA
 GPGPU Programming Platform
Performance Improvement
CPU-GPU Technology
■ Tasks/Data exchange mechanism
 Serial Computations – CPU
 Parallel Computations - GPU
Performance Improvement
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel on the GPU in 4 steps
(Step 1) CPU allocates GPU memory and copies data to the GPU
CUDA API: cudaMalloc(), cudaMemcpy()
Performance Improvement
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel on the GPU in 4 steps
(Step 2) CPU sends function parameters and instructions to the GPU
CUDA API: myFunc<<<Blocks, Threads>>>(parameters)
Performance Improvement
GPGPU/CUDA Technology
■ The host (CPU) executes a kernel on the GPU in 4 steps
(Step 3) GPU executes the instructions, scheduled in warps
(Step 4) Results are copied back to host memory (RAM) using cudaMemcpy()
A combined sketch of all four steps follows.
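Putting the four steps together, a minimal host-side CUDA program (the kernel name myFunc follows the slide; the per-element work, sizes, and variable names are illustrative):

  #include <cuda_runtime.h>

  __global__ void myFunc(float *d, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) d[i] *= 2.0f;                             /* illustrative per-element work */
  }

  int main(void)
  {
      const int n = 1024;
      size_t bytes = n * sizeof(float);
      float h_data[1024] = {0};                            /* host buffer (RAM) */
      float *d_data;

      cudaMalloc(&d_data, bytes);                                  /* Step 1: allocate on GPU   */
      cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   /* Step 1: copy data to GPU  */

      myFunc<<<(n + 255) / 256, 256>>>(d_data, n);                 /* Step 2: launch the kernel */
      /* Step 3: the GPU executes the kernel, scheduled in warps of 32 threads. */

      cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   /* Step 4: copy results back */
      cudaFree(d_data);
      return 0;
  }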
Performance Improvement
Case Study 1 (data independent computation
without GPU/CUDA)
■ Matrix Multiplication
[Figure: the matrices multiplied and the systems evaluated]
Performance Improvement
Case Study 1 (data independent computation
without GPU/CUDA)
■ Matrix Multiplication
[Figure: execution time and power consumption results]
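For reference, the kind of data-independent matrix multiplication such a case study measures, parallelized on a multicore CPU with OpenMP rather than CUDA (a minimal sketch; the matrix size and function name are illustrative, not the study’s actual code):

  #define N 512

  /* C = A x B; every C[i][j] is independent, so rows are split across cores. */
  void matmul(const double A[N][N], const double B[N][N], double C[N][N])
  {
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) {
              double sum = 0.0;
              for (int k = 0; k < N; k++)
                  sum += A[i][k] * B[k][j];
              C[i][j] = sum;
          }
  }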
Performance Improvement
Case Study 2 (data dependent computation
without GPU/CUDA)
■ Heat Transfer on 2D Surface
[Figure: execution time and power consumption results]
Performance Improvement
Case Study 3 (data dependent computation with
GPU/CUDA)
■ Fast Effective Lightning Strike Simulation
 The lack of lightning strike protection for composite materials limits their use in many applications.
Performance Improvement
Case Study 3 (data dependent computation with
GPU/CUDA)
■ Fast Effective Lightning Strike Simulation
■ Laplace’s Equation (finite-difference form shown below)
■ Simulation
 CPU Only
 CPU/GPU w/o shared memory
 CPU/GPU with shared memory
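For reference, Laplace’s equation and the standard five-point finite-difference (Jacobi) update such a simulation iterates (the grid notation is mine):

\[
\nabla^2 V = \frac{\partial^2 V}{\partial x^2} + \frac{\partial^2 V}{\partial y^2} = 0,
\qquad
V_{i,j}^{\text{new}} = \tfrac{1}{4}\left(V_{i-1,j} + V_{i+1,j} + V_{i,j-1} + V_{i,j+1}\right)
\]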
Performance Improvement
Case Study 4 (MATLAB Vs GPU/CUDA)
■ Different simulation models
 Traditional sequential program
 CUDA program (no shared memory)
 CUDA program (with shared memory)
 Traditional sequential MATLAB
 Parallel MATLAB
■ CUDA/C parallel programming of the finite-difference-based Laplace’s equation solver demonstrates up to 257x speedup and 97% energy savings over a parallel MATLAB implementation while solving a 4K x 4K problem with reasonable accuracy.
Performance Improvement
Identify More Challenges
■ Sequential data-independent problems
 C[] ← A[] + B[]
♦ e.g., C[5] ← A[5] + B[5]
 A’[] ← A[]
♦ e.g., A’[5] ← A[5]
 Independent elements can be split across cores (Core 1, Core 2)
 SMT-capable multicore processor; CUDA/GPU technology
■ Sequential data-dependent problems
 B’[] ← B[]
♦ e.g., B’[5] ← {B[4], B[5], B[6]}
 Communication needed between Core 1 and Core 2 when neighboring elements live on different cores (see the sketch below)
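A plain-C sketch of the dependence (names are illustrative): computing B’[i] needs B[i-1] and B[i+1], so when the array is split between two cores, the elements at the boundary must be exchanged.

  /* 3-point stencil: each output depends on its left and right neighbours. */
  void smooth(const float *B, float *Bp, int n)
  {
      for (int i = 1; i < n - 1; i++)
          Bp[i] = (B[i-1] + B[i] + B[i+1]) / 3.0f;
      /* If Core 1 owns B[0 .. n/2-1] and Core 2 owns B[n/2 .. n-1], the elements
         around i = n/2 (the halo) must be communicated between the two cores.   */
  }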
Performance Improvement
Develop Solutions
■ Task Regrouping
 Create threads
■ Data Regrouping
 Regroup data; data for each thread
 Threads with G2s first
 Then, threads with G1s
[Figure: (Step 2 of 5) CPU copies the regrouped data to the GPU using cudaMemcpy()]
Performance Improvement
Assess the Solutions
■ What is the key? Synchronization (see the CUDA sketch below)
 With synchronization
 Without synchronization
♦ Trade-off: speed vs. accuracy
 Threads with G2s first
 Then, threads with G1s
[Figure: (Step 2 of 5) CPU copies the regrouped data to the GPU using cudaMemcpy()]
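On the GPU, the synchronization in question is typically a barrier between loading shared memory and using it; a minimal CUDA sketch (kernel and variable names are mine, not from the slides):

  #define BLOCK 256

  /* Shared-memory 3-point stencil: threads in a block cooperate, so they must synchronize. */
  __global__ void smoothShared(const float *B, float *Bp, int n)
  {
      __shared__ float s[BLOCK + 2];                     /* block data plus halo   */
      int gi = blockIdx.x * blockDim.x + threadIdx.x;    /* global index           */
      int li = threadIdx.x + 1;                          /* local index (shifted)  */

      if (gi < n) s[li] = B[gi];
      if (threadIdx.x == 0 && gi > 0)                  s[0]         = B[gi - 1];
      if (threadIdx.x == blockDim.x - 1 && gi < n - 1) s[BLOCK + 1] = B[gi + 1];

      __syncthreads();   /* without this barrier, results may be fast but wrong */

      if (gi > 0 && gi < n - 1)
          Bp[gi] = (s[li - 1] + s[li] + s[li + 1]) / 3.0f;
  }

Launched with BLOCK threads per block, each block stages its slice of B[] in shared memory once, then every thread reads its three neighbours from the fast on-chip copy.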
“SMT/GPU Provides High Performance;
at WSU CAPPLab, we can help you!”
Outline
■ Introduction
 Single-Core to Multicore Architectures
■ Performance Improvement
 Simultaneous Multithreading (SMT)
 (SMT enabled) Multicore CPU with GP-GPU
■ Energy-Efficient Computing
 Dynamic GPU Selection
■ CAPPLab
 “People First”
 Resources
 Research Grants/Activities
■ Discussion
Energy-Efficient Computing
Kansas Unique Challenge
■ Climate and Energy
 Protect the environment from harm due to climate change
 Conserve natural energy resources
Energy-Efficient Computing
“Power” Analysis
■ CPU with multiple GPUs
 GPU usages vary
[Figure: one CPU connected to multiple GPUs]
■ Power Requirements
 NVIDIA GTX 460 (336-core) – 160 W [1]
 Tesla C2075 (448-core) – 235 W [2]
 Intel Core i7 860 (4-core, 8-thread) – 150-245 W [3, 4]
■ Dynamic GPU Selection
 Depending on
♦ the “tasks”/threads
♦ GPU usages
Energy-Efficient Computing
CPU-to-GPU Memory Mapping
■ GPU Shared Memory
 Improves performance
 CPU to GPU global memory
 GPU global to shared
■ Data Regrouping
 CPU to GPU global memory
Teaching Low-Power HPC Systems
Integrate Research into Education
■ CS 794 – Multicore Architectures Programming
 Multicore Architecture
 Simultaneous Multithreading
 Parallel Programming
 Moore’s law
 Amdahl’s law
 Gustafson’s law
 Law of diminishing returns
 Koomey's law
“SMT/GPU Provides High Performance;
at WSU CAPPLab, we can help you!”
Outline
■ Introduction
 Single-Core to Multicore Architectures
■ Performance Improvement
 Simultaneous Multithreading (SMT)
 (SMT enabled) Multicore CPU with GP-GPU
■ Energy-Efficient Computing
 Dynamic GPU Selection
■ CAPPLab
 “People First”
 Resources
 Research Grants/Activities
■ Discussion
WSU CAPPLab
CAPPLab
■ Computer Architecture & Parallel Programming Laboratory (CAPPLab)
 Physical location: 245 Jabara Hall, Wichita State University
 URL: http://www.cs.wichita.edu/~capplab/
 E-mail: capplab@cs.wichita.edu; Abu.Asaduzzaman@wichita.edu
 Tel: +1-316-WSU-3927
■ Key Objectives
 Lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields.
 Teach advanced-level computer systems & architecture, parallel programming, and related courses.
WSU CAPPLab
“People First”
■ Students
 Kishore Konda Chidella, PhD Student
 Mark P. Allen, MS Student
 Chok M. Yip, MS Student
 Deepthi Gummadi, MS Student
■ Collaborators
 Mr. John Metrow, Director of WSU HiPeCC
 Dr. Larry Bergman, NASA Jet Propulsion Laboratory (JPL)
 Dr. Nurxat Nuraje, Massachusetts Institute of Technology (MIT)
 Mr. M. Rahman, Georgia Institute of Technology (Georgia Tech)
 Dr. Henry Neeman, University of Oklahoma (OU)
WSU CAPPLab
Resources
■ Hardware
 3 CUDA servers – CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8 GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6 GB GDDR5 memory
 2 CUDA PCs – CPU: Xeon E5506, …
 Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3, Kepler card) via remote access to WSU HiPeCC
 2 CUDA-enabled laptops
 More …
■ Software
 CUDA, OpenMP, and Open MPI (C/C++ support)
 MATLAB, VisualSim, CodeWarrior, more (as needed)
WSU CAPPLab
Scholarly Activities
■ WSU became a “CUDA Teaching Center” for 2012-13
 Grants from NSF, NVIDIA, M2SYS, Wiktronics
 Teaching Computer Architecture and Parallel Programming
■ Publications
 Journal: 21 published; 3 under preparation
 Conference: 57 published; 2 under review; 6 under preparation
 Book Chapter: 1 published; 1 under preparation
■ Outreach
 USD 259 Wichita Public Schools
 Wichita Area Technical and Community Colleges
 Open to collaborate
WSU CAPPLab
Research Grants/Activities
■ Grants
 WSU: ORCA
 NSF – KS NSF EPSCoR First Award
 M2SYS-WSU Biometric Cloud Computing Research Grant
 Teaching (Hardware/Financial) Award from NVIDIA
 Teaching (Hardware/Financial) Award from Xilinx
■ Proposals
 NSF: CAREER (working/pending)
 NASA: EPSCoR (working/pending)
 U.S.: Army, Air Force, DoD, DoE
 Industry: Wiktronics LLC, NetApp Inc, M2SYS Technology
Bogazici University; Istanbul, Turkey; 2014
“SMT/GPU Provides High Performance;
at WSU CAPPLab, we can help you!”
QUESTIONS?
Contact: Abu Asaduzzaman
E-mail: abuasaduzzaman@ieee.org
Phone: +1-316-978-5261
http://www.cs.wichita.edu/~capplab/
Thank You!