Bogazici University, Istanbul, Turkey
"SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!"
Presented by: Dr. Abu Asaduzzaman
Assistant Professor in Computer Architecture and Director of CAPPLab
Department of Electrical Engineering and Computer Science (EECS)
Wichita State University (WSU), USA
June 2, 2014

"SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!"
► Outline
■ Introduction
  Single-Core to Multicore Architectures
■ Performance Improvement
  Simultaneous Multithreading (SMT)
  (SMT enabled) Multicore CPU with GPUs
■ Energy-Efficient Computing
  Dynamic GPU Selection
■ CAPPLab
  "People First"
  Resources
  Research Grants/Activities
■ Discussion
QUESTIONS? Any time, please!

Thank you!
■ Prof. Dr. Can Ozturan
  Chair, ComE Department, Bogazici University, Istanbul, Turkey
■ Prof. Dr. Bayram Yildirim
  Alumnus of Bogazici University; IME Department, Wichita State University
■ Many more…

Introduction - Some Important "Laws"
■ Moore's law
■ Amdahl's law vs. Gustafson's law
■ Law of diminishing returns
■ Koomey's law
■ (Juggling)
  http://www.youtube.com/watch?v=PqBlA9kU8ZE
  http://www.youtube.com/watch?v=S0d3fK9ZHUI

Introduction - Moore's Law
■ The number of transistors on integrated circuits doubles approximately every 18 months.

Introduction - Amdahl's Law vs. Gustafson's Law
■ Amdahl: the speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.
■ Gustafson: computations involving arbitrarily large data sets can be parallelized.

Introduction - Law of Diminishing Returns
■ In all productive processes, adding more of one factor of production, while holding all others constant, will at some point yield lower per-unit returns.

Introduction - Koomey's Law
■ The number of computations per joule of energy dissipated has been doubling approximately every 1.57 years. This trend has been remarkably stable since the 1950s.

Introduction - Single-Core to Multicore Architecture
■ History of Computing
  First recorded use of the word "computer" in 1613 (this is not the beginning)
  Von Neumann architecture (1945): a single memory for data and instructions
  Harvard architecture (1944): separate data memory and instruction memory
■ Single-Core Processors
  In most modern processors: split CL1 (I1, D1), unified CL2, …
  Intel Pentium 4, AMD Athlon Classic, …
■ Popular Programming Languages
  C, …

Introduction - (Single-Core to) Multicore Architecture
[Diagram: input, process/store, output; multi-tasking through time sharing (juggling!); caches not shown. Courtesy: Jernej Barbič, Carnegie Mellon University]

Introduction - Single-Core "Core"
■ A thread is a running "process" on a single core.
[Diagram courtesy: Jernej Barbič, Carnegie Mellon University]

Introduction - Major Steps to Execute an Instruction
[Diagram: 68000 CPU and memory, with data registers D0-D7, address registers A0-A7/A7', PC, SR, IR, ALU, and decoder/control unit on a 16-bit data bus and a 24-bit address bus. The five steps are marked: (1) instruction fetch, (2) instruction decode, (3) operand(s) fetch, (4) instruction execution, (5) write back.]

Introduction - Thread 1: Integer (INT) Operation (Pipelining Technique)
[Diagram: Thread 1 flows through (1) instruction fetch, (2) instruction decode, (3) operand(s) fetch, (4) integer operation in the arithmetic logic unit, and (5) result write back; the floating-point unit is not used.]

Introduction - Thread 2: Floating Point (FP) Operation (Pipelining Technique)
[Diagram: Thread 2 flows through instruction fetch, instruction decode, operand(s) fetch, the floating-point unit, and result write back; the integer arithmetic logic unit is not used.]
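Before asking whether the hardware can overlap an integer thread and a floating-point thread, it helps to see the software side of the same question. The following is a minimal C/OpenMP sketch (not taken from the slides; the loop bounds are illustrative) that creates exactly the two threads discussed here: one doing integer work, one doing floating-point work. Whether they actually run at the same time in one core (SMT), on two cores, or one after the other is decided by the processor and the operating system, which is exactly what the next slides ask.

/* Minimal sketch, assuming OpenMP support (e.g., gcc -fopenmp):
 * one section does integer work, the other floating-point work. */
#include <stdio.h>

int main(void)
{
    long long int_sum = 0;     /* touched only by the integer section        */
    double    fp_sum  = 0.0;   /* touched only by the floating-point section */

    #pragma omp parallel sections
    {
        #pragma omp section                    /* Thread 1: integer operation */
        for (long long i = 0; i < 10000000LL; i++)
            int_sum += i;

        #pragma omp section                    /* Thread 2: floating-point operation */
        for (long long i = 1; i <= 10000000LL; i++)
            fp_sum += 1.0 / (double)i;
    }

    printf("int_sum = %lld, fp_sum = %f\n", int_sum, fp_sum);
    return 0;
}

Compiled without OpenMP, the pragmas are ignored and the two loops simply run one after the other; with OpenMP, the runtime may assign them to two threads.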
Introduction - Threads 1 and 2: INT and FP Operations (Pipelining Technique)
[Diagram: Threads 1 and 2 share instruction fetch, instruction decode, and operand(s) fetch; Thread 1 uses the arithmetic logic unit while Thread 2 uses the floating-point unit, each with its own result write back. POSSIBLE?]

Performance - Threads 1 and 2: INT and FP Operations (Pipelining Technique)
[Same diagram repeated: one integer thread and one floating-point thread sharing the pipeline. POSSIBLE?]

Performance Improvement - Threads 1 and 3: Integer Operations
[Diagram: Threads 1 and 3 both need the integer unit; on a single core they would have to share instruction fetch, instruction decode, operand(s) fetch, the arithmetic logic unit, and result write back. POSSIBLE?]

Performance Improvement - Threads 1 and 3: Integer Operations (Multicore)
[Diagram: Thread 1 runs on Core 1 and Thread 3 on Core 2; each core has its own instruction fetch, instruction decode, operand(s) fetch, arithmetic logic unit, floating-point unit, and result write back. POSSIBLE?]

Performance Improvement - Threads 1, 2, 3, and 4: INT & FP Operations
[Diagram (multicore): Thread 1 (integer) and Thread 2 (floating point) share Core 1; Thread 3 (integer) and Thread 4 (floating point) share Core 2. POSSIBLE?]

More Performance? Threads 1, 2, 3, and 4: INT & FP Operations
[Same diagram repeated: two SMT threads per core on a two-core processor. POSSIBLE?]

"SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!"
► Outline
■ Introduction
  Single-Core to Multicore Architectures
■ Performance Improvement
  Simultaneous Multithreading (SMT)
  (SMT enabled) Multicore CPU with GPUs
■ Energy-Efficient Computing
  Dynamic GPU Selection
■ CAPPLab
  "People First"
  Resources
  Research Grants/Activities
■ Discussion

SMT enabled Multicore CPU with Manycore GPU for Ultimate Performance!
Parallel/Concurrent Computing
Parallel processing - it is not fun! Let's play a game: paying the lunch bill together.

  Friend   Before Eating   After Paying   Total Spent
  A        $10             $1             $9
  B        $10             $1             $9
  C        $10             $1             $9
  Total    $30                            $27

  Total bill: $25; returned: $5; tip: $2.
  Started with $30; spent $29 ($27 + $2). Where did $1 go?

Performance Improvement - Simultaneous Multithreading (SMT)
■ Thread
  A running program (or code segment) is a process; a process may consist of multiple processes/threads.
■ Simultaneous Multithreading (SMT)
  Multiple threads running in a single processor at the same time
  Multiple threads running in multiple processors at the same time
■ Multicore Programming
  Language support: OpenMP, Open MPI, CUDA, C, …
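The SMT slides point to OpenMP and Open MPI for generating and managing multiple threads but do not reproduce any source code, so the following is only a minimal OpenMP/C sketch of thread creation under assumed defaults. The runtime forks a team of threads for the parallel region, each thread reports its own ID, and the team joins at the end of the region.

/* Minimal sketch, assuming OpenMP (e.g., gcc -fopenmp): each thread in the
 * team created for the parallel region prints its ID and the team size. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("Thread %d of %d is running\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   /* implicit barrier and join at the end of the parallel region */
    return 0;
}

The number of threads can be chosen with the OMP_NUM_THREADS environment variable or omp_set_num_threads(); on an SMT-enabled multicore CPU the operating system then maps those threads onto the available hardware threads.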
Performance Improvement - Simultaneous Multithreading (SMT)
■ Example: Generating/Managing Multiple Threads
  OpenMP, Open MPI, C, …

Performance Improvement - Identify Challenges
■ Sequential data-independent problems
  C[] = A[] + B[]   ♦ e.g., C[5] = A[5] + B[5]
  A'[] = A[]        ♦ e.g., A'[5] = A[5]
  Independent elements can be split between Core 1 and Core 2.
  SMT capable multicore processor; CUDA/GPU technology

Performance Improvement - CUDA/GPU Programming
■ GP-GPU Card
  A GPU card with 16 streaming multiprocessors (SMs); inside each SM:
  • 32 cores
  • 64KB shared memory
  • 32K 32-bit registers
  • 2 schedulers
  • 4 special function units
■ CUDA GPGPU Programming Platform

Performance Improvement - CPU-GPU Technology
■ Tasks/Data exchange mechanism
  Serial computations: CPU
  Parallel computations: GPU

Performance Improvement - GPGPU/CUDA Technology
■ The host (CPU) executes a kernel in the GPU in 4 steps
  (Step 1) The CPU allocates GPU memory and copies data to the GPU
  CUDA API: cudaMalloc(), cudaMemcpy()

Performance Improvement - GPGPU/CUDA Technology
■ The host (CPU) executes a kernel in the GPU in 4 steps
  (Step 2) The CPU sends function parameters and instructions to the GPU
  CUDA API: myFunc<<<Blocks, Threads>>>(parameters)

Performance Improvement - GPGPU/CUDA Technology
■ The host (CPU) executes a kernel in the GPU in 4 steps
  (Step 3) The GPU executes the instructions, scheduled in warps
  (Step 4) Results are copied back to host memory (RAM) using cudaMemcpy()
  (A code sketch of these four steps follows the Develop Solutions slide below.)

Performance Improvement - Case Study 1 (data-independent computation without GPU/CUDA)
■ Matrix Multiplication
[Figures: the matrices multiplied and the systems evaluated.]

Performance Improvement - Case Study 1 (data-independent computation without GPU/CUDA)
■ Matrix Multiplication
[Plots: execution time and power consumption.]

Performance Improvement - Case Study 2 (data-dependent computation without GPU/CUDA)
■ Heat Transfer on a 2D Surface
[Plots: execution time and power consumption.]

Performance Improvement - Case Study 3 (data-dependent computation with GPU/CUDA)
■ Fast Effective Lightning Strike Simulation
  The lack of lightning strike protection for composite materials limits their use in many applications.

Performance Improvement - Case Study 3 (data-dependent computation with GPU/CUDA)
■ Fast Effective Lightning Strike Simulation
■ Laplace's Equation
■ Simulation
  CPU only
  CPU/GPU without shared memory
  CPU/GPU with shared memory

Performance Improvement - Case Study 4 (MATLAB vs. GPU/CUDA)
■ Different simulation models
  Traditional sequential program
  CUDA program (no shared memory)
  CUDA program (with shared memory)
  Traditional sequential MATLAB
  Parallel MATLAB
■ CUDA/C parallel programming of the finite difference method based Laplace's equation demonstrates up to 257x speedup and 97% energy savings over a parallel MATLAB implementation while solving a 4K x 4K problem with reasonable accuracy.

Performance Improvement - Identify More Challenges
■ Sequential data-independent problems
  C[] = A[] + B[]   ♦ e.g., C[5] = A[5] + B[5]
  A'[] = A[]        ♦ e.g., A'[5] = A[5]
  SMT capable multicore processor; CUDA/GPU technology
■ Sequential data-dependent problems
  B'[] is computed from B[]   ♦ e.g., B'[5] depends on {B[4], B[5], B[6]}
  Communication needed ♦ between Core 1 and Core 2

Performance Improvement - Develop Solutions
■ Task Regrouping
  Create threads
■ Data Regrouping
  (Step 2 of 5) The CPU copies data to the GPU; CUDA API: cudaMemcpy()
  Regroup data: data for each thread
  Threads with G2s first, then threads with G1s
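To connect the four-step host/kernel flow above (cudaMalloc()/cudaMemcpy(), the <<<Blocks, Threads>>> launch, warp-scheduled execution, and the copy back) with the data-independent example C[] = A[] + B[], here is a minimal CUDA/C sketch. The kernel name vecAdd, the problem size, and the block size are illustrative assumptions, not values taken from the case studies; error checking is omitted for brevity.

/* Minimal sketch of the 4-step host/kernel flow (illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread   */
    if (i < n)
        C[i] = A[i] + B[i];                         /* e.g., C[5] = A[5] + B[5] */
}

int main(void)
{
    const int n = 1 << 20;                          /* 1M elements (illustrative) */
    size_t bytes = n * sizeof(float);

    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    /* Step 1: allocate GPU memory and copy the input data to the GPU */
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    /* Step 2: send function parameters/instructions: kernel<<<Blocks, Threads>>>(...) */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    /* Step 3: the GPU executes the kernel, scheduled warp by warp */
    cudaDeviceSynchronize();

    /* Step 4: copy the results back to host memory (RAM) */
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[5] = %.1f\n", hC[5]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

Because every C[i] depends only on A[i] and B[i], each GPU thread can work on its own element with no communication, which is exactly the data-independent case identified earlier.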
Performance Improvement - Assess the Solutions
■ What is the Key?
■ Synchronization
  (Step 2 of 5) The CPU copies data to the GPU; CUDA API: cudaMemcpy()
  With synchronization / without synchronization
  ♦ speed vs. accuracy
  Threads with G2s first, then threads with G1s

"SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!"
► Outline
■ Introduction
  Single-Core to Multicore Architectures
■ Performance Improvement
  Simultaneous Multithreading (SMT)
  (SMT enabled) Multicore CPU with GP-GPU
■ Energy-Efficient Computing
  Dynamic GPU Selection
■ CAPPLab
  "People First"
  Resources
  Research Grants/Activities
■ Discussion

Energy-Efficient Computing - Kansas Unique Challenge
■ Climate and Energy
  Protect the environment from harm due to climate change
  Save natural energy

Energy-Efficient Computing - "Power" Analysis
■ CPU with multiple GPUs
  GPU usage varies
  [Diagram: one CPU connected to several GPUs.]
■ Power Requirements
  NVIDIA GTX 460 (336-core): 160W [1]
  Tesla C2075 (448-core): 235W [2]
  Intel Core i7 860 (4-core, 8-thread): 150-245W [3, 4]
■ Dynamic GPU Selection
  Depending on
  ♦ the "tasks"/threads
  ♦ GPU usage

Energy-Efficient Computing - CPU-to-GPU Memory Mapping
■ GPU Shared Memory
  Improves performance
  CPU to GPU global memory
  GPU global to shared memory
  (A shared-memory code sketch follows the Resources slide below.)
■ Data Regrouping
  CPU to GPU global memory

Teaching Low-Power HPC Systems - Integrate Research into Education
■ CS 794 - Multicore Architectures Programming
  Multicore Architecture
  Simultaneous Multithreading
  Parallel Programming
  Moore's law
  Amdahl's law
  Gustafson's law
  Law of diminishing returns
  Koomey's law

"SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!"
► Outline
■ Introduction
  Single-Core to Multicore Architectures
■ Performance Improvement
  Simultaneous Multithreading (SMT)
  (SMT enabled) Multicore CPU with GP-GPU
■ Energy-Efficient Computing
  Dynamic GPU Selection
■ CAPPLab
  "People First"
  Resources
  Research Grants/Activities
■ Discussion

WSU CAPPLab - CAPPLab
■ Computer Architecture & Parallel Programming Laboratory (CAPPLab)
  Physical location: 245 Jabara Hall, Wichita State University
  URL: http://www.cs.wichita.edu/~capplab/
  E-mail: capplab@cs.wichita.edu; Abu.Asaduzzaman@wichita.edu
  Tel: +1-316-WSU-3927
■ Key Objectives
  Lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields.
  Teach advanced-level computer systems & architecture, parallel programming, and related courses.

WSU CAPPLab - "People First"
■ Students
  Kishore Konda Chidella, PhD student
  Mark P. Allen, MS student
  Chok M. Yip, MS student
  Deepthi Gummadi, MS student
■ Collaborators
  Mr. John Metrow, Director of WSU HiPeCC
  Dr. Larry Bergman, NASA Jet Propulsion Laboratory (JPL)
  Dr. Nurxat Nuraje, Massachusetts Institute of Technology (MIT)
  Mr. M. Rahman, Georgia Institute of Technology (Georgia Tech)
  Dr. Henry Neeman, University of Oklahoma (OU)

WSU CAPPLab - Resources
■ Hardware
  3 CUDA servers: CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6GB GDDR5 memory
  2 CUDA PCs: CPU: Xeon E5506, …
  Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64GB DDR3, Kepler card) via remote access to WSU HiPeCC
  2 CUDA-enabled laptops
  More…
■ Software
  CUDA, OpenMP, and Open MPI (C/C++ support)
  MATLAB, VisualSim, CodeWarrior, more (as needed)
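As noted under CPU-to-GPU Memory Mapping, GPU shared memory improves performance because a thread block can stage data from global memory into fast on-chip shared memory and then reuse it. The kernel below is only a rough sketch under assumed details: a 1D three-point averaging stencil with a tile size of 256, chosen to mirror the data-dependent example in which B'[5] depends on {B[4], B[5], B[6]}. It is not the lab's actual heat-transfer or lightning-strike kernel.

/* Rough sketch (assumed example): each block stages a tile of B, plus one
 * halo cell on each side, into shared memory, then computes B'[i] from
 * {B[i-1], B[i], B[i+1]} without re-reading global memory.
 * Launch with TILE threads per block, e.g.:
 *   stencil1D<<<(n + TILE - 1) / TILE, TILE>>>(dB, dBp, n);          */
#define TILE 256

__global__ void stencil1D(const float *B, float *Bp, int n)
{
    __shared__ float tile[TILE + 2];                 /* tile plus two halo cells */
    int g  = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index     */
    int l  = threadIdx.x + 1;                        /* local index inside tile  */
    int gc = min(g, n - 1);                          /* clamp loads at the edges */

    tile[l] = B[gc];                                 /* GPU global -> shared     */
    if (threadIdx.x == 0)
        tile[0] = B[max(gc - 1, 0)];                 /* left halo cell           */
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = B[min(gc + 1, n - 1)];         /* right halo cell          */
    __syncthreads();                                 /* wait for all loads       */

    if (g < n)
        Bp[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

The __syncthreads() call is the synchronization point discussed on the Assess the Solutions slide: dropping it would be faster but could read halo cells before they are written, trading accuracy for speed.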
WSU CAPPLab - Scholarly Activities
■ WSU became "CUDA Teaching Center" for 2012-13
  Grants from NSF, NVIDIA, M2SYS, Wiktronics
  Teaching Computer Architecture and Parallel Programming
■ Publications
  Journal: 21 published; 3 under preparation
  Conference: 57 published; 2 under review; 6 under preparation
  Book Chapter: 1 published; 1 under preparation
■ Outreach
  USD 259 Wichita Public Schools
  Wichita Area Technical and Community Colleges
  Open to collaborate

WSU CAPPLab - Research Grants/Activities
■ Grants
  WSU: ORCA
  NSF: KS NSF EPSCoR First Award
  M2SYS-WSU Biometric Cloud Computing Research Grant
  Teaching (Hardware/Financial) Award from NVIDIA
  Teaching (Hardware/Financial) Award from Xilinx
■ Proposals
  NSF: CAREER (working/pending)
  NASA: EPSCoR (working/pending)
  U.S.: Army, Air Force, DoD, DoE
  Industry: Wiktronics LLC, NetApp Inc, M2SYS Technology

Bogazici University; Istanbul, Turkey; 2014
"SMT/GPU Provides High Performance; at WSU CAPPLab, we can help you!"
QUESTIONS?
Contact: Abu Asaduzzaman
E-mail: abuasaduzzaman@ieee.org
Phone: +1-316-978-5261
http://www.cs.wichita.edu/~capplab/
Thank You!