Current Trends in
High Performance Computing
Chokchai Box Leangsuksun, PhD
SWEPCO Endowed Professor*, Computer Science
Director, High Performance Computing Initiative
Louisiana Tech University
box@latech.edu
*SWEPCO endowed professorship is made possible by LA Board of Regents
Outline
•  What is HPC?
•  Current Trends
•  More on PS3 and GPU computing
•  Conclusion
Mainstream CPUs
•  CPU speed plateaus at 3–4 GHz
•  More cores in a single chip
–  Dual/Quad core is here now
–  Manycore (GPGPU)
•  Traditional applications won’t get a free ride
•  Conversion to parallel computing (HPC, MT)
[Diagram: CPU clock speed trend, showing the 3–4 GHz cap]
This diagram is from the “no free lunch” article in DDJ
New trends in computing
•  Old & current – SMP, Cluster
•  Multicore computers
–  Intel Core 2 Duo
–  AMD Athlon 64 X2
•  Many-core accelerators
–  GPGPU, FPGA, Cell
•  Many brains in one computer
•  Rather than increasing CPU frequency
•  Harness many computers – cluster computing
What is HPC?
•  High Performance Computing – Parallel Computing, Supercomputing
–  Achieve the fastest possible computing outcome
–  Subdivide a very large job into many pieces
–  Enabled by multiple high-speed CPUs, networking, software & programming paradigms – fastest possible solution
–  Technologies that help solve non-trivial tasks in scientific, engineering, medical, business, entertainment, and other domains
•  Time to insights, Time to discovery, Time to market
Parallel Programming Concepts
Conventional serial execution: the problem is represented as a series of instructions that are executed by one CPU.
Parallel execution of a problem involves partitioning the problem into multiple executable parts (tasks) that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.
Parallel computing takes advantage of concurrency to:
•  Solve larger problems in less time
•  Save on wall clock time
•  Overcome memory constraints
•  Utilize non-local resources
[Diagram: a serial problem as one instruction stream on one CPU vs. a problem partitioned into tasks running on multiple CPUs]
Source from Thomas Sterling’s intro to HPC
HPC Applications and Major Industries
•  Finite Element Modeling
–  Auto/Aero
•  Fluid Dynamics
–  Auto/Aero, Consumer Packaged Goods Mfgs,
Process Mfg, Disaster Preparedness (tsunami)
•  Imaging
–  Seismic & Medical
•  Finance & Business
–  Banks, Brokerage Houses (Regression Analysis,
Risk, Options Pricing, What if, …)
–  Wal-Mart uses HPC in their operations
•  Molecular Modeling
–  Biotech and Pharmaceuticals
Complex Problems, Large Datasets, Long Runs
This slide is from Intel presentation “Technologies for Delivering Peak Performance on HPC and Grid Applications”
HPC Drives Knowledge Economy
Life Science Problem – an example of Protein Folding
•  It takes a compute-year (in serial mode) to run a molecular dynamics simulation of a protein folding problem
• Excerpted from IBM David Klepacki’s The future of HPC
• Petaflop = a thousand trillion floating point operations per second
Disaster Preparedness - example
•  Project LEAD
–  Severe weather prediction (tornado) – OU leads
•  HPC & dynamic adaptation to weather forecasting
•  Professor Seidel’s LSU CCT
–  Hurricane route prediction
–  Emergency preparedness
–  Accuracy of prediction: 1 mile² = $1M
HPC accelerates a product
•  FE analysis on 1 CPU
–  1,000,000 elements
–  Numerical processing for 1 element = 0.1 secs
–  One computer will take 100,000 secs = 27.7 hrs
•  With, say, 100 CPUs
–  1,000 secs ≈ 0.28 hr ≈ 16 mins
Avian Flu Pandemic Modeled on a Supercomputer
• MIDAS (Models of Infectious Disease Agent Study) program
• The large-scale, stochastic simulation model examines the nationwide spread of a pandemic influenza virus strain
• A simulation starts with 2 passengers infected with avian flu arriving at LAX
• The simulation rolls out a city-to-city and census-tract-level picture of the spread of infection
• A synthetic population of 281 million people over the course of 180 days
• It is a very large-scale, complex, multi-variable simulation
Avian Flu Pandemic (90 days)
Timothy C. Germann, Kai Kadau, Catherine A. Macken (Los Alamos National Laboratory); Ira M. Longini Jr. (Emory University)
Source from www.lanl.gov
Avian Flu Pandemic (II)
• The results show that advance preparation of a modestly effective vaccine in large quantities appears to be preferable to waiting for the development of a well-matched vaccine that may come too late.
• The simulation models a synthetic population that matches U.S. census demographics and worker mobility data by randomly assigning the simulated individuals to households, workplaces, schools, and the like.
• The models serve as virtual laboratories to study how infectious diseases spread and which intervention strategies are most effective.
• Run on the Los Alamos supercomputer known as Pink, a 1,024-node (2,048-processor) LinuxBIOS/BProc cluster with 2 GB of memory per node.
Source from www.lanl.gov
Significant indicators – why HPC now?
•  Mainstream computers with multi-cores (Intel or AMD)
–  In the past 1-2 years, CPU speed has flattened at 3+ GHz
–  More CPUs in one chip – dual-core, multi-core chips
–  Traditional software won’t take advantage of these new processors
–  Personal/Desktop Supercomputing
•  Many real problems are highly computationally intensive
–  NSA uses supercomputing to do data mining
–  DOE – fusion, plasma, energy related (including weaponry)
–  Helps solve many other important areas (nanotech, life science, etc.)
–  Product design, ERM/Inventory Management
•  Industry and government giants have recently embraced HPC
–  Bush’s State of the Union speech – supercomputing was one of the 3 main S&T focuses
–  Bill Gates’ keynote speech at SC05 – MS goes after HPC
•  Google search engine - 100,000 nodes
•  Playstation 3 is a personal supercomputing platform
•  Hollywood (Entertainment) is HPC-bound (Pixar – more than 3,000 CPUs to render animation)
HPC preparedness
•  Build a workforce that understands the HPC paradigm & its applications
–  HPC/Grid curriculum in IT/CS/CE/ICT
–  Offer HPC-enabling tracks to other disciplines (engineering, life science, physics, computational chemistry, business, etc.)
–  Train the business community
–  Bring awareness to the public
•  National and strategic policies
•  Improve Infrastructure
Pause here
•  Switch to a tour of the machine rooms – clusters, our lab – to show what students will be using.
•  Get students’ info on a signup sheet for accounts on our clusters (azul, quadcore, GPU and PS3).
•  Intro to Linux
•  Then continue on HPC101
HPC 101
How to Run Applications Faster ?
•  There are 3 ways to improve performance:
– Work Harder
– Work Smarter
– Get more Help
•  Computer Analogy
– Using faster hardware
– Optimized algorithms and techniques used to solve
computational tasks
– Multiple computers to solve a particular task
Parallel Programming Concepts
[Diagram repeated from earlier: a problem partitioned into tasks executing concurrently on multiple CPUs]
Source from Thomas Sterling’s intro to HPC
HPC objective
•  High Performance Computing – Parallel Computing, Supercomputing
–  Achieve the fastest possible computing outcome
–  Subdivide a very large job into many pieces
–  Enabled by multiple high-speed CPUs, networking, software & programming paradigms – fastest possible solution
–  Technologies that help solve non-trivial tasks in scientific, engineering, medical, business, entertainment, and other domains
Flynn’s Taxonomy of Computer Architectures
•  SISD - Single Instruction/Single Data
•  SIMD - Single Instruction/Multiple Data
•  MISD - Multiple Instruction/Single Data
•  MIMD - Multiple Instruction/Multiple Data
Single Instruction/Single Data
PU – Processing Unit
Your desktop, before the spread of dual core CPUs
Slide Source: Wikipedia, Flynn’s Taxonomy
Flavors of SISD
More on pipelining…
Single Instruction/Multiple Data
Processors that execute the same instruction on multiple pieces of data: NVIDIA GPUs
Slide Source: Wikipedia, Flynn’s Taxonomy
Single Instruction/Multiple Data
•  Each core runs the same set of instructions on different data
•  Example:
–  GPGPU: processes the pixels of an image in parallel (see the kernel sketch below)
Slide Source: Klimovitski & Macri, Intel
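A minimal CUDA kernel sketch of the idea above (an illustrative example, not from the slides; the kernel name and parameters are assumptions): every thread runs the same instructions, each on a different pixel.

// Illustrative CUDA sketch: each thread brightens one pixel of a grayscale image.
__global__ void brighten(unsigned char *pixels, int n, int delta)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique pixel index for this thread
    if (i < n) {
        int v = pixels[i] + delta;                   // same instruction, different data
        pixels[i] = (v > 255) ? 255 : v;
    }
}
// Host-side launch, one thread per pixel, 256 threads per block:
//   brighten<<<(n + 255) / 256, 256>>>(d_pixels, n, 40);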
SISD versus SIMD
Writing a compiler for SIMD architectures is VERY difficult
(inter-thread communication complicates the picture…)
Slide Source: ars technica, Peakstream article
Multiple Instruction/Single Data
Pipeline: CMU Warp machine.
Slide Source: Wikipedia, Flynn’s Taxonomy
Multiple Instruction/Multiple Data
e.g., multicore systems are based on the MIMD architecture plus a programming paradigm such as OpenMP or multithreading (a small OpenMP sketch follows below)
Slide Source: Wikipedia, Flynn’s Taxonomy
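A minimal OpenMP sketch of the shared-memory MIMD style mentioned above (an illustrative example; the array and sizes are assumptions). Each thread (core) executes a different chunk of the loop on shared data; compile with, e.g., gcc -fopenmp.

/* Illustrative OpenMP sketch: shared-memory MIMD on a multicore CPU. */
#include <omp.h>
#include <stdio.h>

#define N 1000000
static double a[N];

int main(void)
{
    double sum = 0.0;
    /* Each thread executes a different chunk of the iterations on shared data. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }
    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}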
Multiple Instruction/Multiple Data
•  The sky is the limit: each PU is free to do as it pleases
•  Can be of either the shared memory or the distributed memory category
Current HPC Hardware
•  Traditionally HPC has adopted expensive parallel hardware:
–  Massively Parallel Processors (MPP)
–  Symmetric Multi-Processors (SMP)
•  Cluster Computers
•  Recent trends in HPC …
•  Multicore systems
•  Heterogeneous Computing with Accelerator Boards (GPGPU, FPGA)
HPC cluster
• Login
• Compile
• Submit job
• At least 2 connections
• Run tasks
Parallel Programming Env
•  Parallel Programming Environments and Tools
–  Threads (PCs, SMPs, NOW..)
•  POSIX Threads
•  Java Threads
–  MPI (see the small MPI sketch after this list)
•  Linux, NT, on many supercomputers
–  OpenMP (predominantly on SMP)
–  PVM (old)
–  UPC, Co-array Fortran
–  CUDA, Brook+, OpenCL
–  Software DSMs (Shmem)
–  Compilers
–  RAD (rapid application development) tools
–  Debuggers
–  Performance Analysis Tools
–  Visualization Tools
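As a concrete taste of the message-passing model listed above, here is a minimal MPI sketch (an illustrative example, not from the slides). Compile with mpicc and launch with, e.g., mpirun -np 4.

/* Illustrative MPI sketch: each process reports its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}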
Recent Trends in HPC Hardware
•  Multicore & manycore are here now
•  Multiple CPUs in a single die
•  Better power consumption
•  Tightly coupled and better for multi-threading
•  GPGPU
•  Used as building blocks for much larger systems
•  New Top 500 HPC systems - clusters of multi-core & GPGPU
What are HPC systems
Current top 5 systems
Shared vs Distributed Memory
Shared memory
•  Global memory space, accessible by all processors
•  Processors may have local memory to hold copies of some global memory.
•  Consistency of copies is usually maintained by hardware (cache coherency)
Two typical classes of SM
•  Uniform Memory Access (UMA):
–  Equal access times
–  Identical processors, typically represented by Symmetric Multi-processor machines (SMP) or multicores
•  Non-Uniform Memory Access (NUMA):
–  Memory access times are not uniform; memory access across a link is slower
–  Often made by physically linking two or more SMPs, or by heterogeneous computing
Advantage & Disadvantage
•  Global address space is user-friendly
•  Data sharing between tasks is fast
•  The system may suffer from lack of scalability: adding CPUs increases traffic on the shared memory-to-CPU path. This is especially true for cache-coherent systems
•  The programmer is responsible for correct synchronization (a minimal threads example follows below)
•  Systems larger than an SMP need some special-purpose components
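A minimal POSIX threads sketch of the synchronization point above (an illustrative example; the counter and thread count are assumptions): without the mutex, the shared counter would be updated incorrectly.

/* Illustrative pthreads sketch: the programmer must protect shared data. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* without this lock, the increments race */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* expect 400000 */
    return 0;
}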
Distributed Memory
Multicores
•  Three multicore classifications
–  Homogeneous
–  Heterogeneous
–  Hybrid
Multicores(I)
•  Homogeneous Cores (a main CPU)
–  All cores are identical
–  A traditional MC with few cores
•  Good for a few large (jumbo) tasks
–  Not as many tasks/threads as accelerators or GPUs
–  E.g. Intel Core2Duo, i3, i5, i7, AMD
–  Programming – multithreads/OpenMP
Multicores(II)
•  Homogeneous cores as an accelerator or compute device
–  Need a main CPU system
–  Attached as processing units
–  All cores are identical and many
–  Good for many SIMD tasks/threads
–  E.g. NVIDIA GPGPU, ClearSpeed, FPGA
–  Programming – library calls from a main program or a new language extension, e.g. CUDA
Multicores(III)
•  Heterogeneous Cores
–  All cores are NOT identical
–  All in one die
–  Programming is more difficult
–  See more in PS3 presentation
Multicores(IV)
•  Hybrid System
–  Mix of host cores & accelerator cores
–  A typical host can be a desktop to server system, e.g. Intel or AMD
–  Accelerator – NVIDIA, ATI Stream or FPGA
–  Programming model is more complex
–  Issues – memory bandwidth between host and devices
Introduction to Cell BE (PS3)
Programming
HPCI: High Performance
Computing Initiative
PS3 - awesome HPC system
•  IBM Cell processor
•  Affordable
•  But currently not many tools
Cell BE Architecture
PowerPC Processor Element (PPE)
•  Main processor, 64-bit
•  Also supports Vector/SIMD
•  Runs the OS, manages the SPEs
Synergistic Processor Element (SPE)
•  128-bit RISC, SIMD processor
•  256 KB local storage memory
•  Uses DMA to transfer data between local storage and main memory
Picture ref: http://gamasutra.com/features/20060721/chow_01.shtml
Cell Programming
•  IBM Cell SDK
•  Main process runs on the PPE
•  Threads run on the SPEs
•  PPE-centric programming paradigm
[Diagram: a PPE process spawning SPE threads (SPE thread, SPE thread, SPE thread, ...)]
GPGPU
General Purpose Graphics Processing Unit
Two major players
Parallel Computing on a GPU
•  NVIDIA GPU Computing Architecture
–  Via a HW device interface
–  In laptops, desktops, workstations, servers
•  8-series GPUs deliver 50 to 500 GFLOPS on compiled parallel C applications
•  Tesla T10 / S1070: from 1 to 4 TFLOPS
•  GPU parallelism is growing faster than Moore’s law, more than doubling every year
•  A GPGPU is a GPU that lets users run both graphics and non-graphics applications
[Product photos: Tesla D870, GeForce 8800]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
NVIDIA GeForce 8800 (G80)
•  The eighth generation of NVIDIA’s GeForce graphics cards
•  High performance CUDA-enabled GPGPU
•  128 cores
•  Memory 256-768 MB or 1.5 GB in Tesla
•  High-speed memory bandwidth
•  Supports Scalable Link Interface (SLI)
NVIDIA Tesla™
•  Features
–  GPU computing for HPC
–  No display ports
–  Dedicated to computation
–  For massively multi-threaded computing
–  Supercomputing performance
NVIDIA Tesla Cards
•  C-Series (card) = 1 GPU with 1.5 GB
•  D-Series (deskside unit) = 2 GPUs
•  S-Series (1U server) = 4 GPUs
•  Note: 1 G80 GPU = 128 cores = ~500 GFLOPS
•  1 T10 = 240 cores = ~1 TFLOPS
[Photos: NVIDIA Tesla card and NVIDIA G80]
This slide is from the NVIDIA CUDA tutorial
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
GPGPU Programming with CUDA
•  CUDA (Compute Unified Device Architecture) is an SDK and API that allows a programmer to write C (and Fortran) programs that execute on the GPGPU.
•  Works with the NVIDIA G80 or later and Tesla
•  The GPGPU is viewed as a compute device (a minimal host-side sketch follows below)
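A minimal host-side CUDA sketch of the "GPU as a compute device" view (an illustrative example; the kernel and sizes are assumptions): allocate device memory, copy data in, launch a kernel, copy results back.

// Illustrative CUDA host-side sketch: the GPU is used as an attached compute device.
__global__ void scale(float *x, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

void scale_on_gpu(float *h_x, int n, float s)
{
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));                     // device memory
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);  // copy in
    scale<<<(n + 255) / 256, 256>>>(d_x, n, s);                       // run on the GPGPU
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy out
    cudaFree(d_x);
}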
ATI Stream (1)
ATI 4870
ATI 4870 X2
Architecture of ATI Radeon 4000 series
This slide is from ATI presentation
Introduction to OpenCL
Toward a new approach in computing
Moayad Almohaishi
Introduction to OpenCL
• OpenCL stands for Open Computing Language.
• It is a consortium effort involving Apple, NVIDIA, AMD, and others.
• It is managed by the Khronos Group, which was also responsible for OpenGL.
• It took 6 months to come up with the specification.
OpenCL
• 1. Royalty-free.
• 2. Supports both task-parallel and data-parallel programming modes.
• 3. Works across GPGPUs in a vendor-agnostic way,
• 4. including multi-core CPUs.
• 5. Works on Cell processors.
• 6. Supports handhelds and mobile devices.
• 7. Based on the C language (C99).
OpenCL
• Can query the available devices and build a context over them (a minimal sketch follows below).
• Programmers are able to program more freely for any kind of device.
• Applications are more reusable even if the hardware changes in the future.
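A minimal OpenCL host sketch of the query-then-build-a-context step above (an illustrative example; error handling is mostly omitted and only the first platform is used):

/* Illustrative OpenCL sketch: query available devices and build a context. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[8];
    cl_uint ndev = 0;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);                            /* first platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &ndev); /* query devices  */

    cl_context ctx = clCreateContext(NULL, ndev, devices, NULL, NULL, &err);
    printf("devices found: %u, context created: %s\n",
           ndev, err == CL_SUCCESS ? "yes" : "no");

    clReleaseContext(ctx);
    return 0;
}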
OpenCL Platform Model
CPUs+GPU platforms
Performance of GPGPU
Note: a 30-node cluster of dual 2.8 GHz Xeons, peak performance ~336 GFLOPS
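If that peak figure is taken as roughly 2 floating-point operations per clock per processor (an assumed reconstruction, not stated on the slide): 30 nodes × 2 CPUs × 2.8 GHz × 2 FLOPs/cycle ≈ 336 GFLOPS.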
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Last words!
•  An HPC or supercomputing system is not necessarily a gigantic machine in a big machine room; it is accessible for Thais and may now be sitting next to your desk
•  Computing is a necessity, and fast computing provides a competitive edge, especially in the knowledge economy
•  New trends in HPC include GPGPU and various multicore architectures
•  Prepare ourselves and strengthen our S&T, industry, and business community for this phenomenon (HPC going mainstream) before it is too late
Back up slides
Cancer Gene-mining
•  Unsuccessful on a uni-processor
•  Our approach
–  Novel parallel gene-mining algorithms
–  Input from microarray data
–  Retains accuracy
–  Significant (superlinear) speedup
•  IBM P5 supercomputer (128-node PPC)
[Chart: time to run the algorithm (in secs) vs. number of processors (13, 39, 65, 91), keeping the number of nodes fixed, for cancer datasets (Mesothelioma, Bladder, Breast, Renal, Leukemia, Prostate, Lung, Pancreas, Colorectal, Ovary, Lymphoma, Melanoma); series: OvaMarker based Selection vs. GeneSetMine based Selection]
Drug Delivery
•  By Wu & Palmer, Louisiana Tech U
•  Assisted by HPCI
•  A study of microcapsules for drug delivery
•  Computational Fluid Dynamics methodology to model the generation of droplets or cores (using alginate and oil)
•  Goal: better understanding of the process parameters needed to generate cores of homogeneous size for the manufacturing of microcapsules
Droplet Generation: Experimental Procedure
Droplet Generation: Example Results
Case 1:
  Olive oil: density 930 kg/m³, viscosity 0.03 kg/m-s
  Alginate: density 1012 kg/m³, viscosity 0.2137 kg/m-s
Case 2:
  Phase 1: density 918 kg/m³, viscosity 0.084 kg/m-s
  Phase 2: density 998.2 kg/m³, viscosity 0.001003 kg/m-s
Source from Wu’s thesis