Revolutionising High Performance Computing with Supermicro Solutions Using NVIDIA® Tesla®

NVIDIA® Tesla® GPUs Are Revolutionizing Computing
The high performance computing (HPC)
industry’s need for computation is increasing,
as large and complex computational problems
become commonplace across many industry
segments. Traditional CPU technology, however,
is no longer capable of scaling in performance
sufficiently to address this demand.
The parallel processing capability of the GPU
allows it to divide complex computing tasks
into thousands of smaller tasks that can
be run concurrently. This ability is enabling
computational scientists and researchers to
address some of the world’s most challenging
computational problems up to several orders of
magnitude faster.
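As a generic illustration (not code from this brochure), the classic CUDA SAXPY kernel shows this decomposition in practice: one large array operation is split across thousands of GPU threads, each handling a single element.

```cuda
#include <cstdio>

// Each thread computes one element: one large problem split into
// thousands of tiny tasks that run concurrently on the GPU.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                      // ~1M elements
    size_t bytes = n * sizeof(float);
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", hy[0]);               // 3*1 + 2 = 5
    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}
```

Each of the million element-wise updates is an independent task, which is why the GPU's thousands of cores can attack them all at once.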
Figure: GPUs project up to a 100x performance advantage over CPUs by 2021; conventional CPU computing architecture can no longer support the growing HPC needs. CPU performance (relative to VAX) grew 25% per year until the mid-1980s, 52% per year through the early 2000s, and only about 20% per year since, while GPU performance continues to grow. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th Edition.
Latest GPU SuperServer®, SuperBlade® and
NVIDIA® Maximus™ Certified SuperWorkstation
Super Micro Computer, Inc. (NASDAQ: SMCI), a
global leader in high-performance, high-efficiency
server technology and green computing, will
showcase its latest graphics processing unit (GPU)
enabled X9 server and workstation solutions at
the NVIDIA GPU Technology Conference (GTC) May
14-17 in San Jose, CA. Supermicro's GPU solutions
support Intel® Xeon® E5-2600 processors and feature
greater memory capacity (up to 256GB for servers
and 512GB in workstations), higher performance
I/O and connectivity with PCI-E 3.0, 10GbE and 4x
QDR (40Gb) InfiniBand support (GPU SuperBlade)
as well as innovative energy efficient power saving
technologies. Supermicro X9 solutions also feature
the highest density GPU computing available today.
The non-blocking architecture supports 4 GPUs per
1U in a standard, short-depth 32" rack chassis. The
SuperBlade can fit 30 GPUs in 7U, another industry
first from Supermicro®. Combined with the latest
GPUs based on NVIDIA Kepler architecture, the X9
platform offers industry professionals one of the
most powerful, accelerated and ‘green’ computing
solutions available on the market.
“Supermicro® is transforming the high performance
computing landscape with our advanced, high-density GPU server and workstation platforms," said
Charles Liang, President and CEO of Supermicro®.
“At GTC, we are showcasing our new generation
X9 SuperServer®, SuperBlade™ and latest NVIDIA
Maximus certified SuperWorkstation systems which
deliver groundbreaking performance, reliability,
scalability and efficiency. Our expanding lines
of GPU-based computing solutions empower
scientists, engineers, designers and many other
professionals with the most cost-effective access to
supercomputing performance.”
Accelerating Research, Scientific, Engineering, Computational Finance
and Design Applications
• 1U SuperServer® 1027GR-TQF: up to 4x GPUs
• 2U SuperServer® 2027GR-TRF: up to 6x GPUs
• 7U SuperBlade® SBI-7127RG: up to 20x GPUs
• SuperWorkstation 7047GR-TRF: 4x NVIDIA Tesla GPUs + 1x NVIDIA Quadro GPU
These select GPU enabled servers and workstations
are a sampling of Supermicro’s vast array of
GPU solutions. Visit us at the San Jose McEnery
Convention Center, May 14-17 in GTC Booth #75
to see Supermicro’s latest GPU products. For a
complete look at Supermicro’s extensive line of high
performance, high efficiency GPU solutions, visit
www.supermicro.com/GPU or go to
www.supermicro.com/SuperWorkstations to
keep up with Supermicro’s evolving line of NVIDIA
Maximus powered SuperWorkstations.
Hybrid Computing Solutions from Supermicro®
NVIDIA® Tesla® / Hybrid Computing Solutions 2010

SuperWorkstations™
Server Grade Performance for Multimedia,
Engineering, and Scientific Applications
Performance. Efficiency.
Expandability. Reliability.
• Optimized Solutions:
Video Editing, Digital Content Creation, MCAD, CAE,
Financial, SW Development, Scientific, Oil and Gas
• Whisper-quiet 21dB acoustic models
• Up to Redundant Platinum Level (95%+) Digital
Power Supplies
• Best Value for your IT Investment
• Up to 512 GB DDR3 memory and 8 Hot-Swap HDDs
• Up to 5 GPUs supported with PCI-E 3.0
• Server-Grade Design for 24x7 Operation
The 7047GR-TRF is Supermicro’s latest high-end,
enterprise-class X9 SuperWorkstation, with NVIDIA
Maximus certification. This system accelerates
design and visualization tasks with an NVIDIA Quadro
GPU, while providing dedicated processing power
for simultaneous compute intensive tasks such as
simulation and rendering with up to four NVIDIA
Tesla C2075 GPUs. The upcoming 7047GR-TPRF
SuperWorkstation supports passively cooled GPUs
making it ideal for high performance trading (HPT)
applications. X9 systems feature dual Intel® Xeon®
E5-2600 family processors, maximized memory and
non-blocking native PCI-E 3.0 configurations along
with redundant Platinum level high-efficiency (94%+)
power supplies.
www.supermicro.com/SuperWorkstations
7047GR-TRF
• NVIDIA® Maximus™ Certified supporting NVIDIA®
Quadro® and up to 4 NVIDIA® Tesla® for real-time
3D design, visualization, simulation and accelerated
rendering
• GPU Server for Mission-critical applications, enterprise
server, large database, e-business, on-line transaction
processing, oil & gas, medical applications
• Dual Intel® Xeon® processor E5-2600 family; Socket R
(LGA 2011)
• 8 Hot-swap 3.5" SATA HDD Bays, 3x 5.25" peripheral drive bays, 1x 3.5" fixed drive bay
• 16 DIMMs support up to 512GB DDR3 1600MHz reg.
ECC memory
• 4 (x16) PCI-E 3.0 (support 4 double width GPU cards),
2 (x8) PCI-E 3.0 (1 in x16), and 1 (x4) PCI-E 2.0 (in x8)
slot
• I/O ports: 2 GbE, 1 Video, 1 COM/Serial, 9 USB 2.0
• System management: Built-in Server management
tool (IPMI 2.0, KVM/media over LAN) with dedicated
LAN port
• 4 Hot-swap PWM cooling fans and 2 Hot-swap rear
fans
• 1620W Redundant Platinum Level Power Supplies
High Performance GPU servers
X9 SuperServers provide a wide range
of configurations targeting high performance
computing (HPC) applications. Systems include the
1027GR-TQF offering up to 4 double-width GPUs
in 1U for maximum compute density in a compact
32” short depth, standard rack mount format. The
2U 2027GR-TRF supports up to 6 GPUs and is ideal
for scalable, high performance computing clusters
in scientific research fields with a 2027GR-TRFT
model available supporting dual-port 10GBase-T for
increased bandwidth and reduced latency. The GPU
SuperBlade SBI-7127RG packs the industry’s highest
compute density of 30 GPUs in 7U delivering ultimate
processing performance for applications such as oil
and gas exploration.
2027GR-TRFT / 2027GR-TRF (-FM475/-FM409)
• GPU Server, Mission-critical app., enterprise server,
large database, e-business, on-line transaction
processing, oil & gas, medical app.
• Dual Intel® Xeon® processor E5-2600 family; Socket R
(LGA 2011)
• 10 Hot-swap 2.5" SATA HDD Bays (4 SATA2, 6 SATA3)
• 8 DIMMs support up to 256GB DDR3 1600MHz reg. ECC
memory
• 4 (x16) PCI-E 3.0 (support 4 double width GPU cards),
1 (x8) PCI-E 3.0 (in x16), and 1 (x4) PCI-E 2.0 (in x16) slots
• I/O ports: 2 GbE/10GBase-T (TRFT), 1 Video, 1 COM/
Serial, 2 USB 2.0
• 5 heavy duty fans w/ optimal fan speed control
• 1800W Redundant Platinum Level Power Supplies
• "-FM475/409": 4x NVIDIA M2075/M2090 GPU cards integrated
Brains, Brains and More Brains!!!
Professor Wu Feng of Virginia Tech's Synergy Laboratory is the main brain behind VT's latest and greatest supercomputer, HokieSpeed. HokieSpeed is a Supermicro supercluster of SuperServer® 2026GT-TRF systems with thousands of CPU/GPU cores or "brains". In Nov. 2011, HokieSpeed ranked 11th for energy efficiency on the Green500 List and 96th on the world's fastest supercomputing Top500 List. Dr. Feng's Supermicro-powered HokieSpeed boasts a single-precision peak of 455 teraflops, or 455 trillion operations per second, and a double-precision peak of 240 teraflops, or 240 trillion operations per second.
Watch video: SuperServers in action: Virginia Tech HokieSpeed Supercomputer
The Foundation of a Flexible and Efficient HPC & Data Center Deployment
High Performance
GPU servers
SuperServers
1017GR-TF (-FM275/-FM209)
• GPU Server, Mission-critical app., enterprise server,
oil & gas, financial, 3D rendering, chemistry, HPC
• Single Intel® Xeon® processor E5-2600 family; Socket R
(LGA 2011)
• 6 Hot-swap 2.5" SATA HDD Bays
• 8 DIMMs support up to 256GB DDR3 1600MHz reg. ECC
memory
• 2 (x16) PCI-E 3.0 (for 2 GPU cards) and 1 (x8) PCI-E 3.0 slot
• I/O ports: 2 GbE, 1 Video, 2 USB 2.0
• 8 counter rotating fans w/ optimal fan speed control
• 1400W Platinum Level Power Supply w/ Digital Switching
• "-FM175/109": 1x NVIDIA M2075/M2090 GPU card integrated
• "-FM275/209": 2x NVIDIA M2075/M2090 GPU cards integrated
5017GR-TF (-FM275/-FM209)
• GPU Server, Mission-critical app., enterprise server,
oil & gas, financial, 3D rendering, chemistry, HPC
• Single Intel® Xeon® processor E5-2600 family; Socket R
(LGA 2011)
• 3 Hot-swap 3.5" SATA HDD Bays
• 8 DIMMs support up to 256GB DDR3 1600MHz reg. ECC
memory
• 2 (x16) PCI-E 3.0 (for 2 GPU cards) and 1 (x8) PCI-E 3.0
(for IB card)
• I/O ports: 2 GbE, 1 Video, 2 USB 2.0
• 8 counter rotating fans w/ optimal fan speed control
• 1400W Platinum Level Power Supply w/ Digital Switching
• "-FM175/109": 1x NVIDIA M2075/M2090 GPU card integrated
• "-FM275/209": 2x NVIDIA M2075/M2090 GPU cards integrated
1027GR-TRFT/1027GR-TRF (-FM375/-FM309) /
1027GR-TSF
• GPU Server, Mission-critical app., enterprise server,
oil & gas, financial, 3D rendering, chemistry, HPC
• Dual Intel® Xeon® processor E5-2600 family; Socket R
(LGA 2011)
• 4 Hot-swap 2.5" SATA2/3 HDD Bays
• 8 DIMMs support up to 256GB DDR3 1600MHz reg. ECC
memory
• 3 (x16) PCI-E 3.0 and 1 (x8) PCI-E 3.0 (in x16) slots
• I/O ports: 2 GbE/10GBase-T (TRFT), 1 Video, 1 COM/
Serial, 2 USB 2.0
• 10 heavy duty fans w/ optimal fan speed control
• 1800W Redundant Platinum Level Power Supplies (TRF)
• 1800W Platinum Level Power Supply (TSF)
• "-FM375/309": 3x NVIDIA M2075/M2090 GPU cards integrated
GPU SuperBlade® Solutions
Supermicro® offers GPU blade solutions optimized for HPC applications, low-noise blade solutions for offices and SMBs, as well as personal supercomputing applications. With acoustically optimized thermal and cooling technologies, it achieves < 50dB with 10 DP server blades and features 100-240VAC, Platinum Level high-efficiency (94%+), N+1 redundant power supplies.
www.supermicro.com/SuperBlade
Space Optimization
When housed within a 19” EIA-310D industry-standard 42U rack, SuperBlade® servers reduce server footprint in
the datacenter. Power, cooling and networking devices are removed from each individual server and positioned
to the rear of the chassis thereby reducing the required amount of space while increasing flexibility to meet
changing business demands. Up to twenty DP blade nodes can be installed in a 7U chassis. Compared to the
rack space required by twenty individual 1U servers, the SuperBlade® provides over 65% space savings.
SBI-7127RG
• Up to 120 GPUs + 120 CPUs per 42U Rack!
• Up to 2 Tesla M2090/M2075/M2070Q/M2070/M2050 GPUs
• Up to 2 Intel® Xeon® E5-2600 series processors
• Up to 256GB DDR3 1600/1333/1066 ECC DIMM
• 1 internal SATA Disk-On-Module
• 1 USB flash drive
• 2 PCI-E 2.0 x16 Full-height, Full-length Expansion Slots
• Onboard BMC for IPMI 2.0 support - KVM over IP, remote Virtual Media
• 4x QDR (40Gb) InfiniBand or 10Gb Ethernet supported via optional mezzanine card
• Dual-port Gigabit Ethernet NIC
SBI-7126TG
• Up to 20 GPUs + 20 CPUs per 7U!
• Up to 2 Tesla M2090/M2070Q/M2070/M2050 GPUs
• Up to 2 Intel® Xeon® 5600/5500 series processors
• Up to 96GB DDR3 1333/1066 ECC DIMM
• 1 internal SATA Disk-On-Module
• 1 USB flash drive
• 2 PCI-E 2.0 x16 or 2 PCI-E 2.0 x8 slots (Full-height / maximum length 9.75")
• Onboard BMC for IPMI 2.0 support - KVM over IP, remote Virtual Media
• Dual IOH per blade
• Dual 40Gb InfiniBand or 10Gb Ethernet supported via optional mezzanine card
• Dual-port Gigabit Ethernet NIC
• Redundant GBX connectors
Turnkey Preconfigured
GPU Clusters
Ready-to-use solutions
for science and research
ACCESS THE POWER OF GPU COMPUTING WITH OFF-THE-SHELF,
READY-TO-DEPLOY TESLA GPU CLUSTERS
These preconfigured solutions from Supermicro and
NVIDIA provide a powerful tool for researchers and
scientists to advance their science through faster
simulations. GPU Test Drive is the easiest way to start
using GPUs that offer supercomputing scale HPC
performance at substantially lower costs and power.
Experience a significant performance increase in a wide
range of applications from various scientific domains.
Supercharge your research with a preconfigured Tesla GPU cluster today!
SRS-14URKS-GPUS-11 (10 TFlops)
GPU Nodes: 4
CPU: 2x Westmere X5650 2.66GHz
Memory: 24 GB per node
Default GPU/Node: 2x M2090
Network Cards/Node: 1x InfiniBand
Storage/Node: 2x 500GB HDD (1TB/Node)

SRS-14URKS-GPUS-12 (20 TFlops)
GPU Nodes: 8 + 1 head node
CPU: 2x Westmere X5650 2.66GHz
Memory: 48 GB per node
Default GPU/Node: 2x M2090
Network Cards/Node: 1x InfiniBand
Storage/Node: 2x 500GB HDD (1TB/Node)

SRS-42URKS-GPUS-13 (42 TFlops)
GPU Nodes: 16 + 1 head node
CPU: 2x Westmere X5650 2.66GHz
Memory: 48 GB per node
Default GPU/Node: 2x M2090
Network Cards/Node: 1x InfiniBand
Storage/Node: 2x 1TB HDD (2TB/Node)
Tesla K10 GPU Computing Accelerator ― Optimized for single-precision applications, the Tesla K10 is a throughput monster based on the ultra-efficient Kepler architecture. The accelerator board features two Kepler GPUs and delivers up to 2x the performance for single-precision applications compared to the previous-generation Fermi-based Tesla M2090 in the same power envelope. With an aggregate peak of 4.58 teraflops single precision and 320 gigabytes per second of memory bandwidth across both GPUs, the Tesla K10 is optimized for computations in seismic, signal and image processing, and video analytics.
Supermicro Is Committed to Supporting NVIDIA® Tesla® Kepler GPU Accelerators1
Tesla K20 GPU Computing Accelerator ― Designed for double-precision applications and the broader supercomputing market, the Tesla K20 delivers 3x the double-precision performance of the previous-generation Fermi-based Tesla M2090, in the same power envelope. The Tesla K20 features a single Kepler GK110 GPU that includes the Dynamic Parallelism and Hyper-Q features. With more than one teraflop of peak double-precision performance, the Tesla K20 is ideal for a wide range of high performance computing workloads including climate and weather modeling, CFD, CAE, computational physics, biochemistry simulations, and computational finance.
TECHNICAL SPECIFICATIONS (Tesla K10² / Tesla K20)
• Peak double precision floating point performance (board): 0.19 teraflops / to be announced
• Peak single precision floating point performance (board): 4.58 teraflops / to be announced
• Number of GPUs: 2x GK104 / 1x GK110
• CUDA cores: 2x 1536 / to be announced
• Memory size per board (GDDR5): 8 GB / to be announced
• Memory bandwidth for board (ECC off)³: 320 GBytes/sec / to be announced
• GPU Computing Applications: seismic, image and signal processing, video analytics / CFD, CAE, financial computing, computational chemistry and physics, data analytics, satellite imaging, weather modeling
• Architecture Features: SMX / SMX, Dynamic Parallelism, Hyper-Q
• System: servers only / servers and workstations
• Available: May 2012 / Q4 2012
1 Products and availability are subject to confirmation.
2 Tesla K10 specifications are shown as aggregate of two GPUs.
3 With ECC on, 12.5% of the GPU memory is used for ECC bits. So, for example, 6 GB total memory yields 5.25 GB of user
available memory with ECC on.
NVIDIA® Tesla® Kepler
GPU Computing Accelerators
Tesla® Kepler ― The world's fastest and most power-efficient accelerator
Fastest, Most Efficient HPC Architecture
With the launch of the Fermi GPU in 2009, NVIDIA ushered in a new era in the high performance computing (HPC) industry based on a hybrid computing model where CPUs and GPUs work together to solve computationally intensive workloads. In just a couple of years, NVIDIA Fermi GPUs came to power some of the fastest supercomputers in the world as well as tens of thousands of research clusters globally. Now, with the new Tesla Kepler, NVIDIA raises the bar for the HPC industry yet again.
Comprising 7.1 billion transistors, the Tesla Kepler is an
challenges in HPC. Kepler is designed from the ground
up to maximize computational performance with superior
power efficiency. The architecture has innovations that
make hybrid computing dramatically easier, applicable to
a broader set of applications, and more accessible.
Tesla Kepler is a computational workhorse with
teraflops of integer, single precision, and double
precision performance and the highest memory
bandwidth. The first GK110 based product will be the
Tesla K20 GPU computing accelerator.
Let us quickly summarize three of the most important features in the Tesla Kepler: SMX, Dynamic Parallelism, and Hyper-Q. For further details on additional architectural features, please refer to the Kepler GK110 whitepaper.

Figure 1: SMX: 192 CUDA cores, 32 Special Function Units (SFU), and 32 Load/Store units (LD/ST). Compared to the Fermi SM with its 32 cores, the Kepler SMX delivers 3x performance per watt.
SMX — Next Generation
Streaming Multiprocessor
At the heart of the Tesla Kepler is the new SMX unit, which comprises several architectural innovations that make it not only the most powerful Streaming Multiprocessor (SM) we've ever built but also the most programmable and power-efficient.
Dynamic Parallelism —
Creating Work on-the-Fly
One of the overarching goals in designing the Kepler GK110 architecture was to make it easier for developers to take advantage of the immense parallel processing capability of the GPU.
To this end, the new Dynamic Parallelism feature enables
the Tesla Kepler to dynamically spawn new threads
by adapting to the data without going back to the host
CPU. This effectively allows more of a program to be run
directly on the GPU, as kernels now have the ability to
independently launch additional workloads as needed.
Any kernel can launch another kernel and can create the
necessary streams, events, and dependencies needed
to process additional work without the need for host
CPU interaction. This simplified programming model is
easier to create, optimize, and maintain. It also creates
a programmer friendly environment by maintaining the
same syntax for GPU launched workloads as traditional
CPU kernel launches.
Dynamic Parallelism broadens what applications can
now accomplish with GPUs in various disciplines.
Applications can launch small and medium sized
parallel workloads dynamically where it was too
expensive to do so previously.
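As a hedged sketch of what a device-side launch looks like in CUDA C (this is illustrative code, not from NVIDIA's whitepaper; it requires a GK110-class GPU and compilation with `-arch=sm_35 -rdc=true`, and the kernel names are invented):

```cuda
#include <cstdio>

__global__ void child(int parent_block) {
    // Runs entirely on the GPU, launched by the parent kernel below.
    printf("child of block %d, thread %d\n", parent_block, threadIdx.x);
}

__global__ void parent(const int *work_sizes) {
    // With Dynamic Parallelism, a kernel can launch more work itself,
    // sized from data it has just examined: no round trip to the CPU.
    int n = work_sizes[blockIdx.x];
    if (threadIdx.x == 0 && n > 0) {
        child<<<1, n>>>(blockIdx.x);   // nested, device-side launch
    }
}

int main() {
    // Build with: nvcc -arch=sm_35 -rdc=true nested.cu
    int h_sizes[4] = {2, 0, 4, 1};
    int *d_sizes;
    cudaMalloc(&d_sizes, sizeof(h_sizes));
    cudaMemcpy(d_sizes, h_sizes, sizeof(h_sizes), cudaMemcpyHostToDevice);
    parent<<<4, 32>>>(d_sizes);        // only one CPU-side launch
    cudaDeviceSynchronize();
    cudaFree(d_sizes);
    return 0;
}
```

Note that the nested launch uses the same `<<<grid, block>>>` syntax as a host-side launch, which is the "same syntax for GPU launched workloads" point made above.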
Figure 2: Without Dynamic Parallelism, the CPU launches every kernel onto the GPU. With the new feature, Tesla Kepler can launch nested kernels, eliminating the need to communicate with the CPU.
Hyper-Q — Maximizing the GPU Resources
Hyper-Q enables multiple CPU cores to launch work
on a single GPU simultaneously, thereby dramatically
increasing GPU utilization and slashing CPU idle times.
This feature increases the total number of connections between the host and the Tesla Kepler by allowing 32 simultaneous, hardware-managed connections, compared to the single connection available with Fermi.
Hyper-Q is a flexible solution that allows connections
for both CUDA streams and Message Passing Interface
(MPI) processes, or even threads from within a process.
Existing applications that were previously limited by false
dependencies can see up to a 32x performance increase
without changing any existing code.
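The "existing code" in question is ordinary CUDA stream usage. As an illustrative sketch (generic CUDA, not brochure code), work submitted on independent streams like this serializes through Fermi's single hardware queue but runs concurrently on Kepler thanks to Hyper-Q's 32 connections:

```cuda
__global__ void busywork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int kStreams = 8, n = 1 << 16;
    cudaStream_t streams[kStreams];
    float *buf[kStreams];

    // Independent streams, e.g. one per CPU core or MPI rank. On Fermi
    // these funnel through one hardware work queue and may serialize on
    // false dependencies; with Hyper-Q each stream gets its own
    // hardware-managed connection and they truly overlap.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        busywork<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}
```

The same binary behaves differently on the two architectures, which is why no source changes are needed to see the benefit.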
Hyper-Q offers significant benefits for use in MPI-based parallel computer systems. Legacy MPI-based
algorithms were often created to run on multi-core
CPU-based systems. Because the workload that could
be efficiently handled by CPU-based systems is generally
smaller than that available using GPUs, the amount of
work passed in each MPI process is generally insufficient
to fully occupy the GPU processor.
While it has always been possible to issue multiple
MPI processes to concurrently run on the GPU,
these processes could become bottlenecked by false
dependencies, forcing the GPU to operate below
peak efficiency. Hyper-Q removes false dependency
bottlenecks and dramatically increases the speed at which
MPI processes can be moved from the system CPU(s) to
the GPU for processing.
Hyper-Q promises to be a performance boost for MPI
applications.
Conclusion
Tesla Kepler is engineered to deliver ground-breaking
performance with superior power efficiency while
making GPUs easier than ever to use. SMX, Dynamic
Parallelism, and Hyper-Q are three important
innovations in the Tesla Kepler to bring these benefits
to reality for our customers. For further details on
additional architectural features, please refer to the
Kepler GK110 whitepaper at
www.nvidia.com/kepler
Figure 3: Hyper-Q allows all streams to run concurrently using a separate work queue. In the Fermi model,
concurrency was limited due to intra-stream dependencies caused by the single hardware work queue.
To learn more about NVIDIA Tesla, go to www.nvidia.eu/tesla
NVIDIA® Tesla® Kepler
Programming Environment for GPU Computing: NVIDIA's CUDA
CUDA architecture has the industry’s most robust
language and API support for GPU computing
developers, including C, C++, OpenCL, DirectCompute,
and Fortran. NVIDIA Parallel Nsight, a fully integrated
development environment for Microsoft Visual Studio
is also available. Used by more than six million
developers worldwide, Visual Studio is one of the
world’s most popular development environments
for Windows-based applications and services.
Adding functionality specifically for GPU computing
developers, Parallel Nsight makes the power of the
GPU more accessible than ever before. The latest version, CUDA 4.0, adds a host of exciting new features to make parallel computing easier, among them the ability to relieve bus traffic by enabling direct GPU-to-GPU communication.
The CUDA ecosystem spans languages & APIs, libraries, integrated development environments, mathematical packages, tools & partners, research & education, and consultants, training & certification, across all major platforms.
Order personalized CUDA education course at:
www.parallel-computing.pro
Accelerate Your Code Easily
with OpenACC Directives
Get 2x Speed-up in 4 Weeks or Less
Accelerate your code with directives and tap into the
hundreds of computing cores in GPUs. With directives,
you simply insert compiler hints into your code and the
compiler will automatically map compute-intensive
portions of your code to the GPU.
By starting with a free, 30-day trial of PGI directives
today, you are working on the technology that is the
foundation of the OpenACC directives standard.
OpenACC is:
• Easy: simply insert hints in your codebase
• Open: run the single codebase on either the CPU or
GPU
• Powerful: tap into the power of GPUs within hours
OpenACC directives
The OpenACC Application Program Interface describes
a collection of compiler directives to specify loops and
regions of code in standard C, C++ and Fortran to be
offloaded from a host CPU to an attached accelerator,
providing portability across operating systems, host
CPUs and accelerators.
The directives and programming model defined in this document allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator, manage data or program transfers between the host and accelerator, or initiate accelerator startup and shutdown.
www.nvidia.eu/openacc
Watch video: PGI Accelerator, Technical Presentation at SC11
History of GPU Computing
Graphics chips started as fixed-function graphics pipelines. Over the years, these graphics chips became increasingly programmable.

What Is GPU Computing?
GPU computing is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.
GPU ACCELERATION IN LIFE SCIENCES
Tesla® Bio Workbench
The NVIDIA Tesla Bio Workbench enables biophysicists
and computational chemists to push the boundaries
of life sciences research. It turns a standard PC into a
“computational laboratory” capable of running complex
bioscience codes, in fields such as drug discovery and
DNA sequencing, 10 to 20 times faster through
the use of NVIDIA Tesla GPUs.
Figure: Relative performance scale (normalized ns/day): 50% faster with GPUs.
It consists of bioscience applications; a community site
for downloading, discussing, and viewing the results of
these applications; and GPU-based platforms.
Complex molecular simulations that had been only
possible using supercomputing resources can now
be run on an individual workstation, optimizing the
scientific workflow and accelerating the pace of
research. These simulations can also be scaled up
to GPU-based clusters of servers to simulate large
molecules and systems that would have otherwise
required a supercomputer.
Applications that are accelerated on GPUs include:
• Molecular Dynamics & Quantum Chemistry: AMBER, GROMACS, HOOMD, LAMMPS, NAMD, TeraChem (Quantum Chemistry), VMD
• Bioinformatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDASW++ (Smith-Waterman), GPU-HMMER, MUMmerGPU

JAC NVE Benchmark: (left) 192 quad-core CPUs, simulation run on the Kraken supercomputer; (right) simulation with 2 Intel Xeon quad-core CPUs and 4 Tesla M2090 GPUs.
For more information, visit:
www.nvidia.co.uk/bio_workbench
AMBER
Researchers today are solving the world's most challenging and important problems. From cancer research to drugs for AIDS, computational research is bottlenecked by simulation cycles per day. More simulations mean faster time to discovery. To tackle these difficult challenges, researchers frequently rely on national supercomputers for computer simulations of their models.
AMBER node processing comparison: 4 Tesla M2090 GPUs + 2 CPUs deliver 69 ns/day, versus 46 ns/day for 192 quad-core CPUs.
GPUs offer every researcher supercomputer-like performance in their own office.
Benchmarks have shown four Tesla M2090
GPUs significantly outperforming the existing
world record on CPU-only supercomputers.
RECOMMENDED HARDWARE CONFIGURATION
Workstation:
• 4x Tesla C2070
• Dual-socket quad-core CPU
• 24 GB System Memory
Server:
• Up to 8x Tesla M2090s in cluster
• Dual-socket quad-core CPU per node
• 128 GB System Memory
GROMACS
GROMACS is a molecular dynamics package designed primarily for simulation of biochemical molecules like proteins, lipids, and nucleic acids that have a lot of complicated bonded interactions. The CUDA port of GROMACS enabling GPU acceleration supports Particle-Mesh Ewald (PME), arbitrary forms of non-bonded interactions, and implicit solvent Generalized Born methods.

Figure 4: Absolute performance of GROMACS (ns/day) running CUDA- and SSE-accelerated non-bonded kernels with PME on 3-12 CPU cores and 1-4 GPUs, reaching up to 165.8 ns/day with 12 threads and 4x C2075. Simulations with cubic and truncated dodecahedron cells, pressure coupling, as well as virtual interaction sites enabling 5 fs time steps are shown.
AMBER and NAMD 5x Faster
Run your molecular dynamics simulation 5x faster. Take a free test drive on a remotely hosted cluster loaded with the latest GPU-accelerated applications such as AMBER and NAMD.

GPU Test Drive
Take a free and easy test drive today and accelerate your results. Simply log on and run your application as usual; no GPU programming expertise required. Try it now and see how you can reduce simulation time from days to hours. Try the Tesla GPU Test Drive today!
Step 1
Register
Step 2
Upload your data
Step 3
Run your application and get faster results
AMBER CPU vs. GPU Performance (ns/day, higher is better): Cellulose NPT, up to 3.5x faster with CPU + GPU across 1, 2, and 4 nodes.
Benchmarks for AMBER were generated with the following config: 1 node includes dual Tesla M2090 GPUs (6GB), dual Intel 6-core X5670 (2.93 GHz), AMBER 11 + Bugfix17, CUDA 4.0, ECC off.

NAMD CPU vs. GPU Performance (ns/day, higher is better): STMV, up to 6.5x faster with CPU + GPU across 1, 2, and 4 nodes.
Benchmarks for NAMD were generated with the following config: 1 node includes dual Tesla M2090 GPUs (6GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on.
Learn more and order today at: www.nvidia.eu/cluster
GPU ACCELERATED ENGINEERING
ansys: supercomputing from your workstation
with NVIDIA Tesla GPU
"A new feature in ANSYS Mechanical leverages graphics processing units to significantly lower solution times for large analysis problem sizes."
By Jeff Beisheim, Senior Software Developer, ANSYS, Inc.
With ANSYS® Mechanical™ 14.0 and NVIDIA®
Professional GPUs, you can:
• Improve product quality with 2x more design
simulations
• Accelerate time-to-market by reducing engineering
cycles
• Develop high fidelity models with practical solution
times
How much more could you accomplish if simulation
times could be reduced from one day to just a
few hours? As an engineer, you depend on ANSYS
Mechanical to design high quality products efficiently.
To get the most out of ANSYS Mechanical 14.0, simply
upgrade your Quadro GPU or add a Tesla GPU to your
workstation, or configure a server with Tesla GPUs,
and instantly unlock the highest levels of ANSYS
simulation performance.
Future Directions
As GPU computing trends evolve, ANSYS will continue
to enhance its offerings as necessary for a variety
of simulation products. Certainly, performance
improvements will continue as GPUs become
computationally more powerful and extend their
functionality to other areas of ANSYS software.
20x Faster Simulations with GPUs
Design Superior Products with CST Microwave Studio

What can product engineers achieve if a single simulation run-time is reduced from 48 hours to 3 hours? CST Microwave Studio is one of the most widely used electromagnetic simulation packages, and some of the largest customers in the world today are leveraging GPUs to introduce their products to market faster and with more confidence in the fidelity of the product design.

Recommended Tesla Configurations
Workstation:
• 4x Tesla C2070
• Dual-socket quad-core CPU
• 48 GB System Memory
Server:
• 4x Tesla M2090
• Dual-socket quad-core CPU
• 48 GB System Memory

Figure: Relative performance vs. CPU: 9x faster with one Tesla GPU (2x CPU + 1x C2070), 23x faster with Tesla GPUs (2x CPU + 4x C2070).
Benchmark: BGA models, 2M to 128M mesh models, CST MWS transient solver. CPU: 2x Intel Xeon X5620. Single-GPU run only on 50M mesh model.
GPU ACCELERATED
ENGINEERING
GPU acceleration in Computer aided Engineering
MSC Nastran, Marc
5X performance boost with single GPU over single core,
>1.5X with 2 GPUs over 8 core
Sol101, 3.4M DOF
• Nastran direct equation solver is GPU accelerated
– Real, symmetric sparse direct factorization
– Handles very large fronts with minimal use of
pinned host memory
– Impacts SOL101 and SOL400 runs that are dominated
by MSCLDL factorization times
– More of Nastran (SOL108, SOL111) will be moved
to GPU in stages
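The kernel being offloaded here is an ordinary real, symmetric LDL^T factorization. As a rough illustration only (not MSC's implementation), the same operation can be run on the CPU with SciPy; the matrix size and values below are invented:

```python
import numpy as np
from scipy.linalg import ldl

# Small real, symmetric test matrix standing in for a Nastran
# stiffness matrix (illustrative size only).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = A + A.T  # enforce symmetry

# LDL^T factorization: the operation MSCLDL performs, and the part
# offloaded to the GPU when fronts grow large.
L, D, perm = ldl(A)

# The factorization reconstructs the original matrix.
assert np.allclose(L @ D @ L.T, A)
```

On the GPU build, the dense frontal kernels inside this factorization are what run on the Tesla hardware; the sparse bookkeeping stays on the host.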
• Support of multi-GPU, for both Linux and Windows
– With DMP > 1, multiple fronts are factorized
concurrently on multiple GPUs; 1 GPU per matrix
domain
– NVIDIA GPUs include Tesla 20-series and Quadro
6000
– CUDA 4.0 and above
Chart: speed-up on 3.4M DOF, fs0* (total and solver times), comparing 1 core (baseline), 1 GPU + 1 core, 4 cores (SMP), 8 cores (DMP=2), and 2 GPUs + 2 cores (DMP=2); a single GPU delivers roughly 4.6x to 5.6x over a single core, and 2 GPUs roughly 1.6x to 1.8x over 8 cores.
* fs0 = NVIDIA PSG cluster node: 2.2 TB SATA 5-way striped RAID, Linux, 96 GB memory, Tesla C2050, Nehalem 2.27 GHz.
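The DMP scheduling pattern described above (independent frontal matrices factorized concurrently, one device per matrix domain) can be sketched with CPU threads and NumPy's Cholesky standing in for the CUDA kernels; the domain count and front sizes are invented for illustration:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def make_spd_front(n, seed):
    """Build a symmetric positive-definite test front
    (sizes and values are illustrative, not Nastran data)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

def factorize_front(front):
    """Dense factorization of one frontal matrix; in the GPU build
    this kernel runs on the Tesla assigned to the matrix domain."""
    return np.linalg.cholesky(front)

# DMP=2: two matrix domains, each front factorized concurrently by
# its own worker ("1 GPU per matrix domain" in the real scheme).
fronts = [make_spd_front(8, seed) for seed in range(2)]
with ThreadPoolExecutor(max_workers=2) as pool:
    factors = list(pool.map(factorize_front, fronts))

for front, chol in zip(fronts, factors):
    assert np.allclose(chol @ chol.T, front)
```

The design point is that the fronts are independent, so no synchronization is needed between workers until assembly.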
SIMULIA Abaqus / Standard
Reduce Engineering Simulation Times in Half
With GPUs, engineers can run Abaqus simulations
twice as fast. A leading car manufacturer and NVIDIA
customer reduced the simulation time of an engine
model from 90 minutes to 44 minutes with GPUs.
Faster simulations let designers evaluate more
scenarios and achieve, for example, a more
fuel-efficient engine.
As products become more complex, innovating with
confidence has grown increasingly difficult for
product engineers. Engineers rely on Abaqus to
understand the behavior of complex assemblies and
of new materials.
Chart: Abaqus 6.12 multi-GPU execution (24 cores, 2 hosts, 48 GB memory per host). Speed-up vs. CPU-only for 1 GPU/host and 2 GPUs/host across models of 1.5 to 3.8 million equations, reaching up to about 1.9x.
Visualize and simulate
at the same time on a single system
Nvidia® Maximus™ Technology
Engineers, designers, and content creation
professionals are constantly being challenged to
find new ways to explore and validate more ideas —
faster. This often involves creating content with both
visual design and physical simulation demands: for
example, designing a car or creating a digital film
character, then understanding how air flows over the
car or how the character's clothing moves in an action
scene.
Unfortunately, the design and simulation processes
have often been disjointed, occurring on different
systems or at different times.
Introducing NVIDIA Maximus
NVIDIA Maximus-powered workstations solve
this challenge by combining the visualization and
interactive design capability of NVIDIA Quadro
GPUs and the high-performance computing power
of NVIDIA Tesla GPUs into a single workstation.
Tesla companion processors automatically perform
the heavy lifting of photorealistic rendering or
engineering simulation computation. This frees up
CPU resources for the work they‘re best suited for –
I/O, running the operating system and multi-tasking
– and also allows the Quadro GPU to be dedicated to
powering rich, full-performance, interactive design.
Reinventing the workflow
With Maximus, engineers, designers and content
creation professionals can continue to remain
productive and work with maximum complexity in
real-time.
Traditional workstation: Design, then Simulate (CPU), one after the other.
NVIDIA® Maximus™ workstation: Design + Simulate (GPU) run in parallel, so Design 2, Design 3 and Design 4 each proceed while the previous iteration's simulation runs on the GPU.
Faster Iterations = Faster Time to Market
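The overlap the workflow above describes can be sketched as a background-worker pattern; `simulate` here is a hypothetical stand-in, not NVIDIA's API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulate(design):
    """Stand-in for a long-running solve; on a Maximus system this
    would execute on the Tesla GPU while the Quadro drives the UI."""
    time.sleep(0.05)  # pretend this is the heavy computation
    return f"results for {design}"

# Hand each iteration's simulation to a background worker and keep
# designing while it runs: the overlap the diagram illustrates.
with ThreadPoolExecutor(max_workers=1) as worker:
    pending = worker.submit(simulate, "Design 1")
    current_design = "Design 2"   # interactive design work continues
    results = pending.result()    # harvest the solve when it is done

assert results == "results for Design 1"
```

The same shape applies whether the worker is a thread, a process, or a companion processor: the design loop never blocks on the solve.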
Watch video*:
NVIDIA Maximus Technology Helps Drive
the Silver Arrow Mercedes-Benz Concept Car
* Use your smartphone or tablet with QR reader software to read the
QR code.
With Maximus technology, you can
perform simultaneous structural
or fluid dynamics analysis with
applications such as ANSYS while
running design applications like
SolidWorks and PTC Creo.
Chart: faster simulations, more job instances. Relative performance vs 2 CPU cores (scale 0–5): 2 CPU cores (baseline), 8 CPU cores, and 8 CPU cores + Tesla C2075.
For more information, visit
www.nvidia.co.uk/object/teslaansys-accelerations-uk
Photorealistic rendering
and CAD
Maximus performance for 3ds max 2012 with iray
Relative Performance Scale vs 8 CPU Cores
With Maximus technology, you
can perform rapid photorealistic
rendering of your designs in
applications such as 3ds Max or
Bunkspeed while still using your
system for other work.
Chart: relative performance vs 8 CPU cores (scale 0–10) for Tesla C2075 + Quadro 6000, Tesla C2075 + Quadro 5000, Tesla C2075 + Quadro 4000, Tesla C2075 + Quadro 2000, Quadro 2000 alone, and 8 CPU cores (baseline).
For more information, visit
www.nvidia.co.uk/object/
quadro-3ds-max-uk
Ray tracing with Catia
live rendering
MAXIMUS PERFORMANCE FOR CATIA LIVE RENDERING
Relative Performance Scale vs 8 CPU Cores
With Maximus technology,
photorealistic rendering is
interactive, and you can
simultaneously run other
applications without bogging down
your system.
Chart: relative performance vs 8 CPU cores (scale 0–8) for Tesla C2075 + Quadro 6000, Tesla C2075 + Quadro 4000, Quadro 4000 alone, and 8 CPU cores (baseline).
For more information, visit
www.nvidia.co.uk/object/
quadro-catia-uk
Fast, fluid editing
with premiere pro
ADOBE PREMIERE PRO PRICE/PERFORMANCE*
Adobe Mercury Playback Engine (MPE)
With Maximus technology, you can
relieve the pressure of getting
more done in less time.
Chart: % value increase vs 8 CPU cores (scale 0–600) for Tesla C2075 + Quadro 2000, Quadro 6000, Quadro 5000, Quadro 4000, Quadro 2000, and 8 CPU cores (baseline).
For more information, visit
www.nvidia.co.uk/object/adobe_PremiereproCS5_uk
* Adobe Premiere Pro results obtained from 5 layers with 6 effects per layer, output
to H.264 on a Dell T7500 (48 GB, Windows 7) at 1440 x 1080 resolution. Price/performance
calculated using cost per system and number of clips possible per hour.
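The price/performance metric the footnote describes (clips per hour divided by system cost, relative to a baseline configuration) reduces to a short calculation; every number below is hypothetical, not Adobe's or NVIDIA's data:

```python
def value_increase(clips_per_hour, system_cost, base_clips, base_cost):
    """Percent value increase: throughput per currency unit,
    relative to the baseline configuration (inputs hypothetical)."""
    value = clips_per_hour / system_cost
    base_value = base_clips / base_cost
    return (value / base_value - 1.0) * 100.0

# Hypothetical example: a GPU configuration renders 6x the clips of
# an 8-core CPU baseline at 1.5x the system cost.
print(value_increase(clips_per_hour=60, system_cost=7500,
                     base_clips=10, base_cost=5000))  # 300.0
```

A configuration that merely matches the baseline's clips-per-cost ratio scores 0%, which is why the chart's axis starts at zero for the CPU-only system.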
© 2012 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, NVIDIA Tesla,
CUDA, GigaThread, Parallel DataCache and Parallel NSight are trademarks and/or registered
trademarks of NVIDIA Corporation. All company and product names are trademarks or
registered trademarks of the respective owners with which they are associated. Features,
pricing, availability, and specifications are all subject to change without notice.