PNY Solutions for GPU Accelerated HPC

ABOUT PNY
Founded in 1985, PNY Technologies has over 25 years of excellence in the manufacturing of a full spectrum of high-quality products for everything in and around the computer.
PNY Technologies has a long history of providing designers, engineers and developers with cutting-edge NVIDIA® Quadro™ and Tesla™ solutions. PNY understands the needs of its professional clients, offering professional technical support and a constant commitment to customer satisfaction.
Based on NVIDIA® CUDA™, the revolutionary massively parallel computing architecture, the NVIDIA® Tesla™ solutions by PNY are designed for high performance computing (HPC) and offer a wide range of development tools.
In 2012, PNY Technologies strengthened its presence and offering in the HPC market by becoming the European distributor of TYAN® servers based on NVIDIA® Tesla™ processors.
www.pny.eu
ABOUT TYAN
TYAN designs, manufactures and markets advanced x86 and x86-64 server/workstation board technology,
platforms, and server solution products.
TYAN enables its customers to be technology leaders by providing scalable, highly integrated, and reliable products for a wide range of applications, such as server appliances and solutions for high-performance computing, GPU computing and parallel computing.
www.tyan.com
NVIDIA® TESLA® GPUs ARE REVOLUTIONIZING COMPUTING

WHAT IS GPU COMPUTING
GPU computing is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.

HISTORY OF GPU COMPUTING
Graphics chips started as fixed function graphics pipelines. Over the years, these graphics chips became increasingly programmable.
The high performance computing (HPC) industry’s
need for computation is increasing, as large
and complex computational problems become
commonplace across many industry segments.
Traditional CPU technology, however, is no longer
capable of scaling in performance sufficiently to
address this demand.
The parallel processing capability of the GPU allows
it to divide complex computing tasks into thousands
of smaller tasks that can be run concurrently. This
ability is enabling computational scientists and
researchers to address some of the world’s most
challenging computational problems up to several
orders of magnitude faster.
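To make this data-parallel model concrete, here is a minimal CUDA C sketch of our own (the kernel, variable names and sizes are illustrative and not taken from any application in this brochure): a single array addition is split across roughly a million GPU threads, each handling one element.

#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread computes exactly one output element, so one large task is
// split into roughly a million tiny tasks that execute concurrently.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;                           // ~1 million elements
    const size_t bytes = n * sizeof(float);

    // Allocate and fill host arrays
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device arrays and copy the inputs to the GPU
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocks, threadsPerBlock>>>(da, db, dc, n);

    // Copy the result back and check one value
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);                   // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}

On a CPU the same loop would run serially or across a handful of cores; on the GPU every element is handled by its own lightweight thread.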
[Figure: Processor performance vs. VAX, 1978-2006 (Source: Hennessy & Patterson, CAAQA, 4th Edition). Annual CPU performance growth rates of 25%, then 52%, then roughly 20% per year, while GPU performance keeps scaling toward a projected 100x performance advantage by 2021. Conventional CPU computing architecture can no longer support the growing HPC needs.]
WHY GPU COMPUTING?
With the ever-increasing demand for more computing
performance, the HPC industry is moving toward a hybrid
computing model, where GPUs and CPUs work together
to perform general purpose computing tasks. As parallel
processors, GPUs excel at tackling large amounts of similar
data because the problem can be split into hundreds or
thousands of pieces and calculated simultaneously.
As sequential processors, CPUs are not designed for this type of computation, but they are adept at more serial tasks such as running operating systems and organizing data. NVIDIA believes in applying the most relevant processor to the specific task at hand.
CORE COMPARISON BETWEEN A CPU AND A GPU
[Figure: a CPU with multiple cores next to a GPU with hundreds of cores. The new computing model includes both a multi-core CPU and a GPU with hundreds of cores.]
GPU SUPERCOMPUTING — GREEN HPC
GPUs significantly increase overall system efficiency as measured by performance per watt. "Top500" supercomputers based on heterogeneous architectures are, on average, almost three times more power-efficient than non-heterogeneous systems. This is also reflected in the Green500 list, the benchmark of eco-friendly supercomputing.
“
The rise of GPU supercomputers
on the Green500 signifies that
heterogeneous systems, built with both
GPUs and CPUs, deliver the highest
performance and unprecedented energy
efficiency,”
said Wu-chun Feng,
founder of the Green500 and
associate professor of
Computer Science at Virginia Tech.
PROGRAMMING ENVIRONMENT
FOR GPU COMPUTING
NVIDIA’S CUDA
The CUDA architecture has the industry's most robust language and API support for GPU computing developers, including C, C++, OpenCL, DirectCompute, and Fortran. NVIDIA Parallel Nsight, a fully integrated development environment for Microsoft Visual Studio, is also available. Used by more than six million developers worldwide, Visual Studio is one of the world's most popular development environments for Windows-based applications and services. By adding functionality specifically for GPU computing developers, Parallel Nsight makes the power of the GPU more accessible than ever before. The latest version, CUDA 4.0, adds a host of exciting new features that make parallel computing easier, among them the ability to relieve bus traffic by enabling direct GPU-to-GPU communication.
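That GPU-to-GPU path is exposed through the CUDA runtime's peer-to-peer calls. The following minimal sketch is our own illustrative example, not NVIDIA or PNY sample code; the device IDs and buffer size are assumptions. It copies a buffer directly from one Tesla board to another without staging it through host memory.

#include <cstdio>
#include <cuda_runtime.h>

// Minimal CUDA 4.0-style peer-to-peer sketch: copy a buffer from GPU 0 to
// GPU 1 directly, without staging through host memory.
int main()
{
    const size_t bytes = 64 * 1024 * 1024;           // 64 MB test buffer

    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);     // can GPU 0 reach GPU 1?
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    if (!canAccess01 || !canAccess10) {
        printf("Peer access not supported on this system.\n");
        return 0;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);                // enable 0 -> 1
    float *buf0;
    cudaMalloc(&buf0, bytes);
    cudaMemset(buf0, 0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);                // enable 1 -> 0
    float *buf1;
    cudaMalloc(&buf1, bytes);

    // Direct GPU-to-GPU copy; with peer access enabled the driver can route
    // the transfer over PCIe without a host-memory bounce buffer.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    printf("Copied %zu bytes from GPU 0 to GPU 1.\n", bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}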
RESEARCH &
EDUCATION
INTEGRATED
DEVELOPMENT
ENVIRONMENT
LANGUAGES
& APIS
LIBRARIES
CUDA
ALL MAJOR
PLATFORMS
MATHEMATICAL
PACKAGES
CONSULTANTS,
TRAINING, &
CERTIFICATION
TOOLS
& PARTNERS
Order a personalized CUDA education course at:
www.parallel-computing.pro
ACCELERATE YOUR CODE EASILY
WITH OPENACC DIRECTIVES
GET 2X SPEED-UP IN 4 WEEKS OR LESS
Accelerate your code with directives and tap into the
hundreds of computing cores in GPUs. With directives,
you simply insert compiler hints into your code and
the compiler will automatically map compute-intensive
portions of your code to the GPU.
By starting with a free, 30-day trial of PGI directives today,
you are working on the technology that is the foundation of
the OpenACC directives standard.
OpenACC is:
• Easy: simply insert hints in your codebase
• Open: run a single codebase on either the CPU or GPU
• Powerful: tap into the power of GPUs within hours
OpenACC DIRECTIVES
The OpenACC Application Program Interface describes a
collection of compiler directives to specify loops and regions
of code in standard C, C++ and Fortran to be offloaded
from a host CPU to an attached accelerator, providing
portability across operating systems, host CPUs and
accelerators.
The directives and programming model defined in
this document allow programmers to create high-level
host+accelerator programs without the need to explicitly
initialize the accelerator, manage data or program transfers
between the host and accelerator, or initiate accelerator
startup and shutdown.
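As a brief, hedged illustration of such a directive (a minimal sketch in standard C, not taken from the PGI or OpenACC documentation; the file, loop and variable names are our own), a single pragma is enough to ask an OpenACC-capable compiler such as PGI to offload a compute-intensive loop to the GPU.

#include <stdio.h>

#define N 1000000

/* One OpenACC directive asks the compiler to offload the loop to the
 * accelerator and to manage the movement of a, b and c itself.
 * Build with an OpenACC-capable compiler, e.g. "pgcc -acc saxpy.c". */
int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (int i = 0; i < N; ++i)
        c[i] = 2.0f * a[i] + b[i];

    printf("c[0] = %f\n", c[0]);   /* expect 4.0 */
    return 0;
}

Built without accelerator support, the pragma is simply ignored and the same code runs on the CPU, which is what makes the single-codebase claim above work.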
www.nvidia.eu/openacc
A WIDE RANGE OF GPU ACCELERATED APPLICATIONS

ISV / APPLICATION | SUPPORTED FEATURES | EXPECTED SPEED UP*

ENGINEERING: COMPUTATIONAL FLUID DYNAMICS, STRUCTURAL MECHANICS, EDA
ANSYS / ANSYS Mechanical | Direct & iterative solvers | 2x
SIMULIA / Abaqus/Standard | Direct solver | 1.5-2.5x
IMPETUS / Afea | SPH, blast simulations | 10x SPH, 2x
MSC Nastran | Nastran: direct solver | 1.4-2x**
FluiDyna LBultra | Lattice-Boltzmann | 20x
Vratis SpeedIT - OpenFOAM solver | Linear equation solvers | 3x
FluiDyna Culises - OpenFOAM solver | Linear equation solvers | 3x
Agilent Technologies EMPro | FDTD solver | 6x
CST Microwave Studio (MWS) | Transient solver | 9x-20x
SPEAG SEMCAD-X | FDTD solver | 100x

BUSINESS APPLICATIONS: COMPUTATIONAL FINANCE, DATA MINING, NUMERICAL ANALYSIS
Murex | MACS analytics library | 60x-400x
MathWorks MATLAB | Support for over 200 common MATLAB functions | 2-20x
Numerical Algorithms Group | Random number generators, Brownian bridges, and PDE solvers | 50x
SciComp, Inc | Monte Carlo and PDE pricing models | 30x-50x (MC), 10x-35x (PDE)
Wolfram Mathematica | Development environment for CUDA | 2-20x
AccelerEyes Jacket for MATLAB | Support for several hundred common MATLAB functions | 2-20x
Jedox Palo | Extends Excel with OLAP for planning & analysis | 20-40x
ParStream | Database and data analysis acceleration using GPUs | 10x

OIL AND GAS
GeoStar RTM | Seismic imaging | 4x-20x
GeoMage Multifocusing | Seismic imaging | 4x-20x
PanoramaTech Marvel | Seismic modeling, imaging, inversion | 4x-20x
Paradigm Echos, SKUA, VoxelGeo | Seismic imaging, interpretation, reservoir modeling | 5x-40x
Seismic City RTM | Seismic imaging | 4x-20x
Tsunami A2011 | Seismic imaging | 4x-20x

LIFE SCIENCES: BIO-INFORMATICS, MOLECULAR DYNAMICS, QUANTUM CHEMISTRY, MEDICAL IMAGING
AMBER | PMEMD: explicit and implicit solvent | 8x
GROMACS | Implicit (5x) and explicit (2x) solvent | 2x-5x
HOOMD-Blue | Written for GPUs (32 CPU cores vs 2 10xx GPUs) | 2x
LAMMPS | Lennard-Jones, Gay-Berne | 3.5-15x
NAMD | Non-bonded force calculation | 2x-7x
VMD | High quality rendering, large structures (100 million atoms) | 125x
TeraChem | "Full GPU-based solution" | 44-650x
VASP | Hybrid Hartree-Fock DFT functionals including exact exchange | 2 GPUs comparable to up to 8 (8-core) CPUs
GPU-HMMER | hmmsearch tool | 60x-100x
Digisens | Computing/reconstruction algorithms, pre- and post-filtering | 100x

* Expected speed-up vs. a quad-core x64 CPU based system. Speed-ups as per NVIDIA in-house testing or application provider documentation.
** Nastran: 5x with single GPU over single core and 1.5 to 2x with 2 GPUs over 8 cores. Marc: 5x with single GPU over single core runs and 2x with 2 GPUs (with DMP=2). (2 quad-core Nehalem CPUs.)
GPU ACCELERATION IN LIFE SCIENCES
TESLA® BIO WORKBENCH
The NVIDIA Tesla Bio Workbench enables biophysicists
and computational chemists to push the boundaries
of life sciences research. It turns a standard PC into a
“computational laboratory” capable of running complex
bioscience codes, in fields such as drug discovery and DNA
sequencing, more than 10-20 times faster through the use of
NVIDIA Tesla GPUs.
It consists of bioscience applications; a community site for downloading, discussing, and viewing the results of these applications; and GPU-based platforms.

Complex molecular simulations that were previously only possible using supercomputing resources can now be run on an individual workstation, optimizing the scientific workflow and accelerating the pace of research. These simulations can also be scaled up to GPU-based clusters of servers to simulate large molecules and systems that would otherwise have required a supercomputer.

Applications that are accelerated on GPUs include:
• Molecular Dynamics & Quantum Chemistry: AMBER, GROMACS, HOOMD, LAMMPS, NAMD, TeraChem (Quantum Chemistry), VMD
• Bio-Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDASW++ (Smith-Waterman), GPU-HMMER, MUMmerGPU

[Figure: JAC NVE benchmark, relative performance scale (normalized ns/day). 2 quad-core CPUs + 4 GPUs (2 Intel Xeon quad-core CPUs and 4 Tesla M2090 GPUs) are 50% faster than 192 quad-core CPUs (simulation run on the Kraken supercomputer).]
For more information, visit:
www.nvidia.co.uk/bio_workbench
AMBER

AMBER: NODE PROCESSING COMPARISON
[Figure: 4 Tesla M2090 GPUs + 2 CPUs deliver 69 ns/day vs. 46 ns/day for 192 quad-core CPUs.]
Researchers today are solving the world’s
most challenging and important problems.
From cancer research to drugs for AIDS,
computational research is bottlenecked by
simulation cycles per day. More simulations
mean faster time to discovery. To tackle these
difficult challenges, researchers frequently rely
on national supercomputers for computer
simulations of their models.
GPUs offer every researcher supercomputer-like
performance in their own office. Benchmarks
have shown four Tesla M2090 GPUs
significantly outperforming the existing world
record on CPU-only supercomputers.
RECOMMENDED HARDWARE CONFIGURATION

Workstation
• 4x Tesla C2075
• Dual-socket quad-core CPU
• 24 GB system memory

Server
• Up to 8x Tesla M2090s in a cluster
• Dual-socket quad-core CPU per node
• 128 GB system memory
GROMACS

GROMACS is a molecular dynamics package designed primarily for the simulation of biochemical molecules such as proteins, lipids, and nucleic acids that have a lot of complicated bonded interactions. The CUDA port of GROMACS enabling GPU acceleration supports Particle-Mesh Ewald (PME), arbitrary forms of non-bonded interactions, and implicit solvent Generalized Born methods.

Figure 4: Absolute performance of GROMACS (ns/day) running CUDA- and SSE-accelerated non-bonded kernels with PME on 3-12 CPU cores (3T-12T) and 1-4 GPUs (Tesla C2075). Simulations with cubic and truncated dodecahedron cells, pressure coupling, as well as virtual interaction sites enabling 5 fs time steps are shown.
MSC NASTRAN, MARC
5X PERFORMANCE BOOST WITH SINGLE GPU OVER SINGLE CORE,
>1.5X WITH 2 GPUs OVER 8 CORE
[Figure: Speed-up for SOL101, 3.4M DOF (total and solver times on fs0*), comparing 1 core, 1 GPU + 1 core, 4 cores (SMP), 8 cores (DMP=2), and 2 GPUs + 2 cores (DMP=2): a single GPU delivers 4.6x-5.6x over a single core, and 2 GPUs deliver 1.6x-1.8x over 8 cores.]
* fs0 = NVIDIA PSG cluster node: Linux, 96 GB memory, Tesla C2050, Nehalem 2.27 GHz, 2.2 TB SATA 5-way striped RAID.

• Nastran direct equation solver is GPU accelerated
  – Real, symmetric sparse direct factorization
  – Handles very large fronts with minimal use of pinned host memory
  – Impacts SOL101 and SOL400 runs that are dominated by MSCLDL factorization times
  – More of Nastran (SOL108, SOL111) will be moved to the GPU in stages
• Support for multiple GPUs and for both Linux and Windows
  – With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs; 1 GPU per matrix domain
  – NVIDIA GPUs supported include Tesla 20-series and Quadro 6000
  – CUDA 4.0 and above
SIMULIA ABAQUS/STANDARD
REDUCE ENGINEERING SIMULATION TIMES BY HALF
With GPUs, engineers can run Abaqus simulations twice as fast. A leading car manufacturer and NVIDIA customer reduced the simulation time of an engine model from 90 minutes to 44 minutes with GPUs. Faster simulations enable designers to simulate more scenarios to achieve, for example, a more fuel-efficient engine.
As products get more complex, the task of innovating with confidence has become increasingly difficult for product engineers. Engineers rely on Abaqus to understand the behavior of complex assemblies or of new materials.

[Figure: Abaqus 6.12 multi-GPU execution (24 cores, 2 hosts, 48 GB memory per host). Speed-up vs. CPU only plotted against the number of equations (1.5 to 3.8 million) for 1 GPU/host and 2 GPUs/host configurations.]
GPU ACCELERATED ENGINEERING
ANSYS: SUPERCOMPUTING FROM YOUR WORKSTATION
WITH NVIDIA TESLA GPU
“
A new feature in ANSYS Mechanical
leverages graphics processing units to
significantly lower solution times for large
analysis problem sizes.”
By Jeff Beisheim,
Senior Software Developer,
ANSYS, Inc.
With ANSYS® Mechanical™ 13.0 and NVIDIA® Professional
GPUs, you can:
• Improve product quality with 2x more design
simulations
• Accelerate time-to-market by reducing engineering
cycles
• Develop high fidelity models with practical solution
times
How much more could you accomplish if simulation times
could be reduced from one day to just a few hours? As an
engineer, you depend on ANSYS Mechanical to design high
quality products efficiently. To get the most out of ANSYS
Mechanical 13.0, simply upgrade your Quadro GPU or
add a Tesla GPU to your workstation, or configure a server
with Tesla GPUs, and instantly unlock the highest levels of
ANSYS simulation performance.
FUTURE DIRECTIONS
As GPU computing trends evolve, ANSYS will continue to
enhance its offerings as necessary for a variety of simulation
products. Certainly, performance improvements will
continue as GPUs become computationally more powerful
and extend their functionality to other areas of ANSYS
software.
20X FASTER SIMULATIONS WITH GPUs
DESIGN SUPERIOR PRODUCTS
WITH CST MICROWAVE STUDIO
RECOMMENDED TESLA CONFIGURATIONS

Workstation
• 4x Tesla C2075
• Dual-socket quad-core CPU
• 48 GB system memory

Server
• 4x Tesla M2090
• Dual-socket quad-core CPU
• 48 GB system memory
What can product engineers achieve if a single simulation run-time is reduced from 48 hours to 3 hours? CST Microwave Studio is one of the most widely used electromagnetic simulation packages, and some of the largest customers in the world today are leveraging GPUs to introduce their products to market faster and with more confidence in the fidelity of the product design.

[Figure: Relative performance vs. CPU (2x Intel Xeon X5620): 9x faster with one Tesla GPU (2x CPU + 1x C2075) and 23x faster with four Tesla GPUs (2x CPU + 4x C2075). Benchmark: BGA models, 2M to 128M mesh models, CST MWS transient solver; the single-GPU run was performed only on the 50M mesh model.]
NAMD
NAMD 2.8 BENCHMARK
Scientists and researchers equipped with powerful GPU accelerators are making discoveries that were previously impossible. See how other computational researchers are experiencing supercomputer-like performance in a small cluster, and take your research to new heights.

The team at the University of Illinois at Urbana-Champaign (UIUC) has been enabling CUDA acceleration in NAMD since 2007, and the results are simply stunning. NAMD users are experiencing tremendous speed-ups in their research using Tesla GPUs. The benchmark below shows that 4 GPU server nodes outperform 16 CPU server nodes. It also shows that GPUs scale out better than CPUs as more nodes are added.

[Figure: NAMD 2.8b1 STMV benchmark, ns/day vs. number of compute nodes (2 to 16) for GPU+CPU and CPU-only configurations. A node is dual-socket, quad-core X5650 with 2 Tesla M2070 GPUs. NAMD courtesy of the Theoretical and Computational Biophysics Group, UIUC.]
LAMMPS
LAMMPS is a classical molecular dynamics package written
to run well on parallel machines and is maintained and
distributed by Sandia National Laboratories in the USA. It is
a free, open-source code.
LAMMPS has potentials for soft materials (biomolecules,
polymers) and solid-state materials (metals,
semiconductors) and coarse-grained or mesoscopic
systems.
The CUDA version of LAMMPS is accelerated by moving
the force calculations to the GPU.
[Figure: LAMMPS speed-up vs. dual-socket CPUs plotted against the number of GPUs (3, 6, 12, 24, 48, 96). Benchmark system details: each node has 2 CPUs and 3 GPUs; the CPU is an Intel six-core Xeon; speed-up measured with Tesla M2070 GPUs vs. without GPUs. Source: http://users.nccs.gov/~wb8/gpu/kid_single.htm]

RECOMMENDED HARDWARE CONFIGURATION

Workstation
• 4x Tesla C2075
• Dual-socket quad-core CPU
• 24 GB system memory

Server
• 2x-4x Tesla M2090 per node
• Dual-socket quad-core CPU per node
• 128 GB system memory per node
MATLAB ACCELERATION ON TESLA® GPUs

NVIDIA and MathWorks have collaborated to deliver the power of GPU computing to MATLAB users. Available through the latest release of MATLAB 2010b, NVIDIA GPU acceleration enables faster results for users of the Parallel Computing Toolbox and MATLAB Distributed Computing Server. MATLAB supports NVIDIA® CUDA™-enabled GPUs with compute capability 1.3 or higher, such as Tesla™ 10-series and 20-series GPUs. MATLAB CUDA support provides the base for GPU-accelerated MATLAB operations and lets you integrate your existing CUDA kernels into MATLAB applications.

The latest release of Parallel Computing Toolbox and MATLAB Distributed Computing Server takes advantage of the CUDA parallel computing architecture to give users the ability to:
• Manipulate data on NVIDIA GPUs
• Perform GPU-accelerated MATLAB operations
• Integrate their own CUDA kernels into MATLAB applications
• Compute across multiple NVIDIA GPUs by running multiple MATLAB workers with Parallel Computing Toolbox on the desktop and MATLAB Distributed Computing Server on a compute cluster
For more information, visit:
www.nvidia.co.uk/matlab
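As a brief, hedged sketch of kernel integration (illustrative only; the kernel name, file name and MATLAB workflow around it are our assumptions, not content from this brochure), a CUDA kernel written in a .cu file can be compiled to PTX with nvcc (e.g. nvcc -ptx saxpy.cu) and then wrapped from MATLAB.

// saxpy.cu - a self-contained CUDA kernel intended to be compiled to PTX
// and then wrapped from MATLAB's Parallel Computing Toolbox.
// Kernel and file names are illustrative.
extern "C" __global__ void saxpy(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n)
        y[i] = a * x[i] + y[i];                      // y = a*x + y
}

From MATLAB, one would typically create the kernel object with parallel.gpu.CUDAKernel('saxpy.ptx', 'saxpy.cu'), set its ThreadBlockSize and GridSize properties, and call feval on gpuArray inputs; see the MathWorks documentation for the exact workflow in your release.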
TESLA BENEFITS

Highest Computational Performance
• High-speed double precision operations
• Large dedicated memory
• High-speed bi-directional PCIe communication
• NVIDIA GPUDirect™ with InfiniBand

Most Reliable
• ECC memory
• Rigorous stress testing

Best Supported
• Professional support network
• OEM system integration
• Long-term product lifecycle
• 3 year warranty
• Cluster & system management tools (server products)
• Windows remote desktop support

[Figure: Speed-up of computations. Relative performance of GPU compared to CPU for single precision and double precision as a function of matrix size (0 to 4500).]

RECOMMENDED TESLA AND QUADRO CONFIGURATIONS

High-End Workstation
• Two Tesla C2075
• Quadro 4000
• Two quad-core CPUs
• 12 GB system memory

Mid-Range Workstation
• Tesla C2075
• Quadro 6000
• Quad-core CPU
• 8 GB system memory

Entry Workstation
• Quadro 4000 GPU
• Single quad-core CPU
• 4 GB system memory
LBULTRA PLUG-IN FOR RTT DELTAGEN
LBULTRA DEVELOPED BY FLUIDYNA
FAST FLOW SIMULATIONS DIRECTLY WITHIN
RTT DELTAGEN
IMPROVED PERFORMANCE THROUGH EARLY INVOLVEMENT
OF AERODYNAMICS IN THE DESIGN PROCESS
The flow simulation software LBultra runs particularly fast on graphics processing units (GPUs). As a plug-in prototype, it is tightly integrated into RTT DeltaGen, the high-end 3D visualization software by RTT. As a consequence, flow simulations can be computed directly within RTT DeltaGen.
First, a scenario is selected, such as analyzing a spoiler or an outside mirror. Next, various simulation parameters and boundary conditions, such as flow rates or resolution levels, are set; these also influence the calculation time and the accuracy of the results.

Thanks to this coupling, the designer can run a flow simulation directly on their latest design draft, checking the aerodynamic characteristics of the vehicle's design in parallel with the design work itself.

Once the simulation is set up, the geometry of the design is handed over to LBultra and the simulation is started. While the simulation is running, the data is visualized in real time in RTT DeltaGen.
In addition, simulation results can be exported. These results may then, for instance, be analyzed in more detail by experts using specialized fluid mechanics programs and tools.

[Figure: Relative performance vs. CPU (Intel quad-core Xeon): 20x faster with 1 Tesla GPU (Tesla C2075) and 75x faster with 3 Tesla GPUs (3x Tesla C2075).]
GPU COMPUTING IN FINANCE – CASE STUDIES
BLOOMBERG: GPUs INCREASE ACCURACY AND REDUCE PROCESSING TIME FOR BOND PRICING
Bloomberg implemented an NVIDIA
Tesla GPU computing solution in their
datacenter. By porting their application
to run on the NVIDIA CUDA parallel processing architecture, Bloomberg received dramatic improvements across the board.
As Bloomberg customers make crucial
buying and selling decisions, they now
have access to the best and most current
pricing information, giving them a serious
competitive trading advantage in a market where timing is everything.
[Comparison, 48 GPUs vs. 2,000 CPUs: 42x lower space; 28x lower cost ($144K vs. $4 million); 38x lower power cost ($31K/year vs. $1.2 million/year).]
NVIDIA TESLA GPUs USED BY J.P. MORGAN TO RUN RISK CALCULATIONS IN MINUTES, NOT
HOURS
THE CHALLENGE
Risk management is a huge and increasingly costly
focus for the financial services industry. A cornerstone
of J.P. Morgan’s cost-reduction plan to cut the cost of
risk calculation involves accelerating its risk library. It is
imperative to reduce the total cost of ownership of J.P.
Morgan’s risk-management platform and create a leap
forward in the speed with which client requests can be
serviced.
THE SOLUTION
J.P. Morgan’s Equity Derivatives Group added NVIDIA®
Tesla M2070 GPUs to its datacenters. More than half the
equity derivative-focused risk computations run by the bank
have been moved to running on hybrid GPU/CPU-based
systems, NVIDIA Tesla GPUs were deployed in multiple
data centers across the bank’s global offices. J.P. Morgan was
able to seamlessly share the GPUs between tens of global
applications.
THE IMPACT
Utilizing GPUs has accelerated application performance by
40X and delivered over 80 percent savings, enabling greener
data centers that deliver higher performance for the same
power. For J.P. Morgan, this is game-changing technology,
enabling the bank to calculate risk across a range of
products in a matter of minutes rather than overnight. Tesla
GPUs give J.P. Morgan a significant market advantage.
Your First Choice in GPU Solutions
NVIDIA Tesla M2090 / M2075 Validated Platforms
Model Number | Tesla M2090 | Tesla M2075
GPGPU Performance | Highest performance | Mid-range performance
GPU Computing Applications | Seismic processing, CFD, CAE, financial computing, computational chemistry and physics, data analytics, satellite imaging, weather modeling
Peak double precision floating point performance | 665 Gigaflops | 515 Gigaflops
Peak single precision floating point performance | 1331 Gigaflops | 1030 Gigaflops
Memory bandwidth (ECC off) | 177 GBytes/sec | 150 GBytes/sec
Memory size (GDDR5) | 6 GigaBytes | 6 GigaBytes
CUDA cores | 512 | 448
Model Number | FT48-B7055 | GN70-B7056
Tesla GPU Support | x1 to x4 | x1 or x2
Power Supply Efficiency | GOLD | GOLD
Enclosure Form Factor | 4U (27.5" in depth) | 2U (27.56" in depth)
Supported CPU | (2) Intel® Xeon® E5-2600 Series (Sandy Bridge-EP) | (2) Intel® Xeon® E5-2600 Series (Sandy Bridge-EP)
Chipset | Intel® C602 PCH | Intel® C602 PCH
Chipset Interconnection | Intel® QPI 8.0/7.2/6.4 GT/s | Intel® QPI 8.0/7.2/6.4 GT/s
Number of DIMM Slots | 8 (4+4) | 16 (8+8)
Memory Type (max. capacity) | R-DDR3 1600/1333/1066/800 w/ ECC (256GB); U-DDR3 1600/1333/1066 w/ ECC (64GB) | R-DDR3 1600/1333/1066/800 w/ ECC (512GB); U-DDR3 1333/1066 w/ ECC (128GB)
Storage Controller | Intel® C602 PCH (SATA 6Gb/s & 3Gb/s) | Intel® C602 PCH (SATA 6Gb/s & 3Gb/s)
RAID Support | 0, 1, 5, 10 (Intel® RSTe 3.0) | 0, 1, 5, 10 (Intel® RSTe 3.0)
Networking | (3) GbE (shared IPMI NIC), or (2) 10GbE + (1) GbE (shared IPMI NIC) | (3) GbE (shared IPMI NIC), or (2) 10GbE + (1) GbE (shared IPMI NIC)
PCI Expansion Slots | (4) PCI-E (Gen.3) x16 + (2) PCI-E (Gen.3) x8 + (1) PCI 32/33MHz | (2) PCI-E (Gen.3) x16 + (2) PCI-E (Gen.3) x8
Number of HDD Bays | (8) hot-swap 3.5" | (8) hot-swap 3.5"
Storage Backplane | (2) 4-port SAS/SATA 6Gb/s | 8-port SAS/SATA
Power Supply | (2+1) 1,540W RPSU | (1+1) 770W RPSU
Standard Model | B7055F48W8HR (BTO) | B7056G70V8HR (BTO)
Please contact us for all BTO part numbers.
Model Number | FT72-B7015 | FT77-B7015
Tesla GPU Support | x1 to x8 | x1 to x8
Power Supply Efficiency | SILVER | SILVER
Enclosure Form Factor | 4U (28" in depth) | 4U (28" in depth)
Supported CPU | (2) Intel® Xeon® 5500/5600 Series (Nehalem/Westmere) | (2) Intel® Xeon® 5500/5600 Series (Nehalem/Westmere)
Chipset | Intel® (2) 5520 + ICH10R | Intel® (2) 5520 + ICH10R
Chipset Interconnection | Intel® QPI 6.4/5.86/4.8 GT/s | Intel® QPI 6.4/5.86/4.8 GT/s
Number of DIMM Slots | 18 (9+9) | 18 (9+9)
Memory Type (max. capacity) | R-DDR3 1333/1066/800 w/ ECC (144GB); U-DDR3 1333/1066 w/ ECC (48GB) | R-DDR3 1333/1066/800 w/ ECC (144GB); U-DDR3 1333/1066 w/ ECC (48GB)
Storage Controller | Intel® ICH10R 6-port SATA-II | Intel® ICH10R 6-port SATA-II
RAID Support | 0, 1, 10, 5 (Intel® Matrix RAID) | 0, 1, 10, 5 (Intel® Matrix RAID)
Networking | (4) GbE | (4) GbE
PCI Expansion Slots | (8) PCI-E (Gen.2) x16; (2) PCI-E (Gen.2) x16 (w/ x4 link); (1) PCI-E (Gen.2) x4; (1) PCI 32/33MHz | (8) PCI-E (Gen.2) x16; (2) PCI-E (Gen.2) x16 (w/ x4 link); (1) PCI-E (Gen.2) x4; (1) PCI 32/33MHz
Number of HDD Bays | (2) fixed 2.5" SATA-II | (4) fixed 2.5" SATA-II
Storage Backplane | N/A | N/A
Multimedia Drive | N/A | N/A
Power Supply | (2+1) 2,400W RPSU | (2+1) 2,400W RPSU
Standard Model | B7015F72V2R (BTO) | B7015F77V4R (BTO)
Please contact us for all BTO part numbers.
Model Number | GN70-B8236-HE-IL
Tesla GPU Support | x1 or x2
Power Supply Efficiency | GOLD
Enclosure Form Factor | 2U (27.56" in depth)
Supported CPU | (2) AMD Opteron™ 6200 Series (Interlagos)
Chipset | AMD SR5690 + SR5650 + SP5100
Chipset Interconnection | HyperTransport™ Link 3.0
Number of DIMM Slots | 16 (8+8)
Memory Type (max. capacity) | R-DDR3 1600/1333/1066/800 w/ ECC (256GB); U-DDR3 1333/1066 w/ ECC (128GB)
Storage Controller | LSI SAS2008 (SAS 6Gb/s)
RAID Support | 0, 1, 1E, 10 (LSI RAID stack)
Networking | (3) GbE (shared IPMI NIC)
PCI Expansion Slots | (2) PCI-E (Gen.2) x16 + (2) PCI-E (Gen.2) x8
Number of HDD Bays | (8) hot-swap 3.5"
Storage Backplane | 8-port SAS/SATA 6Gb/s
Power Supply | (1+1) 770W RPSU
Standard Model | B8236G70W8HR-HE-IL (BTO)
Please contact us for all BTO part numbers.
PNY & TYAN IS COMMITTED TO SUPPORT
TESLA® KEPLER GPU ACCELERATORS1
Tesla K10 GPU Computing Accelerator – Optimized for
single precision applications, the Tesla K10 is a throughput
monster based on the ultra-efficient GK104 Kepler GPU. The
accelerator board features two GK104 GPUs and delivers up to
2x the performance for single precision applications compared
to the previous generation Fermi-based Tesla M2090 in the
same power envelope. With an aggregate performance of
4.58 teraflops peak single precision and 320 gigabytes per
second memory bandwidth for both GPUs put together, the
Tesla K10 is optimized for
computations in seismic,
signal, image processing,
and video analytics.
Tesla K20 GPU Computing Accelerator – Designed for
double precision applications and the broader supercomputing
market, the Tesla K20 delivers 3x the double precision
performance compared to the previous generation Fermi-based Tesla M2090, in the same power envelope. Tesla
K20 features a single GK110 Kepler GPU that includes the
Dynamic Parallelism and Hyper-Q features. With more than
one teraflop peak double precision performance, the Tesla
K20 is ideal for a wide range of high performance computing
workloads including climate
and weather modeling, CFD,
CAE, computational physics,
biochemistry simulations, and
computational finance.
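To give a flavor of the Dynamic Parallelism feature named above, here is a hedged sketch of our own (the kernel names and sizes are illustrative): a kernel already running on the GPU launches child kernels itself, without returning control to the CPU. It requires a GK110-class GPU such as the Tesla K20 and compilation with relocatable device code and the device runtime, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt.

#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: fills one tile of the output array.
__global__ void childFill(float *data, int offset, int tileSize)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + tileSize)
        data[i] = (float)i;
}

// Parent kernel: with Dynamic Parallelism (compute capability 3.5+),
// each parent thread launches a child grid directly from the GPU.
__global__ void parentLaunch(float *data, int tileSize)
{
    int offset = threadIdx.x * tileSize;
    childFill<<<(tileSize + 255) / 256, 256>>>(data, offset, tileSize);
}

int main()
{
    const int tiles = 8, tileSize = 1024, n = tiles * tileSize;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    parentLaunch<<<1, tiles>>>(d, tileSize);   // 8 parent threads, 8 child grids
    cudaDeviceSynchronize();

    float h = 0.0f;
    cudaMemcpy(&h, d + n - 1, sizeof(float), cudaMemcpyDeviceToHost);
    printf("last element = %f\n", h);          // expect 8191.0
    cudaFree(d);
    return 0;
}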
TECHNICAL SPECIFICATIONS

 | TESLA K10² | TESLA K20
Peak double precision floating point performance (board) | 0.19 teraflops | To be announced
Peak single precision floating point performance (board) | 4.58 teraflops | To be announced
Number of GPUs | 2 x GK104 | 1 x GK110
CUDA cores | 2 x 1536 | To be announced
Memory size per board (GDDR5) | 8 GB | To be announced
Memory bandwidth for board (ECC off)³ | 320 GBytes/sec | To be announced
GPU Computing Applications | Seismic, image, signal processing, video analytics | CFD, CAE, financial computing, computational chemistry and physics, data analytics, satellite imaging, weather modeling
Architecture Features | SMX | SMX, Dynamic Parallelism, Hyper-Q
System | Servers only | Servers and workstations
Available | May 2012 | Q4 2012
¹ Products and availability are subject to confirmation.
² Tesla K10 specifications are shown as the aggregate of two GPUs.
³ With ECC on, 12.5% of the GPU memory is used for ECC bits. So, for example, 6 GB total memory yields 5.25 GB of user-available memory with ECC on.
Offering pre- and post-sales assistance, a three-year standard warranty, professional technical support, and an unwavering commitment to customer satisfaction, our partners and customers experience firsthand why PNY is considered a market leader in the professional industry.
PNY Professional Solutions are offered in cooperation with qualified distributors, specialty retailers, computer retailers,
and system integrators.
To find a qualified PNY Technologies partner visit: www.pny.eu/wheretobuy.php
Germany, Austria, Switzerland,
Russia, Eastern Europe
Email: vertrieb@pny.eu
Hotline Presales:
+49 (0)2405/40848-55
United Kingdom, Denmark,
Sweden, Finland, Norway
Email: quadrouk@pny.eu
Hotline Presales:
+49 (0)2405/40848-55
France, Belgium, Netherlands,
Luxembourg
Email: sales@pny.eu
Hotline Presales:
+33 (0)5 56 13 75 75
WARRANTY
PNY Technologies offers a 3-year manufacturer warranty on all Tesla based systems in accordance with PNY Technologies' Terms of Guarantee.
SUPPORT
PNY offers individual support as well as comprehensive online support. Our support websites provide FAQs, the latest
information and technical data sheets to ensure you receive the best performance from your PNY product.
Support Contact Information
Business hours: 9:00 a.m. - 5:00 p.m.
Germany, Austria, Switzerland,
Russia, Eastern Europe
Email: tech-sup-ger@pny.de
Hotline Support:
+49 (0)2405/40848-40
United Kingdom, Denmark,
Sweden, Finland, Norway
Email: tech-sup@pny.eu
Hotline Support:
+49 (0)2405/40848-40
France, Belgium, Netherlands,
Luxembourg
Email: tech-sup@pny.eu
Hotline Support:
+33 (0)55613-7532
PNY Technologies Quadro GmbH
Schumanstraße 18a
52146 Würselen
Germany
Tel: +49 (0)2405/40848-0
Fax: +49 (0)2405/40848-99
kontakt@pny.eu
PNY Technologies United Kingdom
Basepoint Business & Innovation Centre
110 Butterfield Great Marlings
Luton, Bedfordshire
LU2 8DL
United Kingdom
Tel : +44 (0)870 423 1103
Fax: +44 (0)870 423 1104
quadrouk@pny.eu
CONTACT
PNY Technologies Europe
Zac du Phare
9 rue Joseph Cugnot - BP181
33708 Mérignac Cedex
France
Tel: +33 (0)5 56 13 75 75
Fax: +33 (0)5 56 13 75 76
sales@pny.eu
WHERE TO BUY
www.pny.eu