Performance Analysis: Intel Xeon Phi Coprocessor 7120P in the Dell PowerEdge R720 Server
This Dell technical white paper explains the performance gain and power efficiency of the
Intel Xeon Phi Coprocessor 7120P in the Dell PowerEdge R720 server.
Saeed Iqbal, Shawn Gao, and Kevin Tubbs
High Performance Computing Engineering
Performance Analysis: Intel® Xeon Phi™ Coprocessor 7120P on Dell™ PowerEdge™ R720 Server
This document is for informational purposes only and may contain typographical errors and
technical inaccuracies. The content is provided as is, without express or implied warranties of any
kind.
© 2013 Dell Inc. All rights reserved. Dell and its affiliates cannot be responsible for errors or omissions
in typography or photography. Dell, the Dell logo, and PowerEdge are trademarks of Dell Inc. Intel and
Xeon are registered trademarks of Intel Corporation in the U.S. and other countries. Microsoft,
Windows, and Windows Server are either trademarks or registered trademarks of Microsoft Corporation
in the United States and/or other countries. Other trademarks and trade names may be used in this
document to refer to either the entities claiming the marks and names or their products. Dell disclaims
proprietary interest in the marks and names of others.
June 2013 | Rev 1.0
Contents
Executive Summary
Introduction
The PowerEdge R720
Easy to Extend, Easy to Configure
Intel Xeon Phi Coprocessors
Why Use the PowerEdge R720 for Heterogeneous Computing?
Intel Xeon Processors
Why Use Bright Cluster Manager for Intel Xeon Phi Coprocessor-Based HPC?
Intel Xeon Phi Coprocessor-Based HPC Features
Overview of the Intel Xeon Phi Coprocessor 7120P
Test Cluster Configuration
Benchmarks
Results
Memory Bandwidth
High Performance Linpack (HPL)
Summary and Performance/Watt Comparison with CPU only
Conclusion
References
Tables
Table 1. Key Features of the Intel Xeon Phi Coprocessor 7120P
Table 2. Compute node configuration detail
Table 3. Benchmarks details
Table 4. Results of the performance comparison with CPUs
Table 5. Performance/watt comparison
Figures
Figure 1. The front and back view of PowerEdge R720
Figure 2. The PowerEdge R720 server
Figure 3. Host-to-Device (H2D) and Device-to-Host (D2H) BW for the 7120P Coprocessor (MIC-0)
Figure 4. The measured Device-to-Device bandwidth of the 7120P coprocessor
Figure 5. HPL performance results with 7120P coprocessors (percent HPL efficiency)
Figure 6. HPL power consumption results with 7120P coprocessors
Executive Summary
Organizations that leverage heterogeneous computing architectures are likely to have two key
questions about the latest Intel® Xeon Phi™ Coprocessor: How much performance gain can be expected
from the coprocessor, and what is the power efficiency?
To answer these questions, we measured the performance gain and power efficiency of the Intel Xeon
Phi Coprocessor 7120P in the Dell™ PowerEdge™ R720 server. Our analysis used standard and synthetic
benchmarks that model real-world applications. In this paper, we present and analyze the results and
highlight key points.
On a two-node cluster, the PowerEdge R720 with the coprocessor showed a 6.6X speedup and a 2.5X
improvement in energy efficiency on High Performance Linpack (HPL) compared to a CPU-only
configuration. HPL is a common benchmark for high-performance computing (HPC) applications.
Introduction
This white paper explores the performance gain and power efficiency achieved with the Dell
PowerEdge R720 server when accelerated by the Intel Xeon Phi Coprocessor. The following is a brief
overview of the two key technologies explored in this paper—the PowerEdge R720 server series and the
Intel Xeon Phi Coprocessor.
The PowerEdge R720
The Dell PowerEdge R720 is Dell’s 12th-generation 2U, two-socket server, designed to run complex
workloads using highly scalable memory, I/O capacity, and flexible network options. The system
features the Intel® Xeon® processor E5-2600 product family, up to 24 DIMMs, up to sixteen 2.5-inch
SATA/SSD internal drives for a maximum of 24TB of internal storage, PCI Express® (PCIe) 3.0
enabled expansion slots, and a choice of NIC technologies.
The PowerEdge R720 is a general-purpose platform with highly expandable memory (up to 768GB) and
impressive I/O capabilities to match. The R720 can readily handle demanding workloads, such as data
warehouses, e-commerce, virtual desktop infrastructure (VDI), databases, and high-performance
computing (HPC). 1
The PowerEdge R720 is ideal for customers who need to balance high-performance requirements,
including scientific research, oil and gas, life sciences, healthcare, and electronic design automation
(EDA), with resource limitations.
Figure 1. The front and back view of PowerEdge R720
1 Source: Dell PowerEdge R720 and R720xd Technical Guide. Download at: http://i.dell.com/sites/content/sharedcontent/data-sheets/en/Documents/dell-poweredge-r720-r720xd-technical-guide.pdf
Easy to Extend, Easy to Configure
The PowerEdge R720 enables users to mix and match compute and storage resources to
find the right combination for particular resource-intensive workloads. With its extendable
architecture, the PowerEdge R720 platform allows organizations to configure storage-dense or
compute-dense servers in a general-purpose chassis design, and to repurpose hardware based on
workload needs. As demands change, the platform can be reconfigured or scaled out, extending the
life and value of the organization’s investments.
Intel Xeon Phi Coprocessors
Intel Xeon Phi Coprocessors are based on the Intel Many Integrated Core (MIC) Architecture and use the
familiar x86 standard programming model. They extend hardware support to higher degrees of
parallelism with power savings, and share parallel processing with general-purpose processors (CPUs).2
They also serve applications in computational finance, computational physics, molecular dynamics,
seismic processing, ray tracing, and finite element analysis. Phi Coprocessors have shown excellent
performance in compute-intensive applications requiring double-precision floating-point operations.
They are typically targeted at the broader supercomputing market but can be used for other workloads.
Intel Xeon Phi Coprocessors are well suited to today’s most demanding high-performance computing
workloads.
Why Use the PowerEdge R720 for Heterogeneous Computing?
The PowerEdge R720 server is designed to offer an ideal computing platform for compute-intensive
applications. Each compute node can accommodate up to two Intel Xeon Phi Coprocessors 7120P,
each with an x16 PCIe Gen2 connection, two Intel® Xeon® E5-2600 series processors with up to eight
cores each, 24 DIMM slots, and up to sixteen 2.5-inch SATA/SSD drives for internal storage.
Figure 2 shows a PowerEdge R720 compute node.
2 Source: http://software.intel.com/en-us/mic-developer.
Figure 2. The PowerEdge R720 server
The PowerEdge R720 allows organizations to match the system architecture to the targeted workloads.
With two of the fastest server processors available, and up to two of the most advanced coprocessors
available in one server, users can potentially speed up processing by up to an order of magnitude.
The server can be configured with both CPUs and coprocessors, and it can be loaded with more local
storage and less compute, or vice versa, depending on the demands of specific workloads or applications.
Intel Xeon Processors
The Intel Xeon E5-2600 processor family dramatically reduces I/O latency with Intel Integrated I/O,
which eliminates data bottlenecks, streamlines operations, and increases agility. The E5-2670
processor consumes 115W, but offers a substantial performance gain over the Intel Xeon 5500-series
processor. In addition, it delivers the following advantages:
• Up to 8 cores per processor
• Intel Turbo Boost 2.0 technology
• Dual QPI links
• The Intel Sandy Bridge micro-architecture
Why Use Bright Cluster Manager for Intel Xeon Phi Coprocessor-Based HPC?
Our test environment uses Bright Cluster Manager® (BCM). BCM is one of the leading feature-rich
cluster management software platforms. It removes the complexity from provisioning, management,
and monitoring of HPC clusters. With BCM, an administrator can easily and quickly install, manage, and
monitor multiple clusters simultaneously from a single GUI.
BCM includes powerful management and monitoring capabilities that leverage functionality in Intel
Xeon Phi coprocessors to take maximum control and gain insight into their status and activity over
time. Supported metrics include temperatures, memory usage, network, PSU voltages and currents,
and system LED states.
BCM allows alerts and actions to be triggered automatically when coprocessor metric thresholds
are exceeded. Such rules are configurable to suit the user’s requirements, and any built-in cluster
management command, Linux command, or shell script can be used as an action. For example, if a user
would like to automatically receive an email and shut down a node when a coprocessor’s temperature
exceeds a set value, this can easily be configured in BCM.
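The threshold-and-action behavior described above can be sketched generically in Python. This is not BCM's interface or configuration syntax; the class, function, metric name (mic0_temperature_c), node name, and the 85-degree threshold are all hypothetical and chosen only to illustrate the idea of a rule that runs a list of actions when a sampled metric exceeds a set value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ThresholdRule:
    """Run every action when the named metric exceeds the threshold."""
    metric: str                           # hypothetical metric name
    threshold: float                      # hypothetical limit
    actions: List[Callable[[str], None]]  # each action receives the node name


def evaluate(rule: ThresholdRule, node: str, sample: Dict[str, float]) -> bool:
    """Return True and run the actions if the sampled metric is above the threshold."""
    value = sample.get(rule.metric)
    if value is None or value <= rule.threshold:
        return False
    for action in rule.actions:
        action(node)
    return True


# Stand-in actions that only record what a real system would do.
log: List[str] = []
rule = ThresholdRule(
    metric="mic0_temperature_c",
    threshold=85.0,
    actions=[
        lambda node: log.append(f"email admin about {node}"),
        lambda node: log.append(f"shut down {node}"),
    ],
)

fired = evaluate(rule, "node001", {"mic0_temperature_c": 91.5})
print(fired, log)
```

In a real deployment the actions would be the cluster management commands or shell scripts mentioned above rather than list appends.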
Intel Xeon Phi Coprocessor-Based HPC Features
BCM offers the following features specifically for Intel Xeon Phi Coprocessor-based HPC:
• Ease of setup - BCM provides tools to install and configure coprocessors as accelerators or as
compute nodes. The coprocessors are configured and treated as first-class compute devices,
giving the same level of monitoring and configuration as all compute nodes.
• Ease of use - BCM packages drivers, runtime, SDK, OFED, and flash utilities to make
provisioning and integration of compute nodes with Intel Xeon Phi coprocessors very easy.
• Ensured kernel compatibility - BCM always compiles the driver at boot time, against the
running kernel, to ensure compatibility. The driver compilation takes about 15 seconds.
• Detailed monitoring - BCM captures several important performance metrics, which can be
selected and displayed by users; examples include the temperature, memory usage, and power
information in the cluster.
• Management of Intel Xeon Phi Coprocessors - BCM allows users to set alerts to be triggered
automatically based on user-defined criteria. These triggers can be integrated with cluster
management commands, command-line tools, and Linux scripts. These features can be extended to
cluster-level status and health management with easy visualization of the entire cluster.3
3 For more information on these BCM monitoring and management capabilities, see http://info.brightcomputing.com/webinarmanaging-intel-xeon-phi/
Overview of the Intel Xeon Phi Coprocessor 7120P
The Intel Xeon Phi coprocessors were introduced in November 2012. The 7120P coprocessor is the
high-end coprocessor of the Intel Xeon Phi series, which is specifically targeted toward the HPC market.
Compared to a multi-core processor such as the Intel Xeon, the Phi coprocessor has many lower-power
cores and wider vector processing units. Intel implemented the following features to achieve the main
design goal, “more performance and more efficiency”:
• Highly Parallel. The coprocessor supports three types of parallelism (data parallel, thread
parallel, and process parallel) and delivers higher aggregate performance and memory
bandwidth.
• Highly Programmable. The coprocessor is more than an accelerator because it is fully
addressable and independent in the cluster. The Intel Xeon Phi coprocessor is fully supported by
Intel Cluster Studio XE and is programmable in standard C, C++, and Fortran.
A comparison of the key features is shown in Table 1.
Table 1. Key Features of the Intel Xeon Phi Coprocessor 7120P
| Feature | Value |
| --- | --- |
| SKU | 7120P |
| Architecture Codename | Knights Corner |
| Instructions | 512-bit SIMD |
| Cores and Frequency | 61 cores, 1.23GHz |
| Memory and Speed | 16 GB, 5.5GHz GDDR5 |
| Power | 300W |
| Performance (Single Precision) | 2.4 TFLOPS |
| Performance (Double Precision) | 1.2 TFLOPS |
More detailed information about how the double-precision performance number shown above is
determined is available online.4
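One common way to arrive at peak figures of this kind is cores × clock × SIMD lanes × floating-point operations per lane per cycle. The sketch below applies that formula to the Table 1 values (61 cores, 1.23GHz, 512-bit SIMD). The factor of 2 operations per lane per cycle (a fused multiply-add) is our assumption, the function name is illustrative, and the referenced blog post may derive the number differently.

```python
def peak_gflops(cores: int, ghz: float, simd_bits: int, element_bits: int,
                flops_per_lane: int = 2) -> float:
    """Theoretical peak in GFLOPS.

    flops_per_lane=2 assumes one fused multiply-add per lane per cycle
    (an assumption, not a figure stated in this paper).
    """
    lanes = simd_bits // element_bits  # elements processed per vector instruction
    return cores * ghz * lanes * flops_per_lane


# Values from Table 1: 61 cores, 1.23GHz, 512-bit SIMD.
dp = peak_gflops(61, 1.23, 512, 64)  # 64-bit double-precision elements
sp = peak_gflops(61, 1.23, 512, 32)  # 32-bit single-precision elements

print(f"double precision: {dp / 1000:.1f} TFLOPS")
print(f"single precision: {sp / 1000:.1f} TFLOPS")
```

Under these assumptions the sketch prints 1.2 TFLOPS and 2.4 TFLOPS, matching the two performance rows of Table 1.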
4 Please see the blog posting by Dr. Mark Fernandez at http://dell.to/YjFuN0
Test Cluster Configuration
The test cluster consisted of two PowerEdge R720 compute nodes with two Intel Xeon Phi Coprocessors
7120P each. Each PowerEdge R720 had a standard dual-socket Sandy Bridge EP motherboard with two Intel
Xeon E5-2670 CPUs @ 2.6GHz. The memory in each PowerEdge R720 was configured as two DIMMs per
channel and four channels per processor, so 16 of the 24 DIMM slots were
populated with 8-GB DIMMs for a total of 128 GB of memory @ 1600MHz. There were four internal drives
in each PowerEdge R720 server. The PCIe connections to the coprocessors were internal.
The test cluster used BCM 6.1 as clustering software. BCM 6.1 is based on Red Hat Enterprise Linux
(RHEL) 6.3. Table 2 gives the compute node configuration detail.
Table 2. Compute node configuration detail
| Component | Value |
| --- | --- |
| Server | PowerEdge R720 |
| Architecture | Sandy Bridge EP |
| Processor | Dual Intel Xeon E5-2670 @ 2.6GHz |
| Memory | 128GB @ 1600MHz |
| Infiniband | Mellanox ConnectX-3 FDR Adapter (CX353A) |
| Cluster size | 2 Servers |
| MPSS version | 2.1.6720-13 |
| Coprocessor Power Consumption | 300W |
| OS | RHEL 6.3 |
Benchmarks
The table below describes the benchmarks 5 used in this study.
Table 3. Benchmarks details
| Benchmark | Domain | Benchmark data set |
| --- | --- | --- |
| SHOC_download | Main memory to coprocessor bandwidth | NA |
| SHOC_readback | Coprocessor to main memory bandwidth | NA |
| Stream | Memory bandwidth | NA |
| High Performance Linpack | Compute Intensive | N=165888, NB=1280 |
Results
The performance results are shown in this section. Our analysis suggests that the PowerEdge R720 and
7120P coprocessor combination can drive significant gains in terms of both performance and power
efficiency for demanding scientific applications.
Memory Bandwidth
Functionally, the coprocessors are used as application accelerators. When processing
compute-intensive applications, the data is first transferred from the host server to the coprocessor,
and the results produced are sent back to the host. Consequently, high performance depends on fast
data transfer between the host memory and the coprocessor (device).
Three bandwidths are of interest in any accelerated host server:
• Host-to-device bandwidth - The rate at which data can be transferred from the host server
memory to the coprocessor memory via the CPU. Measured by SHOC_download.
• Device-to-host bandwidth - The rate at which data can be returned from the coprocessor
memory to the host memory. Measured by SHOC_readback.
• Device-to-device memory bandwidth - The rate at which data transfers take place inside the
coprocessor. ECC can affect the device-to-device memory bandwidth. For this study, ECC was
enabled. Measured by Stream.
Note that the coprocessors (devices) in the system are usually labeled MIC-#, where # is the order
in which the coprocessors are listed by the ‘micinfo’ command.
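Each of the three bandwidths above is a ratio of bytes moved to elapsed time. As a rough illustration of that measurement, the Python sketch below times a memory-to-memory copy on the host and counts one read plus one write per byte, in the style of a Stream copy kernel. It is not the SHOC or Stream code used in this study and it does not touch coprocessor memory; the function name, buffer size, and repeat count are our assumptions, and the result is only indicative.

```python
import time


def copy_bandwidth_gb_per_s(n_bytes: int, repeats: int = 5) -> float:
    """Best-of-N bandwidth of a host memory copy, in GB/s (1 GB = 1e9 bytes)."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment copies the buffer contents
        best = min(best, time.perf_counter() - start)
    # Count bytes read from src plus bytes written to dst.
    return 2 * n_bytes / best / 1e9


if __name__ == "__main__":
    # 64 MiB buffer: a hypothetical size, chosen to be larger than typical CPU caches.
    print(f"{copy_bandwidth_gb_per_s(64 * 2**20):.1f} GB/s")
```

Taking the best of several repeats reduces the influence of one-off effects such as first-touch page allocation.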
5 Source: http://ft.ornl.gov/~kspafford/misc/shoc.pdf
Figure 3. Host-to-Device (H2D) and Device-to-Host (D2H) BW for the 7120P Coprocessor (MIC-0)
Figure 3 shows the two bandwidths for the 7120P coprocessor on a PowerEdge R720 server. As
described in the previous sections, the PowerEdge R720 has two 7120P devices (MIC-0 and MIC-1). The
MIC-0 bandwidths to each of the two CPUs are shown in Figure 3. The host-to-device bandwidth between
MIC-0 and both CPU 0 and CPU 1 was measured at about 6.7 GB/sec, while the device-to-host bandwidth
was measured at 6.8 GB/sec for both CPUs. At this time, the 7120P supports the PCIe Gen2 standard only.
Figure 4 shows the device-to-device bandwidth measured on MIC-0 with and without turbo; the
improvement in the internal device-to-device bandwidth is about 2 percent.
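The host-to-device and device-to-host figures reported above can be put in context against the raw rate of an x16 PCIe Gen2 link. The sketch below assumes the standard Gen2 parameters of 5 GT/s per lane with 8b/10b encoding and ignores packet and protocol overhead, so it gives an upper bound rather than an achievable rate; the function name is illustrative.

```python
def pcie_peak_gb_per_s(gt_per_s: float, lanes: int, payload_fraction: float) -> float:
    """Peak one-direction PCIe bandwidth in GB/s, before packet overhead."""
    # Each transfer carries one bit per lane; divide by 8 to convert bits to bytes.
    return gt_per_s * payload_fraction * lanes / 8.0


# PCIe Gen2 x16: 5 GT/s per lane, 8 payload bits in every 10 transferred bits.
GEN2_X16 = pcie_peak_gb_per_s(5.0, 16, 8 / 10)

# Measured values reported in this section.
for label, measured in (("host-to-device", 6.7), ("device-to-host", 6.8)):
    print(f"{label}: {measured} GB/s is {measured / GEN2_X16:.0%} of {GEN2_X16:.1f} GB/s")
```

Under these assumptions the link peaks at 8 GB/sec per direction, so the measured transfers reach roughly 84 to 85 percent of the raw link rate.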
Figure 4. The measured Device-to-Device bandwidth of the 7120P coprocessor
High Performance Linpack (HPL)
High Performance Linpack (HPL) is a dense linear system solver that has historically been used to
benchmark HPC systems. HPL was run with a problem size of N=116736 (NB=1280) for the single node and
N=165888 (NB=1280) for the cluster runs. The results of the HPL performance are shown in Figure 5,
with the single node on the left and the cluster on the right. In each case, the HPL efficiency as a
percentage of the theoretical performance is displayed on the bar. Compared to the CPUs alone, the
acceleration with 7120Ps was more than 6X.
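The two problem sizes can be related to the memory in each node. HPL factors an N x N matrix of double-precision values, and a common rule of thumb is to pick N so that the matrix fills roughly 80 percent of available memory. The sketch below checks the reported sizes against that rule, assuming the matrix is held in host memory and treating the 128 GB per node from Table 2 as 128 GiB; both are our assumptions, as is the function name.

```python
def hpl_matrix_gib(n: int, bytes_per_element: int = 8) -> float:
    """Size in GiB of an n x n matrix of 8-byte (double-precision) values."""
    return n * n * bytes_per_element / 2**30


NODE_MEMORY_GIB = 128  # per node, from Table 2 (treated as GiB by assumption)

# (label, N, number of nodes) from the text above.
for label, n, nodes in (("single node", 116736, 1), ("cluster", 165888, 2)):
    size = hpl_matrix_gib(n)
    share = size / (nodes * NODE_MEMORY_GIB)
    print(f"{label}: N={n} needs {size:.1f} GiB, {share:.0%} of host memory")
```

Under these assumptions both runs use about 79 to 80 percent of host memory, which is consistent with the usual sizing guidance.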
Figure 5. HPL performance results with 7120P coprocessors (percent HPL efficiency)
On a single node, with CPUs only, the PowerEdge R720 achieved 314 GFLOPS. With two 7120Ps, it
achieved about 2.094 TFLOPS, so the two 7120Ps provide a 6.7X performance increase. Similarly, for
the cluster, the combined sustained performance of the four CPUs was 625 GFLOPS. When this was
accelerated using four 7120P coprocessors, the performance increased to 4111.7 GFLOPS, an
increase of 6.6X.
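The speedup factors quoted above are the accelerated result divided by the CPU-only result. A minimal sketch of that arithmetic, using only the GFLOPS values reported in this section (the function name is illustrative):

```python
def speedup(accelerated_gflops: float, baseline_gflops: float) -> float:
    """Ratio of accelerated to CPU-only sustained HPL performance."""
    return accelerated_gflops / baseline_gflops


single_node = speedup(2094.0, 314.0)  # two 7120Ps versus two CPUs
cluster = speedup(4111.7, 625.0)      # four 7120Ps versus four CPUs

print(f"single node: {single_node:.1f}X")
print(f"cluster:     {cluster:.1f}X")
```

Rounded to one decimal place, the ratios come out at 6.7X and 6.6X, matching the figures in the text.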
In general, the HPL efficiency of CPU-only systems is high: 94.1 percent and 93.9 percent in our tests.
The drop in overall HPL efficiency of the heterogeneous CPU and coprocessor system compared to a
CPU-only system was expected.
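HPL efficiency is sustained performance divided by theoretical peak. The sketch below estimates the CPU-only peak from the node configuration (two 8-core E5-2670 CPUs at 2.6GHz per node) and compares it with the reported results. The figure of 8 double-precision operations per core per cycle is our assumption for Sandy Bridge with AVX, not a value stated in this paper, and the function name is illustrative.

```python
def cpu_peak_gflops(sockets: int, cores_per_socket: int, ghz: float,
                    flops_per_cycle: int = 8) -> float:
    """Theoretical peak of one node in GFLOPS.

    flops_per_cycle=8 assumes a 4-wide add plus a 4-wide multiply per cycle
    (an assumption about the CPU, not a figure from this paper).
    """
    return sockets * cores_per_socket * ghz * flops_per_cycle


node_peak = cpu_peak_gflops(sockets=2, cores_per_socket=8, ghz=2.6)

# Sustained CPU-only HPL results reported in this section.
single_eff = 314.0 / node_peak
cluster_eff = 625.0 / (2 * node_peak)

print(f"peak per node: {node_peak:.1f} GFLOPS")
print(f"single node efficiency: {single_eff:.1%}")
print(f"cluster efficiency:     {cluster_eff:.1%}")
```

Under these assumptions the estimates land within about half a percentage point of the reported 94.1 and 93.9 percent; the small single-node difference is plausibly due to rounding in the reported GFLOPS value.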
The results of the HPL power consumption tests are shown in Figure 6, with the single-node results
on the left and the cluster results on the right. The power consumption of the
CPU-only system was lower than that of the CPU + coprocessor systems. The results show the total
power consumption of the compute nodes (PowerEdge R720). Numbers on the bars represent the ratio of
the CPU + coprocessor systems to the CPU-only case. The power consumption of a single-node CPU-only
system was about 433 watts. When accelerated by two 7120P coprocessors, it increased to 1120 watts,
an increase of 2.6X. Similarly, for the cluster case, the CPU-only power consumption was about
861 watts, and with four 7120P coprocessors used for acceleration it increased to 2243 watts, also
an increase of 2.6X.
Figure 6. HPL power consumption results with 7120P coprocessors
Summary and Performance/Watt Comparison with CPU only
Table 4 shows the summary of the performance acceleration due to 7120P coprocessors on the dual-node cluster. The HPL results are for dual nodes with two coprocessors per node.
Table 4. Results of the performance comparison with CPUs
| Application | CPU-Only | 7120P | Delta |
| --- | --- | --- | --- |
| HPL | 625 GFLOPS | 4111.7 GFLOPS | 6.6X increase |
Table 5 shows the 7120P performance/watt data of HPL on the dual-node cluster, compared to a CPU-only cluster.
Table 5. Performance/watt comparison
| Application | CPU-Only | 7120P | Delta |
| --- | --- | --- | --- |
| HPL | 0.726 GFLOPS/watt | 1.833 GFLOPS/watt | 2.5X increase |
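The performance/watt values in Table 5 follow from the cluster HPL results and the cluster power measurements reported earlier (625 GFLOPS at 861 watts CPU-only, 4111.7 GFLOPS at 2243 watts with coprocessors). The sketch below reproduces the arithmetic and also shows that the efficiency gain equals the speedup divided by the power ratio; the variable and function names are illustrative.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Sustained HPL performance per watt of node power."""
    return gflops / watts


# Cluster (dual-node) results reported in this paper.
cpu_only = gflops_per_watt(625.0, 861.0)
with_phi = gflops_per_watt(4111.7, 2243.0)

speedup = 4111.7 / 625.0      # performance ratio
power_ratio = 2243.0 / 861.0  # power consumption ratio
gain = with_phi / cpu_only    # equals speedup / power_ratio

print(f"CPU-only: {cpu_only:.3f} GFLOPS/watt")
print(f"7120P:    {with_phi:.3f} GFLOPS/watt")
print(f"gain:     {gain:.1f}X ({speedup:.1f}X speedup / {power_ratio:.1f}X power)")
```

The output reproduces 0.726 and 1.833 GFLOPS/watt and a 2.5X gain: the 6.6X speedup is partly offset by the 2.6X increase in power consumption.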
Conclusion
The Intel Xeon Phi Coprocessor 7120P demonstrated substantial performance and power-efficiency
gains when compared to the CPUs only. When two coprocessors were used per node on the HPL
benchmark, performance increased more than sixfold and performance per watt improved more than
twofold compared to CPUs only.
A key finding was that the Intel Xeon Phi Coprocessor 7120P works seamlessly with the PowerEdge R720
server to enhance performance and improve overall energy efficiency, resulting in a powerful,
easy-to-use, and energy-efficient HPC platform.
References
1. Intel Xeon Processor E5 Family (Servers)
http://ark.intel.com/products/family/59138/Intel-Xeon-Processor-E5-Family/server
2. Intel Xeon Processor E5 Family Specifications
https://www-ssl.intel.com/content/www/us/en/processors/xeon/xeon-e5-family-specupdate.html
3. Intel Xeon Phi Coprocessor
http://software.intel.com/en-us/mic-developer
4. Bright Cluster Manager
http://www.brightcomputing.com/resources/Bright-Cluster-Manager-Brochure.pdf