Intel® Parallel Studio XE 2016 Beta Program: What's New

Intel® Parallel Studio XE 2016 Beta Program:
What's New
Contents
1.
Introduction ........................................................................................................................................ 3
2.
What’s New in Intel® Parallel Studio XE 2016 Beta ...................................................................... 3
2.1.
Intel® Parallel Studio XE 2016 Beta, Composer Edition ...................................................... 3
2.1.1.
New Directory Layout ........................................................................................................ 3
2.1.2.
OpenMP* 4.0 and 4.1 ........................................................................................................ 3
2.1.3.
C and C++ Features ........................................................................................................... 4
2.1.4.
Fortran Features ................................................................................................................. 4
2.1.5.
Intel® Integrated Performance Primitives 9.0 Beta New Features ............................. 4
2.2.
Intel® Cluster Checker 3.0 Beta ................................................................................................ 4
2.2.1.
Running Intel Cluster Checker ......................................................................................... 5
2.2.2.
Sample output .................................................................................................................... 5
2.3.
Intel® Data Analytics Acceleration Library (Intel® DAAL) 2016 Beta .................................. 5
2.3.1.
Data mining and analysis .................................................................................................. 5
2.3.2.
Supervised and unsupervised machine learning ......................................................... 5
2.3.3.
Support for local and distributed data sources ............................................................ 6
2.3.4.
Data compression and decompression ......................................................................... 6
2.3.5.
Data serialization and deserialization............................................................................. 6
2.4.
Intel® Math Kernel Library (Intel® MKL) 11.3 Beta ................................................................ 6
2.4.1.
MPI wrappers ...................................................................................................................... 6
2.4.2.
Inspector-executor API for Sparse BLAS level 2 and 3 functions ............................. 6
2.4.3.
Batch GEMM operations ................................................................................................... 6
2.4.4.
New random number generators .................................................................................... 6
2.4.5.
Other new features ............................................................................................................ 6
2.5.
Intel® VTune™ Amplifier XE 2016 Beta ................................................................................... 6
2.5.1.
Enhanced Intel® OpenMP* Analysis ................................................................................ 7
1
2.5.2.
Intel® MPI Library and OpenMP Multi-rank Analysis on a Compute Node ............ 10
2.5.3.
GPU Support Improvements.......................................................................................... 12
2.5.4.
General Exploration analysis with confidence indication ......................................... 15
2.5.5.
New “Platform” tab replacing “Tasks and Frames” tab ............................................. 16
2.5.6.
Driver-less Event-Based Sampling collection ............................................................. 16
2.5.7.
Timeline “Super Tiny” bird’s-eye view ......................................................................... 17
2.5.8.
New OS and IDE support ................................................................................................ 17
2.6.
Intel® Inspector XE 2016 Beta ............................................................................................... 18
2.6.1.
2.7.
Intel® Advisor XE 2016 Beta ................................................................................................... 18
2.7.1.
2.8.
Support for latest OSes .................................................................................................. 18
Vectorization Advisor ...................................................................................................... 18
Intel® MPI Library 5.1 Beta ...................................................................................................... 23
2.8.1.
Usability Improvements.................................................................................................. 23
2.8.2.
Intel® True Scale Fabric Support ................................................................................... 24
2.9.
Intel® Trace Analyzer and Collector 9.1 Beta ...................................................................... 24
2.9.1.
MPI Performance Snapshot ........................................................................................... 24
2.9.2.
Improved Hybrid Profiling Support .............................................................................. 25
3.
Additional Information.................................................................................................................... 26
4.
Legal Information............................................................................................................................. 26
2
1. Introduction
Intel® Parallel Studio XE 2016 Beta combines Intel's industry-leading C/C++ and Fortran
compilers, performance and parallel libraries, error checking and code robustness tools, and
performance profiling tools into a single suite offering. This helps boost application
performance and increase the code quality, security, and reliability needed by highperformance computing and enterprise applications. At the same time, the single Intel®
Parallel Studio XE suite for Windows* or Linux* eases the procurement of all the necessary
tools and simplifies the transition from multicore to manycore processors in the future. Visit
our website to learn more.
2. What’s New in Intel® Parallel Studio XE 2016 Beta
2.1.
Intel® Parallel Studio XE 2016 Beta, Composer Edition
The Intel Parallel Studio XE 2016 Beta, Composer Edition, includes Intel® C++ Compiler XE
16.0 Beta, Intel® [Visual] Fortran Compiler XE 16.0 Beta, debuggers, Intel® Math Kernel Library
11.3 Beta, Intel® Integrated Performance Primitives 9.0 Beta, and Intel® Threading Building
Blocks 4.2 Update 3. Windows*, Linux*, and OS X* are supported. This section describes the
new features of Intel Parallel Studio XE 2016 Beta, Composer Edition, we are seeking
feedback on. To see other changes, please refer to the product release notes.
2.1.1. New Directory Layout
The Intel® Parallel Studio XE 2016 Beta includes a redesigned directory layout. The majority of
the layout changes are within the compiler and libraries component from the Composer
Edition. These changes help reduce duplication of largely unchanged and common content
between Intel® Parallel Studio XE 2016 product update releases and other Intel® Software
Development Products, and help increase product stability and support. Please refer to the
product release notes for additional information.
2.1.2. OpenMP* 4.0 and 4.1
Intel Parallel Studio XE 2016 Beta, Composer Edition, extends OpenMP* 4.0 feature support
with added support of the following:




simdlen and safelen for loops
Array reductions (Fortran)
User-defined reductions (C/C++)
Omp-simd collapse(N) clause
Additionally, Intel Parallel Studio XE 2016 Beta, Composer Edition, presents initial OpenMP*
4.1 support with partial feature support (from Technical Report 3) for the following:
3


Non-structured data allocation with omp target [enter | exit ] data
Asynchronous offload with the nowait clause on omp task


Dependent offload with the depend clause on omp task
map clause modifiers always and delete
2.1.3. C and C++ Features
The Intel® C++ compiler includes expanded language Standards support. For C customers,
there is new C11 support for Unicode strings, C11 anonymous unions and keywords _Alignas,
_Alighof, _Static_assert, _Thread_local, _Noreturn, and _Generic. For C++ customers, new
C++14 features support includes: generic lambdas, generalized lambda captures, digit
separators, the [[deprecated]] attribute, function return type deduction and member
initializers and aggregates. Additional feature includes support of the feature test macros
(Technical Report SD-6: SG10 Feature Test Recommendations).
2.1.4. Fortran Features
Our Fortran customers will be delighted with new feature support for Submodules and
enhancements in uninitialized variable run-time checking with [Q]init that extends to local,
automatic, and allocated variables of intrinsic numeric type. The Fortran compiler includes
additional Fortran Standard support for IMPURE ELEMENTAL (F2008) and further C
Interoperability (F2015) from Technical Specification 29113.
2.1.5. Intel® Integrated Performance Primitives 9.0 Beta New Features
Intel® Integrated Performance Primitives users can evaluate many new features including new
APIs for using external threading for more effective thread support and APIs for external
memory allocation that can reduce the memory allocation for different IPP function calls. The
library includes CPU dispatcher improvements and new optimized cryptography functions to
support SM2/SM3/SM4 algorithm. Users are encouraged to evaluate our new custom dynamic
library building tool for creating a custom .DLL or shared library (.so) from IPP static libraries
which enables linking with IPP dynamic libraries without the requirement of redistributing
them. Please refer to the release notes for more information.
2.2.
Intel® Cluster Checker 3.0 Beta
Intel® Cluster Checker verifies the configuration and performance of Linux-based clusters and
checks compliance with the Intel® Cluster Ready architecture specification. If issues are found,
Intel® Cluster Checker diagnoses the problems and may provide recommendations on how to
repair the cluster. Intel® Cluster Checker has the following features: Dynamic detection and
diagnosis of problems with cluster configuration/performance with severity and confidence
levels. Tests include, but are not limited to: basic node health checks, disk tests, network
checks, environment wellness, MPI functionality, and single node and cluster performance
tests. Operational modes include on-demand or asynchronous data collection.
4
2.2.1. Running Intel Cluster Checker
Running Intel Cluster Checker is easy!
1. Provide the clck-collect binary with a list of nodes in your cluster
2. Then run clck-collect -f nodefile -a
3. Wait for the tests to execute, then run clck to get a report on the health of your cluster
2.2.2. Sample output
This shows I have a difference in firmware in an Intel Xeon Phi card:
6. The SMC boot loader version is inconsistent for the device
mic0. 9.22% of devices also have version 1.5.3797.
[node(s),severity,confidence,remedy]
[compute-0-1.local, 50, 1]
[compute-0-29.local, 50, 1]
[compute-0-5.local, 50, 1]
This shows I'm not getting expected performance from one of my compute nodes:
7. The latest DGEMM result of 1021.81 GFLOPS is only 87.72
percent of the theoretical peak of 1164.80
GFLOPS
[node(s): compute-0-29.local]
[severity = 3]
[confidence = 99]
Check out the quick start guide for more details.
2.3.
Intel® Data Analytics Acceleration Library (Intel® DAAL) 2016 Beta
Intel® Data Analytics Acceleration Library (Intel® DAAL) is a new component of Intel® Parallel
Studio XE. It is a library of optimized building blocks covering all stages of data analytics: data
acquisition from a data source, preprocessing, transformation, data mining, modeling,
validation, and decision making. It provides both C++ and Java programming language API,
and has many features including:
2.3.1. Data mining and analysis
These are algorithms for computing correlation distance and Cosine distance, PCA, Matrix
decomposition (SVD, QR, Cholesky), statistical moments, variance-covariance matrices,
univariate and multivariate outlier detection, and association rule mining.
2.3.2. Supervised and unsupervised machine learning
These algorithms include linear regressions, naïve Bayes classifier, AdaBoost, LogitBoost, and
BrownBoost classifiers, SVM, K-Means clustering, Expectation Maximization (EM) for Gaussian
Mixture Models (GMM), and so on.
5
2.3.3. Support for local and distributed data sources
Intel DAAL supports different data sources, including in-file and in-memory CSV, MySQL,
HDFS, and the Resilient Distributed Dataset (RDD) objects of Apache Spark*.
2.3.4. Data compression and decompression
Intel DAAL utility routines provide efficient implementations for ZLIB, LZO, RLE, and BZIP2.
2.3.5. Data serialization and deserialization
Intel DAAL also provides data serialization and deserialization mechanisms for efficient data
communication.
2.4.
Intel® Math Kernel Library (Intel® MKL) 11.3 Beta
Intel MKL 11.3 Beta introduces the following new features:
2.4.1. MPI wrappers
MPI wrappers provide ABI compatibility between different MPI implementations. Users are
now able to use Intel MKL seamlessly with most MPI implementations.
2.4.2. Inspector-executor API for Sparse BLAS level 2 and 3 functions
The new 2-stage API provides performance benefits to some applications (e.g. iterative
solvers), where a matrix structure analysis done in the first (inspector) stage allows better
optimizations for operations in the subsequent (executor) stage. Refer to this online article for
full documentation of this new feature.
2.4.3. Batch GEMM operations
Newly introduced ?GEMM_BATCH and (C/Z)GEMM3M_BATCH functions allow applications to
perform multiple independent matrix-matrix multiply operations in parallel.
2.4.4. New random number generators
Introduced two new random number generators, ARS-5 (a pseudorandom number generator
based on the Intel® AES-NI instruction set) and Philox4x32-10.
2.4.5. Other new features
Cluster components are now available for OS X*. A new C-language version of the Intel MKL
reference manual is now available.
2.5.
Intel® VTune™ Amplifier XE 2016 Beta
The Intel VTune Amplifier XE 2016 Beta provides an integrated performance analysis and
tuning environment with command line and graphical user interface that helps you analyze
code performance on systems with IA-32 or Intel® 64 architectures. To learn more about this
product, see the VTune Amplifier XE 2016 Beta Documentation item in the Start menu
6
program folder or use the search feature of Windows 8.x to locate it. This section highlights
some of the features of the Intel VTune Amplifier XE 2016 Beta software.
2.5.1. Enhanced Intel® OpenMP* Analysis
Potential Gain expansion by parallelization inefficiencies representing their wall time cost
CPU time-based classification of Spin and Overhead time in OpenMP runtime does not reveal
the elapsed-time impact of a parallel region inefficiency because it depends on the number of
working threads. Intel VTune Amplifier XE’s new per-OpenMP region metrics that are based
on CPU time are now normalized by the number of threads in the region and represented as
an expansion of the “potential gain” metric.
Besides reporting the potential gain metric in absolute elapsed time, the Intel VTune Amplifier
XE display breaks down the impact of various issues by percentage of total application
elapsed time.
7
Precise trace-based imbalance calculation that is especially useful for profiling of small
region instances
Imbalance of working threads on barriers is a major performance issue that prevents efficient
CPU utilization by OpenMP applications. Intel VTune Amplifier XE’s sampling method may
miss certain situations of imbalance. For example:



region instances that are smaller than the sampling interval,
the number of parallel region instances is insufficient to get statistically correct results,
or
threads enter a passive wait on a barrier and don’t consume CPU time on a busy wait
(e.g. for KMP_BLOCKTIME=0) .
To avoid these situations, the Intel OpenMP runtime library from Intel Parallel Studio XE 2016
Beta reports to Intel VTune Amplifier XE the precise imbalance time. This additional
information from the OpenMP runtime does not add overhead since the reporting is done on
a per-barrier basis. The precise imbalance metrics are displayed when the OpenMP Potential
Gain metric is expanded.
Detailed analysis by barrier-to-barrier region segments to explore performance of OpenMP
work-sharing constructs and barrier cost inside a region
When an OpenMP region contains multiple constructs with barriers (e.g., loops with implicit
barriers, a ‘single’ construct, or a user barrier), it is useful to distribute inefficiency metrics by
barrier-to-barrier segments. Below is an example of a region based on four barrier-to-barrier
segments.
8
The Intel OpenMP runtime from Intel Parallel Studio XE 2016 Beta instruments barriers for
Intel VTune Amplifier XE to enhance its inefficiency metrics. The barrier type is added to the
segment name – loop, single, reduction, etc. The runtime also emits additional information for
parallel loops with implicit barriers, such as loop scheduling and chunk size, that is useful in
understanding imbalance or the nature of the scheduling overhead. Use the /Barrier-toBarrier Segment grouping to view the statistical distribution by barrier-to-barrier segments.
Please note that the same lexical loop constructs with different schedule types or chunk sizes
will be displayed separately in different rows. For example, if one instance had a chunk size of
1000 and another had a chunk size of 1563, there would be two entries for the construct with
the same name but different sizes in the OpenMP Loop Chunk column.
Barrier-to-Barrier Segments are also available on the timeline.
9
2.5.2. Intel® MPI Library and OpenMP Multi-rank Analysis on a Compute Node
Per-rank Intel MPI Library communication busy wait time detection and display of metric in
summary, grid and timeline views
For hybrid MPI and OpenMP applications, it is important to explore OpenMP inefficiency along
with MPI communication between ranks. Intel VTune Amplifier XE recognizes samples in the
Intel MPI Library communication busy wait functions and shows metrics based on that
information. For multi-rank OpenMP results, Intel VTune Amplifier XE’s Summary view is
enriched with a table of Top MPI ranks with OpenMP metrics sorted by MPI Communication
Spin time from low to high values. The lower the Communication time the more the rank was
executing (vs. spinning) and the more impact OpenMP tuning will have on the application
elapsed time.
Process names are hyperlinked to the Bottom-up view with ‘/Process /OpenMP Region/ …’
groupings to get details of the OpenMP metrics aggregated per-process, with the ability to
expand the results by Regions and Barrier-to-Barrier Segments.
10
MPI Communication Spin time is highlighted on the timeline.
Intel MPI Library selective rank profiling configuration option, including EBS analysis, for
multiple ranks on a node
To simplify selective rank profiling configuration for Intel VTune Amplifier XE analysis of MPI
applications, the Intel MPI Library introduced ‘-gtool’ option in version 5.0.2. The option syntax
is:
$ mpirun -genvall –gtool “amplxe-cl -r <my_result> -collect
<analysis type>:<rank_set>[=exclusive]” -n <n> <my_app> [my_app_
options]
where <rank_set> specifies the rank range to be included in the Intel VTune Amplifier XE
analysis. Separate ranks with a comma or use the “-” symbol for a set of contiguous ranks. Use
the ‘all’ value to configure profiling on all the ranks. Exclusive launch mode helps prevent
running more than one collection per node, which is a limitation of EBS profiling.
Starting with Intel MPI Library 5.0.3, the ‘node-wide’ clause can be used instead of ‘exclusive’
to make collection on all ranks of the nodes on which the <rank_set> resides, or for all nodes
in the case of ‘all’ ranks. In this case, Intel VTune Amplifier XE will create a result directory per
node with host name suffix for the result directory name. This is particularly convenient for
EBS collection, where there are limitations on simultaneous profiling by multiple Intel VTune
Amplifier XE command lines.
11
Below is an example of node level profiling:
$mpirun –gtool “amplxe-cl –c advanced-hotspots –r
my_dir:all=node-wide” –n 4 –ppn 2 my_mpi_app
Intel VTune Amplifier XE command line generation for selective rank profiling through
Intel® Trace Analyzer and Collector (ITAC) user interface
With Intel Trace Analyzer and Collector 9.0.2 and later, you can generate Intel VTune Amplifier
XE hotspot analysis command lines for ranks selected in the ITAC graphical user interface:
from “Event Timeline” Chart, ‘Function Profile/ Load Balance’ grid view or copy the generated
command line for the most CPU bound process from ITAC summary page (see more
details https://software.intel.com/en-us/node/541057).
2.5.3. GPU Support Improvements
GPU Architecture Diagram annotated with metrics
On Windows* systems with Intel® HD Graphics, you may find it easier to analyze your OpenCL
application by exploring the GPU hardware metrics per GPU architecture blocks.
To do this, choose the Computing Task grouping level in the Graphics window, select an
OpenCL kernel of interest and click the Architecture Diagram tab in the Timeline pane. Intel
VTune Amplifier XE updates the architecture diagram for your platform with performance data
per GPU hardware metrics for the time range during which the selected kernel was executed.
12
GPU analysis on Linux
GPU analysis on Linux* targets is now available in Intel VTune Amplifier XE, including:

Support for the OpenCL application analysis (for Intel® HD Graphics) and GPU
usage analysis (except for the GPU hardware metrics).
Refer to the “GPU Analysis”, “GPU Usage” and “Interpreting GPU OpenCL™ Application
Analysis Data” help topics for details on analysis configuration and results interpretation.
13

Intel® Media Server SDK program analysis for Linux systems with Intel HD Graphics.
To perform analysis of Intel® Media Server SDK tasks execution over time, make sure to
configure your Linux kernel according to the “Intel® Media SDK Program Analysis
Configuration” topic in the Intel VTune Amplifier XE help.
Select the Trace OpenCL and Intel Media SDK programs (Intel HD Graphics only) option
in one of the Algorithm or Custom analysis types. To analyze Intel Media Server SDK tasks,
focus on the Timeline pane.
If you also enable the Analyze GPU usage option for the collection, use
the Graphics window to correlate data for the Intel Media SDK tasks execution with the
GPU software queue data.
14

Compute extended counter set support added for GPU hardware metrics analysis on the
5th Generation Intel® Core™ processors (Intel® microarchitecture code name Broadwell).

The Global/local accesses hardware event set for GPU analysis has been
renamed Compute basic (with global/local memory accesses) to better represent the
collected data. See the description in the "GPU Metrics” topic of the product help for
detailed metrics.
2.5.4. General Exploration analysis with confidence indication
Some of the metrics in Intel VTune Amplifier XE views may now be marked as unreliable by
greying out the values in the following views: Summary, Bottom-up, and Source. This can
happen when the amount of collected event samples is too low to reliably calculate the
metric.
Currently it is used for EBS metrics on General Exploration analysis but it may be extended to
more metrics in the future if the feedback is favorable.
15
2.5.5. New “Platform” tab replacing “Tasks and Frames” tab
The new “Platform” tab contains everything that was in “Tasks and Frames” tab, but can also
represent various platform data, such as GPU Usage/Queue, Bandwidth, and CPU Freq ratio
(depending on what was collected).
2.5.6. Driver-less Event-Based Sampling collection
Can’t install the Intel event-based sampling driver on Linux because IT won’t let you have root
access? Advanced analysis is available even if you can’t install the Intel event-based sampling
driver.
Driver-less event-based sampling is supported for the Advanced Hotspots, General
Exploration and Custom analysis types on Linux* operating systems based on kernel 2.6.32 or
higher, which exports CPU PMU programming details over
/sys/bus/event_source/devices/cpu/format file system. This driver-less sampling collection
mode is based on the Linux perf* functionality. Intel VTune Amplifier XE automatically enables
the driver-less collection if the Intel event-based sampling driver cannot be installed during
product installation.
NOTE: The Intel event-based sampling driver provides additional features not available in
perf, such as:
16




Stacks
Uncore events
Multiple precise events
New events for the latest processors, even on older operating systems
2.5.7. Timeline “Super Tiny” bird’s-eye view
Timeline analysis of core utilization on modern server and many-core co-processor cards with
a large number of ranks/threads is particularly useful with a bird’s-eye view to be able to
recognize application phases and behavioral patterns for further data zooming and filtering.
Intel VTune Amplifier XE’s “Super Tiny” view shows all application threads at once using a
pixel color intensity to reflect Efficient, Spin & Overhead and MPI Communication Time
metrics. Timeline hierarchical grouping for “Super Tiny” shows leaves only grouped according
to the grouping hierarchy:
Access the new view via the context menu:
2.5.8. New OS and IDE support
Microsoft* Visual Studio* 2015 integration is supported.
17
Microsoft Windows* 10 “Threshold” tech preview build #9926 has been tested and is
supported for hardware analysis, i.e., event-based sampling with or without stack sampling.
However, software-based analysis types, e.g., basic hotspots, concurrency, and locks and
waits, are not supported and will fail. The team is actively pursuing a fix.
2.6.
Intel® Inspector XE 2016 Beta
Intel Inspector XE 2016 Beta advances application reliability and quality with powerful
dynamic analysis tools for memory and thread error detection. This section is a brief overview
of features introduced with the Intel Inspector XE 2016 Beta product.
2.6.1. Support for latest OSes
Intel Inspector XE 2016 Beta has been fully updated to support the latest OSes for both
Windows* and Linux*. See the Intel Inspector XE 2016 Beta Release Notes for the full list of
OSes supported.
2.7.
Intel® Advisor XE 2016 Beta
Intel® Advisor XE 2016 Beta provides two tools to help ensure your Fortran and
native/managed C++ applications take full performance advantage of today’s processors:

Vectorization Advisor is a vectorization analysis tool that lets you identify loops that
will benefit most from vectorization, identify what is blocking effective vectorization,
explore the benefit of alternative data reorganizations, and increase the confidence
that vectorization is safe.

Threading Advisor is a threading design and prototyping tool that lets you analyze,
design, tune, and check threading design options without disrupting your normal
development.
Please check:
 Getting Started Guide to understand the core capabilities of the Vectorization Advisor
 Vectorization Advisor Frequently Asked Questions for more technical details and tips
2.7.1. Vectorization Advisor
Compiler and Performance Data All in One Place
Vectorization Advisor integrates performance data with the compiler reports generated by the
Intel compiler. This seamless integration provides easy access to all of your data in one place
and allows you to see if a loop has been vectorized or not. In addition, if a loop has been
vectorized, the tool also shows you the estimated gains achieved by vectorization.
18
Loop Trip Counts
Vectorization Advisor includes a new collection that tells you how many times a loop is
iterating as well as how many times the loop is called. Knowing these trip counts is critical for
vectorizing your code efficiently.
19
Recommendations
As part of analyzing your application, Vectorization Advisor provides specific
recommendations to improve the vectorization of your code. This advice is context-sensitive,
based on the observations made on your running application.
20
Correctness Via Dependency Analysis
Vectorization Advisor includes a collector that can analyze the loop-carried dependencies of a
loop. This dependency analysis lets you know if it is safe to vectorize a loop via compilersupplied pragmas.
21
Memory Access Pattern Analysis
Vectorization Advisor can analyze specific loops and let you know if you are accessing
memory in a vector-friendly manner. It can tell you if your memory is aligned properly and
also give you information regarding your loop strides.
22
2.8.
Intel® MPI Library 5.1 Beta
Intel MPI Library 5.1 Beta is a high performance implementation of the MPI standard focused
on making applications perform better on Intel® architecture-based clusters and enabling
support for multiple fabrics at runtime. This release focuses on usability improvements and
support for Intel® True Scale Fabric.
2.8.1. Usability Improvements
Troubleshooting Chapter in User’s Guide
The new Troubleshooting chapter in the User’s Guide provides three sets of information:

General Intel® MPI Library troubleshooting procedures
23


Typical MPI failures with corresponding output messages and behavior when a failure
occurs
Recommendations on potential root causes and solutions
This chapter is intended to aid users in solving common problems quickly and efficiently.
New Application-Specific Features in Automatic Tuner
The Automatic Tuner has two new options for Application-Specific tuning. Fast mode is
enabled by using --fast on the mpitune command line. This will complete application tuning
more quickly by starting from a previously defined tuning. Topology aware tuning is enabled
by setting I_MPI_TUNE_TOPOLOGY to enable or by using --rank-placement enable on the
mpitune command line. Topology aware tuning optimizes rank placement based on
communication patterns.
Application Topology in Hydra
The –use-app-topology option and the I_MPI_HYDRA_USE_APP_TOPOLOGY environment
variable enable Hydra to modify rank placement based on previously gathered statistics and
detected cluster topology. In this mode, Hydra will attempt to place ranks to optimally utilize
the topology of the cluster based on rank communications as detected in the MPI statistics
file.
2.8.2. Intel® True Scale Fabric Support
Intel MPI Library 5.1 Beta improves support for Intel® True Scale Fabric by defaulting
I_MPI_FABRIC_LIST to tmi,dapl,tcp,ofa if Intel® True Scale Fabric is detected. Be
aware that if you have a heterogeneous cluster, this could lead to nodes having different fabric
orders, causing failures at job start.
2.9.
Intel® Trace Analyzer and Collector 9.1 Beta
Intel Trace Analyzer and Collector 9.1 Beta is a graphical tool for understanding MPI
application behavior, quickly finding bottlenecks, improving correctness, and achieving high
performance for parallel cluster applications. Intel Trace Analyzer and Collector 9.1 Beta adds
improved support for profiling hybrid applications and a highly scalable MPI Performance
Snapshot.
2.9.1. MPI Performance Snapshot
The MPI Performance Snapshot is a lightweight overview of a program’s MPI performance.
This tool allows you to collect data showing how your application scales, as well as comparing
load balance between MPI, OpenMP*, and application code.
24
2.9.2. Improved Hybrid Profiling Support
Intel® Trace Analyzer now allows you to select ranks and generate the appropriate command
line to run Intel® VTune™ Amplifier XE on the selected ranks using the gtool option in the Intel®
MPI Library.
25
3. Additional Information
Please visit our product website to learn more about the Intel® Software Development Tools. If
you have any questions regarding the Beta program, please refer to the Intel Parallel Studio
XE 2016 Beta page. If you have any requests, issues, or feedback contact us at Intel® Premier
Customer Support.
4. Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.
NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S
TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY
WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO
SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING
TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could
result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE
INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL
INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES,
AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL
CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING
26
OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY,
OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER
OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE,
OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions
marked "reserved" or "undefined". Intel reserves these for future definition and shall have no
responsibility whatsoever for conflicts or incompatibilities arising from future changes to
them. The information here is subject to change without notice. Do not finalize a design with
this information.
The products described in this document may contain design defects or errors known as
errata which may cause the product to deviate from published specifications. Current
characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and
before placing your product order.
Copies of documents which have an order number and are referenced in this document, or
other Intel literature, may be obtained by calling 1-800-548-4725, or go to:
http://www.intel.com/design/literature.htm
This document contains information on products in the design phase of development.
MPEG-1, MPEG-2, MPEG-4, H.261, H.263, H.264, MP3, DV, VC-1, MJPEG, AC3, AAC, G.711,
G.722, G.722.1, G.722.2, AMRWB, Extended AMRWB (AMRWB+), G.167, G.168, G.169, G.723.1,
G.726, G.728, G.729, G.729.1, GSM AMR, GSM FR are international standards promoted by ISO,
IEC, ITU, ETSI, 3GPP and other organizations. Implementations of these standards, or the
standard enabled platforms may require licenses from various entities, including Intel
Corporation.
BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside,
E-GOLD, Flexpipe, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel
CoFluent, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel
NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow.,
the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel Xeon Phi, Intel XScale,
InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS,
MMX, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, Stay With
It, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon,
Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the
U.S. and/or other countries.
27
* Other names and brands may be claimed as the property of others.
Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of
Microsoft Corporation in the United States and/or other countries.
Java is a registered trademark of Oracle and/or its affiliates.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors
for optimizations that are not unique to Intel microprocessors. These optimizations include
SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee
the availability, functionality, or effectiveness of any optimization on microprocessors not
manufactured by Intel. Microprocessor-dependent optimizations in this product are
intended for use with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the applicable
product User and Reference Guides for more information regarding the specific instruction
sets covered by this notice.
Notice revision #20110804
Copyright © 2015, Intel Corporation. All rights reserved.
28