Intel® Parallel Studio XE 2016 Beta Program: What's New

Intel® Parallel Studio XE 2016 Beta Program: What's New Contents 1. Introduction ........................................................................................................................................ 3 2. What’s New in Intel® Parallel Studio XE 2016 Beta ...................................................................... 3 2.1. Intel® Parallel Studio XE 2016 Beta, Composer Edition ...................................................... 3 2.1.1. New Directory Layout ........................................................................................................ 3 2.1.2. OpenMP* 4.0 and 4.1 ........................................................................................................ 3 2.1.3. C and C++ Features ........................................................................................................... 4 2.1.4. Fortran Features ................................................................................................................. 4 2.1.5. Intel® Integrated Performance Primitives 9.0 Beta New Features ............................. 4 2.2. Intel® Cluster Checker 3.0 Beta ................................................................................................ 4 2.2.1. Running Intel Cluster Checker ......................................................................................... 5 2.2.2. Sample output .................................................................................................................... 5 2.3. Intel® Data Analytics Acceleration Library (Intel® DAAL) 2016 Beta .................................. 5 2.3.1. Data mining and analysis .................................................................................................. 5 2.3.2. Supervised and unsupervised machine learning ......................................................... 5 2.3.3. Support for local and distributed data sources ............................................................ 6 2.3.4. Data compression and decompression ......................................................................... 6 2.3.5. Data serialization and deserialization............................................................................. 6 2.4. Intel® Math Kernel Library (Intel® MKL) 11.3 Beta ................................................................ 6 2.4.1. MPI wrappers ...................................................................................................................... 6 2.4.2. Inspector-executor API for Sparse BLAS level 2 and 3 functions ............................. 6 2.4.3. Batch GEMM operations ................................................................................................... 6 2.4.4. New random number generators .................................................................................... 6 2.4.5. Other new features ............................................................................................................ 6 2.5. Intel® VTune™ Amplifier XE 2016 Beta ................................................................................... 6 2.5.1. Enhanced Intel® OpenMP* Analysis ................................................................................ 7 1 2.5.2. Intel® MPI Library and OpenMP Multi-rank Analysis on a Compute Node ............ 10 2.5.3. GPU Support Improvements.......................................................................................... 12 2.5.4. General Exploration analysis with confidence indication ......................................... 15 2.5.5. New “Platform” tab replacing “Tasks and Frames” tab ............................................. 16 2.5.6. Driver-less Event-Based Sampling collection ............................................................. 16 2.5.7. Timeline “Super Tiny” bird’s-eye view ......................................................................... 17 2.5.8. New OS and IDE support ................................................................................................ 17 2.6. Intel® Inspector XE 2016 Beta ............................................................................................... 18 2.6.1. 2.7. Intel® Advisor XE 2016 Beta ................................................................................................... 18 2.7.1. 2.8. Support for latest OSes .................................................................................................. 18 Vectorization Advisor ...................................................................................................... 18 Intel® MPI Library 5.1 Beta ...................................................................................................... 23 2.8.1. Usability Improvements.................................................................................................. 23 2.8.2. Intel® True Scale Fabric Support ................................................................................... 24 2.9. Intel® Trace Analyzer and Collector 9.1 Beta ...................................................................... 24 2.9.1. MPI Performance Snapshot ........................................................................................... 24 2.9.2. Improved Hybrid Profiling Support .............................................................................. 25 3. Additional Information.................................................................................................................... 26 4. Legal Information............................................................................................................................. 26 2 1. Introduction Intel® Parallel Studio XE 2016 Beta combines Intel's industry-leading C/C++ and Fortran compilers, performance and parallel libraries, error checking and code robustness tools, and performance profiling tools into a single suite offering. This helps boost application performance and increase the code quality, security, and reliability needed by highperformance computing and enterprise applications. At the same time, the single Intel® Parallel Studio XE suite for Windows* or Linux* eases the procurement of all the necessary tools and simplifies the transition from multicore to manycore processors in the future. Visit our website to learn more. 2. What’s New in Intel® Parallel Studio XE 2016 Beta 2.1. Intel® Parallel Studio XE 2016 Beta, Composer Edition The Intel Parallel Studio XE 2016 Beta, Composer Edition, includes Intel® C++ Compiler XE 16.0 Beta, Intel® [Visual] Fortran Compiler XE 16.0 Beta, debuggers, Intel® Math Kernel Library 11.3 Beta, Intel® Integrated Performance Primitives 9.0 Beta, and Intel® Threading Building Blocks 4.2 Update 3. Windows*, Linux*, and OS X* are supported. This section describes the new features of Intel Parallel Studio XE 2016 Beta, Composer Edition, we are seeking feedback on. To see other changes, please refer to the product release notes. 2.1.1. New Directory Layout The Intel® Parallel Studio XE 2016 Beta includes a redesigned directory layout. The majority of the layout changes are within the compiler and libraries component from the Composer Edition. These changes help reduce duplication of largely unchanged and common content between Intel® Parallel Studio XE 2016 product update releases and other Intel® Software Development Products, and help increase product stability and support. Please refer to the product release notes for additional information. 2.1.2. OpenMP* 4.0 and 4.1 Intel Parallel Studio XE 2016 Beta, Composer Edition, extends OpenMP* 4.0 feature support with added support of the following:     simdlen and safelen for loops Array reductions (Fortran) User-defined reductions (C/C++) Omp-simd collapse(N) clause Additionally, Intel Parallel Studio XE 2016 Beta, Composer Edition, presents initial OpenMP* 4.1 support with partial feature support (from Technical Report 3) for the following: 3   Non-structured data allocation with omp target [enter | exit ] data Asynchronous offload with the nowait clause on omp task   Dependent offload with the depend clause on omp task map clause modifiers always and delete 2.1.3. C and C++ Features The Intel® C++ compiler includes expanded language Standards support. For C customers, there is new C11 support for Unicode strings, C11 anonymous unions and keywords _Alignas, _Alighof, _Static_assert, _Thread_local, _Noreturn, and _Generic. For C++ customers, new C++14 features support includes: generic lambdas, generalized lambda captures, digit separators, the [[deprecated]] attribute, function return type deduction and member initializers and aggregates. Additional feature includes support of the feature test macros (Technical Report SD-6: SG10 Feature Test Recommendations). 2.1.4. Fortran Features Our Fortran customers will be delighted with new feature support for Submodules and enhancements in uninitialized variable run-time checking with [Q]init that extends to local, automatic, and allocated variables of intrinsic numeric type. The Fortran compiler includes additional Fortran Standard support for IMPURE ELEMENTAL (F2008) and further C Interoperability (F2015) from Technical Specification 29113. 2.1.5. Intel® Integrated Performance Primitives 9.0 Beta New Features Intel® Integrated Performance Primitives users can evaluate many new features including new APIs for using external threading for more effective thread support and APIs for external memory allocation that can reduce the memory allocation for different IPP function calls. The library includes CPU dispatcher improvements and new optimized cryptography functions to support SM2/SM3/SM4 algorithm. Users are encouraged to evaluate our new custom dynamic library building tool for creating a custom .DLL or shared library (.so) from IPP static libraries which enables linking with IPP dynamic libraries without the requirement of redistributing them. Please refer to the release notes for more information. 2.2. Intel® Cluster Checker 3.0 Beta Intel® Cluster Checker verifies the configuration and performance of Linux-based clusters and checks compliance with the Intel® Cluster Ready architecture specification. If issues are found, Intel® Cluster Checker diagnoses the problems and may provide recommendations on how to repair the cluster. Intel® Cluster Checker has the following features: Dynamic detection and diagnosis of problems with cluster configuration/performance with severity and confidence levels. Tests include, but are not limited to: basic node health checks, disk tests, network checks, environment wellness, MPI functionality, and single node and cluster performance tests. Operational modes include on-demand or asynchronous data collection. 4 2.2.1. Running Intel Cluster Checker Running Intel Cluster Checker is easy! 1. Provide the clck-collect binary with a list of nodes in your cluster 2. Then run clck-collect -f nodefile -a 3. Wait for the tests to execute, then run clck to get a report on the health of your cluster 2.2.2. Sample output This shows I have a difference in firmware in an Intel Xeon Phi card: 6. The SMC boot loader version is inconsistent for the device mic0. 9.22% of devices also have version 1.5.3797. [node(s),severity,confidence,remedy] [compute-0-1.local, 50, 1] [compute-0-29.local, 50, 1] [compute-0-5.local, 50, 1] This shows I'm not getting expected performance from one of my compute nodes: 7. The latest DGEMM result of 1021.81 GFLOPS is only 87.72 percent of the theoretical peak of 1164.80 GFLOPS [node(s): compute-0-29.local] [severity = 3] [confidence = 99] Check out the quick start guide for more details. 2.3. Intel® Data Analytics Acceleration Library (Intel® DAAL) 2016 Beta Intel® Data Analytics Acceleration Library (Intel® DAAL) is a new component of Intel® Parallel Studio XE. It is a library of optimized building blocks covering all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision making. It provides both C++ and Java programming language API, and has many features including: 2.3.1. Data mining and analysis These are algorithms for computing correlation distance and Cosine distance, PCA, Matrix decomposition (SVD, QR, Cholesky), statistical moments, variance-covariance matrices, univariate and multivariate outlier detection, and association rule mining. 2.3.2. Supervised and unsupervised machine learning These algorithms include linear regressions, naïve Bayes classifier, AdaBoost, LogitBoost, and BrownBoost classifiers, SVM, K-Means clustering, Expectation Maximization (EM) for Gaussian Mixture Models (GMM), and so on. 5 2.3.3. Support for local and distributed data sources Intel DAAL supports different data sources, including in-file and in-memory CSV, MySQL, HDFS, and the Resilient Distributed Dataset (RDD) objects of Apache Spark*. 2.3.4. Data compression and decompression Intel DAAL utility routines provide efficient implementations for ZLIB, LZO, RLE, and BZIP2. 2.3.5. Data serialization and deserialization Intel DAAL also provides data serialization and deserialization mechanisms for efficient data communication. 2.4. Intel® Math Kernel Library (Intel® MKL) 11.3 Beta Intel MKL 11.3 Beta introduces the following new features: 2.4.1. MPI wrappers MPI wrappers provide ABI compatibility between different MPI implementations. Users are now able to use Intel MKL seamlessly with most MPI implementations. 2.4.2. Inspector-executor API for Sparse BLAS level 2 and 3 functions The new 2-stage API provides performance benefits to some applications (e.g. iterative solvers), where a matrix structure analysis done in the first (inspector) stage allows better optimizations for operations in the subsequent (executor) stage. Refer to this online article for full documentation of this new feature. 2.4.3. Batch GEMM operations Newly introduced ?GEMM_BATCH and (C/Z)GEMM3M_BATCH functions allow applications to perform multiple independent matrix-matrix multiply operations in parallel. 2.4.4. New random number generators Introduced two new random number generators, ARS-5 (a pseudorandom number generator based on the Intel® AES-NI instruction set) and Philox4x32-10. 2.4.5. Other new features Cluster components are now available for OS X*. A new C-language version of the Intel MKL reference manual is now available. 2.5. Intel® VTune™ Amplifier XE 2016 Beta The Intel VTune Amplifier XE 2016 Beta provides an integrated performance analysis and tuning environment with command line and graphical user interface that helps you analyze code performance on systems with IA-32 or Intel® 64 architectures. To learn more about this product, see the VTune Amplifier XE 2016 Beta Documentation item in the Start menu 6 program folder or use the search feature of Windows 8.x to locate it. This section highlights some of the features of the Intel VTune Amplifier XE 2016 Beta software. 2.5.1. Enhanced Intel® OpenMP* Analysis Potential Gain expansion by parallelization inefficiencies representing their wall time cost CPU time-based classification of Spin and Overhead time in OpenMP runtime does not reveal the elapsed-time impact of a parallel region inefficiency because it depends on the number of working threads. Intel VTune Amplifier XE’s new per-OpenMP region metrics that are based on CPU time are now normalized by the number of threads in the region and represented as an expansion of the “potential gain” metric. Besides reporting the potential gain metric in absolute elapsed time, the Intel VTune Amplifier XE display breaks down the impact of various issues by percentage of total application elapsed time. 7 Precise trace-based imbalance calculation that is especially useful for profiling of small region instances Imbalance of working threads on barriers is a major performance issue that prevents efficient CPU utilization by OpenMP applications. Intel VTune Amplifier XE’s sampling method may miss certain situations of imbalance. For example:    region instances that are smaller than the sampling interval, the number of parallel region instances is insufficient to get statistically correct results, or threads enter a passive wait on a barrier and don’t consume CPU time on a busy wait (e.g. for KMP_BLOCKTIME=0) . To avoid these situations, the Intel OpenMP runtime library from Intel Parallel Studio XE 2016 Beta reports to Intel VTune Amplifier XE the precise imbalance time. This additional information from the OpenMP runtime does not add overhead since the reporting is done on a per-barrier basis. The precise imbalance metrics are displayed when the OpenMP Potential Gain metric is expanded. Detailed analysis by barrier-to-barrier region segments to explore performance of OpenMP work-sharing constructs and barrier cost inside a region When an OpenMP region contains multiple constructs with barriers (e.g., loops with implicit barriers, a ‘single’ construct, or a user barrier), it is useful to distribute inefficiency metrics by barrier-to-barrier segments. Below is an example of a region based on four barrier-to-barrier segments. 8 The Intel OpenMP runtime from Intel Parallel Studio XE 2016 Beta instruments barriers for Intel VTune Amplifier XE to enhance its inefficiency metrics. The barrier type is added to the segment name – loop, single, reduction, etc. The runtime also emits additional information for parallel loops with implicit barriers, such as loop scheduling and chunk size, that is useful in understanding imbalance or the nature of the scheduling overhead. Use the /Barrier-toBarrier Segment grouping to view the statistical distribution by barrier-to-barrier segments. Please note that the same lexical loop constructs with different schedule types or chunk sizes will be displayed separately in different rows. For example, if one instance had a chunk size of 1000 and another had a chunk size of 1563, there would be two entries for the construct with the same name but different sizes in the OpenMP Loop Chunk column. Barrier-to-Barrier Segments are also available on the timeline. 9 2.5.2. Intel® MPI Library and OpenMP Multi-rank Analysis on a Compute Node Per-rank Intel MPI Library communication busy wait time detection and display of metric in summary, grid and timeline views For hybrid MPI and OpenMP applications, it is important to explore OpenMP inefficiency along with MPI communication between ranks. Intel VTune Amplifier XE recognizes samples in the Intel MPI Library communication busy wait functions and shows metrics based on that information. For multi-rank OpenMP results, Intel VTune Amplifier XE’s Summary view is enriched with a table of Top MPI ranks with OpenMP metrics sorted by MPI Communication Spin time from low to high values. The lower the Communication time the more the rank was executing (vs. spinning) and the more impact OpenMP tuning will have on the application elapsed time. Process names are hyperlinked to the Bottom-up view with ‘/Process /OpenMP Region/ …’ groupings to get details of the OpenMP metrics aggregated per-process, with the ability to expand the results by Regions and Barrier-to-Barrier Segments. 10 MPI Communication Spin time is highlighted on the timeline. Intel MPI Library selective rank profiling configuration option, including EBS analysis, for multiple ranks on a node To simplify selective rank profiling configuration for Intel VTune Amplifier XE analysis of MPI applications, the Intel MPI Library introduced ‘-gtool’ option in version 5.0.2. The option syntax is: $ mpirun -genvall –gtool “amplxe-cl -r <my_result> -collect <analysis type>:<rank_set>[=exclusive]” -n <n> <my_app> [my_app_ options] where <rank_set> specifies the rank range to be included in the Intel VTune Amplifier XE analysis. Separate ranks with a comma or use the “-” symbol for a set of contiguous ranks. Use the ‘all’ value to configure profiling on all the ranks. Exclusive launch mode helps prevent running more than one collection per node, which is a limitation of EBS profiling. Starting with Intel MPI Library 5.0.3, the ‘node-wide’ clause can be used instead of ‘exclusive’ to make collection on all ranks of the nodes on which the <rank_set> resides, or for all nodes in the case of ‘all’ ranks. In this case, Intel VTune Amplifier XE will create a result directory per node with host name suffix for the result directory name. This is particularly convenient for EBS collection, where there are limitations on simultaneous profiling by multiple Intel VTune Amplifier XE command lines. 11 Below is an example of node level profiling: $mpirun –gtool “amplxe-cl –c advanced-hotspots –r my_dir:all=node-wide” –n 4 –ppn 2 my_mpi_app Intel VTune Amplifier XE command line generation for selective rank profiling through Intel® Trace Analyzer and Collector (ITAC) user interface With Intel Trace Analyzer and Collector 9.0.2 and later, you can generate Intel VTune Amplifier XE hotspot analysis command lines for ranks selected in the ITAC graphical user interface: from “Event Timeline” Chart, ‘Function Profile/ Load Balance’ grid view or copy the generated command line for the most CPU bound process from ITAC summary page (see more details https://software.intel.com/en-us/node/541057). 2.5.3. GPU Support Improvements GPU Architecture Diagram annotated with metrics On Windows* systems with Intel® HD Graphics, you may find it easier to analyze your OpenCL application by exploring the GPU hardware metrics per GPU architecture blocks. To do this, choose the Computing Task grouping level in the Graphics window, select an OpenCL kernel of interest and click the Architecture Diagram tab in the Timeline pane. Intel VTune Amplifier XE updates the architecture diagram for your platform with performance data per GPU hardware metrics for the time range during which the selected kernel was executed. 12 GPU analysis on Linux GPU analysis on Linux* targets is now available in Intel VTune Amplifier XE, including:  Support for the OpenCL application analysis (for Intel® HD Graphics) and GPU usage analysis (except for the GPU hardware metrics). Refer to the “GPU Analysis”, “GPU Usage” and “Interpreting GPU OpenCL™ Application Analysis Data” help topics for details on analysis configuration and results interpretation. 13  Intel® Media Server SDK program analysis for Linux systems with Intel HD Graphics. To perform analysis of Intel® Media Server SDK tasks execution over time, make sure to configure your Linux kernel according to the “Intel® Media SDK Program Analysis Configuration” topic in the Intel VTune Amplifier XE help. Select the Trace OpenCL and Intel Media SDK programs (Intel HD Graphics only) option in one of the Algorithm or Custom analysis types. To analyze Intel Media Server SDK tasks, focus on the Timeline pane. If you also enable the Analyze GPU usage option for the collection, use the Graphics window to correlate data for the Intel Media SDK tasks execution with the GPU software queue data. 14  Compute extended counter set support added for GPU hardware metrics analysis on the 5th Generation Intel® Core™ processors (Intel® microarchitecture code name Broadwell).  The Global/local accesses hardware event set for GPU analysis has been renamed Compute basic (with global/local memory accesses) to better represent the collected data. See the description in the "GPU Metrics” topic of the product help for detailed metrics. 2.5.4. General Exploration analysis with confidence indication Some of the metrics in Intel VTune Amplifier XE views may now be marked as unreliable by greying out the values in the following views: Summary, Bottom-up, and Source. This can happen when the amount of collected event samples is too low to reliably calculate the metric. Currently it is used for EBS metrics on General Exploration analysis but it may be extended to more metrics in the future if the feedback is favorable. 15 2.5.5. New “Platform” tab replacing “Tasks and Frames” tab The new “Platform” tab contains everything that was in “Tasks and Frames” tab, but can also represent various platform data, such as GPU Usage/Queue, Bandwidth, and CPU Freq ratio (depending on what was collected). 2.5.6. Driver-less Event-Based Sampling collection Can’t install the Intel event-based sampling driver on Linux because IT won’t let you have root access? Advanced analysis is available even if you can’t install the Intel event-based sampling driver. Driver-less event-based sampling is supported for the Advanced Hotspots, General Exploration and Custom analysis types on Linux* operating systems based on kernel 2.6.32 or higher, which exports CPU PMU programming details over /sys/bus/event_source/devices/cpu/format file system. This driver-less sampling collection mode is based on the Linux perf* functionality. Intel VTune Amplifier XE automatically enables the driver-less collection if the Intel event-based sampling driver cannot be installed during product installation. NOTE: The Intel event-based sampling driver provides additional features not available in perf, such as: 16     Stacks Uncore events Multiple precise events New events for the latest processors, even on older operating systems 2.5.7. Timeline “Super Tiny” bird’s-eye view Timeline analysis of core utilization on modern server and many-core co-processor cards with a large number of ranks/threads is particularly useful with a bird’s-eye view to be able to recognize application phases and behavioral patterns for further data zooming and filtering. Intel VTune Amplifier XE’s “Super Tiny” view shows all application threads at once using a pixel color intensity to reflect Efficient, Spin & Overhead and MPI Communication Time metrics. Timeline hierarchical grouping for “Super Tiny” shows leaves only grouped according to the grouping hierarchy: Access the new view via the context menu: 2.5.8. New OS and IDE support Microsoft* Visual Studio* 2015 integration is supported. 17 Microsoft Windows* 10 “Threshold” tech preview build #9926 has been tested and is supported for hardware analysis, i.e., event-based sampling with or without stack sampling. However, software-based analysis types, e.g., basic hotspots, concurrency, and locks and waits, are not supported and will fail. The team is actively pursuing a fix. 2.6. Intel® Inspector XE 2016 Beta Intel Inspector XE 2016 Beta advances application reliability and quality with powerful dynamic analysis tools for memory and thread error detection. This section is a brief overview of features introduced with the Intel Inspector XE 2016 Beta product. 2.6.1. Support for latest OSes Intel Inspector XE 2016 Beta has been fully updated to support the latest OSes for both Windows* and Linux*. See the Intel Inspector XE 2016 Beta Release Notes for the full list of OSes supported. 2.7. Intel® Advisor XE 2016 Beta Intel® Advisor XE 2016 Beta provides two tools to help ensure your Fortran and native/managed C++ applications take full performance advantage of today’s processors:  Vectorization Advisor is a vectorization analysis tool that lets you identify loops that will benefit most from vectorization, identify what is blocking effective vectorization, explore the benefit of alternative data reorganizations, and increase the confidence that vectorization is safe.  Threading Advisor is a threading design and prototyping tool that lets you analyze, design, tune, and check threading design options without disrupting your normal development. Please check:  Getting Started Guide to understand the core capabilities of the Vectorization Advisor  Vectorization Advisor Frequently Asked Questions for more technical details and tips 2.7.1. Vectorization Advisor Compiler and Performance Data All in One Place Vectorization Advisor integrates performance data with the compiler reports generated by the Intel compiler. This seamless integration provides easy access to all of your data in one place and allows you to see if a loop has been vectorized or not. In addition, if a loop has been vectorized, the tool also shows you the estimated gains achieved by vectorization. 18 Loop Trip Counts Vectorization Advisor includes a new collection that tells you how many times a loop is iterating as well as how many times the loop is called. Knowing these trip counts is critical for vectorizing your code efficiently. 19 Recommendations As part of analyzing your application, Vectorization Advisor provides specific recommendations to improve the vectorization of your code. This advice is context-sensitive, based on the observations made on your running application. 20 Correctness Via Dependency Analysis Vectorization Advisor includes a collector that can analyze the loop-carried dependencies of a loop. This dependency analysis lets you know if it is safe to vectorize a loop via compilersupplied pragmas. 21 Memory Access Pattern Analysis Vectorization Advisor can analyze specific loops and let you know if you are accessing memory in a vector-friendly manner. It can tell you if your memory is aligned properly and also give you information regarding your loop strides. 22 2.8. Intel® MPI Library 5.1 Beta Intel MPI Library 5.1 Beta is a high performance implementation of the MPI standard focused on making applications perform better on Intel® architecture-based clusters and enabling support for multiple fabrics at runtime. This release focuses on usability improvements and support for Intel® True Scale Fabric. 2.8.1. Usability Improvements Troubleshooting Chapter in User’s Guide The new Troubleshooting chapter in the User’s Guide provides three sets of information:  General Intel® MPI Library troubleshooting procedures 23   Typical MPI failures with corresponding output messages and behavior when a failure occurs Recommendations on potential root causes and solutions This chapter is intended to aid users in solving common problems quickly and efficiently. New Application-Specific Features in Automatic Tuner The Automatic Tuner has two new options for Application-Specific tuning. Fast mode is enabled by using --fast on the mpitune command line. This will complete application tuning more quickly by starting from a previously defined tuning. Topology aware tuning is enabled by setting I_MPI_TUNE_TOPOLOGY to enable or by using --rank-placement enable on the mpitune command line. Topology aware tuning optimizes rank placement based on communication patterns. Application Topology in Hydra The –use-app-topology option and the I_MPI_HYDRA_USE_APP_TOPOLOGY environment variable enable Hydra to modify rank placement based on previously gathered statistics and detected cluster topology. In this mode, Hydra will attempt to place ranks to optimally utilize the topology of the cluster based on rank communications as detected in the MPI statistics file. 2.8.2. Intel® True Scale Fabric Support Intel MPI Library 5.1 Beta improves support for Intel® True Scale Fabric by defaulting I_MPI_FABRIC_LIST to tmi,dapl,tcp,ofa if Intel® True Scale Fabric is detected. Be aware that if you have a heterogeneous cluster, this could lead to nodes having different fabric orders, causing failures at job start. 2.9. Intel® Trace Analyzer and Collector 9.1 Beta Intel Trace Analyzer and Collector 9.1 Beta is a graphical tool for understanding MPI application behavior, quickly finding bottlenecks, improving correctness, and achieving high performance for parallel cluster applications. Intel Trace Analyzer and Collector 9.1 Beta adds improved support for profiling hybrid applications and a highly scalable MPI Performance Snapshot. 2.9.1. MPI Performance Snapshot The MPI Performance Snapshot is a lightweight overview of a program’s MPI performance. This tool allows you to collect data showing how your application scales, as well as comparing load balance between MPI, OpenMP*, and application code. 24 2.9.2. Improved Hybrid Profiling Support Intel® Trace Analyzer now allows you to select ranks and generate the appropriate command line to run Intel® VTune™ Amplifier XE on the selected ranks using the gtool option in the Intel® MPI Library. 25 3. Additional Information Please visit our product website to learn more about the Intel® Software Development Tools. If you have any questions regarding the Beta program, please refer to the Intel Parallel Studio XE 2016 Beta page. If you have any requests, issues, or feedback contact us at Intel® Premier Customer Support. 4. Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING 26 OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm This document contains information on products in the design phase of development. MPEG-1, MPEG-2, MPEG-4, H.261, H.263, H.264, MP3, DV, VC-1, MJPEG, AC3, AAC, G.711, G.722, G.722.1, G.722.2, AMRWB, Extended AMRWB (AMRWB+), G.167, G.168, G.169, G.723.1, G.726, G.728, G.729, G.729.1, GSM AMR, GSM FR are international standards promoted by ISO, IEC, ITU, ETSI, 3GPP and other organizations. Implementations of these standards, or the standard enabled platforms may require licenses from various entities, including Intel Corporation. BlueMoon, BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Inside, Cilk, Core Inside, E-GOLD, Flexpipe, i960, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel CoFluent, Intel Core, Intel Inside, Intel Insider, the Intel Inside logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel Sponsors of Tomorrow., the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Intel Xeon Phi, Intel XScale, InTru, the InTru logo, the InTru Inside logo, InTru soundmark, Itanium, Itanium Inside, MCS, MMX, Pentium, Pentium Inside, Puma, skoool, the skoool logo, SMARTi, Sound Mark, Stay With It, The Creators Project, The Journey Inside, Thunderbolt, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, X-GOLD, XMM, X-PMU and XPOSYS are trademarks of Intel Corporation in the U.S. and/or other countries. 27 * Other names and brands may be claimed as the property of others. Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries. Java is a registered trademark of Oracle and/or its affiliates. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Copyright © 2015, Intel Corporation. All rights reserved. 28

Intel® Parallel Studio XE 2016 Beta Program: What's New

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib