Title Of The Research Paper:

Analyzing The Performance Of Multicore Systems.

Research Area:

Performance Analysis of Multicore systems.

Authors:

Nasika S. Bilkhis

Neetha Bali B

Poornima K. S.

Madhuri M.

Faculty mentor:

Dr. Srikanta Murthy.

Name of the Institution: PES Institute of Technology

Abstract:

In today's world, multicore processors are in high demand and are being adopted by individuals and organizations alike. One of the major challenges faced by system developers is analyzing the potential performance of a processor and/or system-on-a-chip (SoC) that is based on multicore technology. In this paper, four methods for analyzing multicore systems are discussed.

The first method involves ready-made benchmark suites developed by EEMBC (the Embedded Microprocessor Benchmark Consortium). This is an easy and readily available way to analyze the performance of processors and to help designers choose the best one for their application.

The second method uses Amdahl's law. This law is effective in predicting the theoretical maximum speedup using multiple processors. The formulae for calculating the effective speedup, and its relationship with various factors, are dealt with in this paper.

The third method is the Intel VTune Performance Analyzer, which can quickly summarize performance characteristics, enabling designers and engineers to be much more effective at performance tuning within a fixed amount of development time.

The fourth method is the Acumem Virtual Performance Expert (VPE) which can unlock the performance potential of multi-core systems by detecting slow spots in the application code. A further unique capability is that the application performance behavior on other systems can be predicted and visualized, providing the richest information possible for the application enhancement decision.

Background:

About multicore:

A multi-core processor (or chip-level multiprocessor, CMP) combines two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC), called a die, or of several dies packaged together.

• A dual-core processor contains two cores. Ex: Intel Core 2 Duo.

• A quad-core processor contains four cores. Ex: Intel® Core™2 Quad, Intel Core i7 (the next release from Intel).

• A multi-core microprocessor implements multiprocessing in a single physical package.

• A processor with all cores on a single die is called a monolithic processor.

• Each "core" independently implements optimizations such as superscalar execution, pipelining, and multithreading.

• A system with n cores is effective when it is presented with n or more threads concurrently.

• Cores in a multicore device may share a single coherent cache at the highest on-device cache level (e.g. L2 for the Intel Core 2) or may have separate caches (e.g. current AMD dual-core processors).

Use:

The most commercially significant (or at least the most 'obvious') multi-core processors are those used in personal computers (primarily from Intel and AMD) and game consoles (e.g., the eight-core Cell processor in the PS3 and the three-core Xenon processor in the Xbox 360).

Problem Statement:

This paper deals with the various ways of analyzing the performance of multicore systems. Performance analysis is a very challenging task in itself.

Markus Levy, President of the Multicore Association, states in an article on analyzing multicore processor performance using EEMBC benchmarks:

“A major challenge lies in analyzing the potential performance of a processor that is based on multicore technology. Not surprisingly, putting multiple execution cores into a single processor (as well as continuing to increase clock frequency), does not guarantee greater multiples of processing power. Furthermore, for any application, there is no assurance that a multicore processor will deliver a dramatic increase in a system’s throughput.”

Methodology:

Four methods for performance analysis are discussed in this paper:

Using benchmarks:

EEMBC, the Embedded Microprocessor Benchmark Consortium, is a non-profit corporation formed to standardize on real-world, embedded benchmark software to help designers select the right embedded processors for their systems. The result is a collection of "algorithms" and "applications" organized into benchmark suites targeting various products. An additional suite of benchmarks, called MultiBench, specifically targets the capabilities of multicore processors based on SMP architecture. These benchmarks may be obtained by joining EEMBC's open membership or through a corporate or university licensing program. The EEMBC Technology Center manages development of new benchmark software and certifies benchmark test results. Scores for devices that have been tested and certified can be searched from the Benchmark Score pages.

With the proliferation of multicore processor implementations, the need is growing for performance benchmarks that can give an accurate indication of the value of transitioning from a single core to a multicore system, in addition to determining the impact of system-level bottlenecks, such as those encountered when moving data on and off a multicore chip. EEMBC is addressing this challenge with new multicore benchmark suites that will enable a standardized evaluation of the benefits of concurrency while providing the scalability needed to support any number of cores.

The benchmarks will target three forms of concurrency: task decomposition, multiple data stream processing, and the processing of multiple workloads.

• Task decomposition allows multiple threads to cooperate on achieving a unified goal and demonstrates a processor's support for fine-grain parallelism.

• Processing of multiple data streams uses common code running over multiple threads and demonstrates how well a solution can scale over scalable data inputs.

• Multiple workload processing shows the scalability of a solution for general-purpose processing and activates concurrency over both code and data.

To implement this strategy on the benchmark level, EEMBC is developing a test harness that will communicate with the benchmark through an abstraction layer that is analogous to an algorithm wrapper. This test harness will provide a flexible interface to allow a wide variety of thread-enabled workloads to be tested.
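The harness idea can be illustrated with a minimal Python sketch. This is not EEMBC's harness or API; the names `checksum`, `run_workload` and `scaling_report` are hypothetical and chosen only for this example. It covers the "multiple data streams" form of concurrency: the same code is run over several inputs by a pool of threads, and the elapsed time is compared across thread counts.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def checksum(block: bytes) -> int:
    """Stand-in workload (hypothetical): a simple checksum over one data block."""
    return sum(block) % 65521


def run_workload(workload, data_streams, workers):
    """Apply the same code to every data stream using `workers` threads.

    Returns the results in input order and the elapsed wall-clock seconds.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(workload, data_streams))
    elapsed = time.perf_counter() - start
    return results, elapsed


def scaling_report(workload, data_streams, worker_counts):
    """Time the workload at each worker count.

    Each entry is (workers, seconds, speedup relative to the first count).
    """
    report = []
    baseline = None
    for workers in worker_counts:
        _, elapsed = run_workload(workload, data_streams, workers)
        if baseline is None:
            baseline = elapsed
        speedup = baseline / elapsed if elapsed > 0 else float("inf")
        report.append((workers, elapsed, speedup))
    return report


if __name__ == "__main__":
    streams = [bytes([i % 256]) * 200_000 for i in range(16)]
    for workers, seconds, speedup in scaling_report(checksum, streams, [1, 2, 4]):
        print(f"{workers} thread(s): {seconds:.4f} s, speedup {speedup:.2f}x")
```

In standard CPython the global interpreter lock keeps CPU-bound threads from running in parallel, so the numbers printed by this sketch show the structure of such a harness rather than real multicore scaling.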

Amdahl's law:

Amdahl's law, also known as Amdahl's argument, is named after computer architect Gene Amdahl, and is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.

The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example (see Figure 1), if a program needs 20 hours on a single processor core, and a portion taking 1 hour cannot be parallelized while the remaining 19 hours (95%) can, then regardless of how many processors are devoted to the parallelized execution of this program, the execution time cannot fall below that critical 1 hour. Hence the speedup is limited to 20x, as the figure illustrates.

Figure 1: illustrates the effect of parallelized execution of a program, each time parallelizing different portions of the program.
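The 20-hour example can be written out explicitly. This is a sketch under the assumption that the 19 parallelizable hours divide perfectly across N processors with no communication overhead; T(N) is a symbol introduced here for the running time on N processors.

```latex
T(N) = 1 + \frac{19}{N}\ \text{hours}, \qquad
\text{speedup}(N) = \frac{T(1)}{T(N)} = \frac{20}{1 + 19/N}
\;\xrightarrow{\;N \to \infty\;}\; 20
```

With N = 19, for instance, the running time is 2 hours and the speedup is 10x, already half of the 20x ceiling.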

Amdahl's law is a model for the relationship between the expected speedup of parallelized implementations of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelized. For example, if for a given problem size a parallelized implementation of an algorithm can run 12% of the algorithm's operations arbitrarily quickly (while the remaining 88% of the operations are not parallelizable), Amdahl's law states that the maximum speedup of the parallelized version is 1/(1 - 0.12) = 1.136 times faster than the nonparallelized implementation.

More technically, the law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation, where the improvement has a speedup of S. (For example, if an improvement can speed up 30% of the computation, P will be 0.3; if the improvement makes the portion affected twice as fast, S will be 2.) Amdahl's law states that the overall speedup of applying the improvement will be

$$\frac{1}{(1 - P) + \frac{P}{S}}$$
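For the values used in that example (P = 0.3, S = 2), the standard form of Amdahl's law works out as follows. The arithmetic is a worked illustration added here, not a result reported in the paper.

```latex
\text{speedup} = \frac{1}{(1 - P) + \frac{P}{S}}
= \frac{1}{(1 - 0.3) + \frac{0.3}{2}}
= \frac{1}{0.85} \approx 1.18
```

Doubling the speed of 30% of the computation therefore makes the whole computation only about 18% faster.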

In the case of parallelization, Amdahl's law states that if P is the proportion of a program that can be made parallel (i.e. benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is

$$\frac{1}{(1 - P) + \frac{P}{N}}$$

In the limit, as N tends to infinity, the maximum speedup tends to 1 / (1-P).

In practice, performance/price falls rapidly as N is increased once there is even a small component of (1 − P).

As an example, if P is 90%, then (1 − P) is 10%, and the problem can be sped up by a maximum of a factor of 10, no matter how large the value of N used.

For this reason, parallel computing is only useful for either small numbers of processors, or problems with very high values of P: so-called embarrassingly parallel problems. A great part of the craft of parallel programming consists of attempting to reduce (1-P) to the smallest possible value.
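The behaviour described in this section can be checked numerically with a short Python sketch. It assumes the standard N-processor form of Amdahl's law, 1 / ((1 − P) + P / N); the function names `amdahl_speedup` and `amdahl_limit` are hypothetical and introduced only for this example.

```python
def amdahl_speedup(p: float, n: float) -> float:
    """Maximum speedup when a fraction p of the work is spread over n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p: float) -> float:
    """Upper bound on the speedup as n grows without bound: 1 / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p == 1.0:
        return float("inf")
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    # The three parallel fractions used as examples in the text.
    for p in (0.12, 0.90, 0.95):
        row = ", ".join(
            f"N={n}: {amdahl_speedup(p, n):.2f}x" for n in (2, 4, 16, 1024)
        )
        print(f"P={p:.2f} -> {row}, limit {amdahl_limit(p):.2f}x")
```

Running the script shows how quickly the speedup flattens towards the limit as N grows, which is why reducing (1 − P) usually pays off more than adding processors.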

Intel VTune Performance Analyzer:

Intel® VTune™ Performance Analyzer is a tool for making software run as fast as possible on single-core and multicore systems. It analyzes applications without recompilation or relinking, on systems ranging from handhelds to supercomputers. It can quickly drill down to the source to identify problematic lines of code. It is robust with large applications (over 1 GB of source code) and with multicore, multiprocessor, and NUMA systems using the latest Intel® processors.

Features

• Low Overhead Sampling Profiling: System-wide, event-based sampling finds the bottlenecks with low overhead and can be used to tune libraries, drivers, and application programs.

• Call Graph Profiling: Determines calling sequences and graphically displays the critical path, allowing the users to see which functions took the most time to process or were blocked the longest.

• Counter Monitor: Quickly identify system level performance issues using the Counter Monitor to track system activity and resource consumption during runtime.

• Intel Tuning Assistant: Increase productivity using Intel Tuning Assistant to automatically provide advice based on extensive knowledge.

• New Events for Tuning Multi-core Processors: Identify opportunities to improve threading, tune multi-core sharing of the bus and cache, and optimize cache-line usage.

Performance

• Source and disassembly views allow the answers to be viewed on the source by showing the exact lines of code taking the most time.

• The Counter Monitor indicates whether reduced available memory or performance issues associated with file I/O slow down the application.

• Multi-threading support for load balancing and idle time identification.

Acumem VPE:

Acumem Virtual Performance Expert (VPE) can unlock the performance potential of the multi-core system by detecting slow spots in the application code. In addition to standard analysis, Acumem VPE also assesses multi-core cache performance and memory bandwidth, and guides towards enhancing the code. Its predictive capability allows the most efficient allocation of programming effort to boost application speed in the shortest possible time. A further unique capability is that any application performance behavior on other systems can be predicted and visualized, providing the richest information possible for that application enhancement decision.

Solution benefits

• Identify and reduce multi-core performance bottlenecks

• Find and fix slow spots in the application code

• Analyze and improve memory bandwidth utilization

• Predict application performance for the next HP System the application has to be run on.

It quantifies and locates performance problems quickly and simply, displaying system-level metrics of the potential performance boost for each slow spot identified.

Acumem's performance productivity tools, Acumem SlowSpotter and Acumem ThreadSpotter, are advanced and easy-to-use tools for optimizing single- and multithreaded applications of all sizes. Acumem ThreadSpotter also works well for OpenMP and can analyze MPI applications. Both tools give hands-on advice based on analysis of cache and memory bandwidth related performance problems for single- and multi-core systems. Thanks to an intuitive GUI and a very low overhead, Acumem tools immediately increase the productivity of programmers and allow them to solve complex issues in a matter of minutes, not days.

Zero ramp-up time

Acumem SlowSpotter and ThreadSpotter are started from a GUI and no prior knowledge is required. The user is immediately presented with a high-level overview and diagnostics of the application along four major performance areas: Memory Bandwidth, Memory Latency, Data Locality and, for Acumem ThreadSpotter™, also Thread Communication/Interaction. This initial analysis answers the question: what are the improvement areas and what is the potential?

Increased productivity – takes you to the spot of the crime

Acumem tools pinpoint SlowSpots in the code and explain what the performance issues are and how to go about fixing them. The advice is hands-on and allows experts as well as non-experts to quickly determine where to focus and what to do for their unique application. Each piece of advice is related back to the corresponding source code or data structure. In addition to the advice given by Acumem SlowSpotter, Acumem ThreadSpotter has a unique set of advice types that deal with false sharing, race conditions and other multithread-specific problems.

For many applications, performance improvements of a factor of 2 or more can be achieved by optimizing how the memory system is used. Often a few lines of code are responsible for a large share of the performance improvement potential.

Acumem's performance tools find these opportunities and present them according to priority, making quick performance wins not only a dream but a reality.

Key Results:

The efficiency of performance analysis by the above four methods is represented by the pie chart shown below:

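[Pie chart: efficiency shares of 31%, 13%, 17% and 39% across the four methods (benchmark, Amdahl's law, VTune, Acumem); the slice-to-method mapping is not preserved.]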

Discussion:

The first method, i.e. using benchmarks, gives a performance report for a particular device. But it does not provide a performance comparison between single-core and multi-core systems. Hence, given a specific application, the designer is unable to decide whether a transition from single-core to multi-core is required.

The second method, i.e. Amdahl's law, is very efficient in comparing single-core and multi-core systems. But this method relies on several separate formulae for analysing the performance, since it is not integrated into a single suite as the benchmarks are.

The other two methods, i.e. the VTune Analyzer and Acumem VPE, are application-specific and, unlike the first two methods, help in locating the slow spots in the code. They also help in enhancing the performance and in predicting the performance of the application on other systems.

Conclusion:

The four methods discussed above have their own advantages and disadvantages. Hence the choice of method for analyzing performance depends on the application and the designer's requirements.

Ex: For applications with about 1 GB of source code, the VTune analyzer is suitable.

The VTune analyzer has the greatest efficiency of the four methods, owing to the following advantages:

• Low Overhead

• System Wide Analysis

• No Source Code Required

• Sampling Does Not Require Instrumentation

References:

http://www.acumem.com/images/stories/AcumemSlowSpotter.pdf

Intel_VTune_Linux_InDepth.pdf

http://en.wikipedia.org/wiki/Multi-core_(computing)

http://cache-www.intel.com/cd/00/00/21/92/219271_advantage_vtune.pdf

http://www.eembc.org/

Acknowledgements:

This paper is the result of the encouragement, guidance, and constant help and support extended to us by our mentor Dr. Srikanta Murthy, Department of Information Science & Engineering, PESIT. It is with hearty gratitude that we acknowledge his contributions to our project.
