Title of the research paper: Performance Analysis of Multicore Systems
Research Area: Multicore Systems
Authors: Lakhvinder Singh & Harmeet Kaur
Faculty mentor: Dayanand J.
Name of the Institution: GURU NANAK DEV ENGG COLLEGE, BIDAR
Abstract: One constant in computing is that the world’s hunger for faster
performance is never satisfied. Every new performance advance in processors leads
to another level of greater performance demands from businesses and consumers.
Today these performance demands are not just for speed, but also for smaller, more
powerful mobile devices, longer battery life, quieter desktop PCs, and—in the
enterprise—better price/performance per watt and lower cooling costs. People want
improvements in productivity, security, multitasking, data protection, game
performance, and many other capabilities.
There’s also a growing demand for more convenient form factors for the home, data
center, and on the go.
Through advances in silicon technology, micro-architecture, software, and platform technologies, Intel is on a fast-paced trajectory to continuously deliver new generations of multi-core processors with the superior performance and energy efficiency necessary to meet these demands for years to come.
In mid-2006, we reached new levels of energy-efficient performance with our Intel®
Core™2 Duo processors and Dual-Core
Intel® Xeon® processor 5100 series, both produced with our latest 65-nanometer
(nm) silicon technology and micro-architecture.
Now we're delivering the world's first mainstream quad-core processors for both desktops and mainstream servers: the Intel® Core™2 Quad processor, the Intel® Core™2 Extreme quad-core processor, and others.
This paper explains the advantages and challenges of multi-core processing and the
direction in which Intel is taking multi-core processors to the future. We discuss
many of the benefits you will see as we continue to increase processor performance, energy efficiency, and capabilities.
Background: For years, Intel customers came to expect a doubling of performance every 18 to 24 months, in accordance with Moore's Law. Most of these performance gains came from dramatic increases in frequency (from 5 MHz to 3 GHz between 1983 and 2002) and from process technology advancements. Improvements also came from increases in instructions per cycle (IPC). By 2002, however, increasing power densities and the resultant heat began to reveal the limitations of relying predominantly on frequency to improve performance. So, while Moore's Law, frequency increases, and IPC improvements continue to play an important role in performance gains, new thinking is also required.
The best example of this new thinking is multi-core processors.
By putting multiple execution cores into a single processor (as well as continuing to
increase clock frequency), Intel is able to provide even greater multiples of
processing power.
Using multi-core processors, Intel can dramatically increase a computer’s capabilities
and computing resources, providing better responsiveness, improving multithreaded
throughput, and delivering the advantages of parallel computing to properly
threaded mainstream applications.
While manufacturing technology continues to improve, reducing the size of single
gates, physical limits of semiconductor-based microelectronics have become a major
design concern. Some effects of these physical limitations can cause significant heat
dissipation and data synchronization problems. The demand for more capable
microprocessors causes CPU designers to use various methods of increasing
performance. Some instruction-level parallelism (ILP) methods like superscalar
pipelining are suitable for many applications, but are inefficient for others that tend
to contain difficult-to-predict code. Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. A combination of increased available
space due to refined manufacturing processes and the demand for increased TLP is
the logic behind the creation of multi-core CPUs.
Problem Statement: How to increase the performance of multi-core systems?
Methodology:
• The performance of a processor can be increased by raising its clock speed and bus speed.
• Increasing processor speed requires a large cache memory.
• Transistors are the basic resource for processor performance.
According to Moore's Law, "the number of transistors that can be integrated on a single chip keeps increasing exponentially," and a processor that achieves its speed with the minimum number of transistors is considered the better design.
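As a rough illustration of this exponential growth, the short Python sketch below projects transistor counts under an assumed 24-month doubling period; the 1971 starting point (the 2,300-transistor Intel 4004) is an illustrative assumption, not data from this paper.

```python
# Illustrative projection of Moore's Law transistor scaling.
# Assumes a 24-month doubling period and a 1971 starting point
# of 2,300 transistors (the Intel 4004).

def transistors(year, base_year=1971, base_count=2_300, doubling_years=2.0):
    """Projected transistor count for a given year."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

for year in (1971, 1983, 2002, 2006):
    print(f"{year}: ~{transistors(year):,.0f} transistors")
```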
A FUNDAMENTAL THEOREM OF MULTI-CORE PROCESSORS.
"A multi-core processor takes advantage of a fundamental relationship between power and frequency."
By incorporating multiple cores, each core is able to run at a lower frequency, dividing among the cores the power normally given to a single core.
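A minimal sketch of this power/frequency relationship, assuming the standard dynamic-power model P = C * V^2 * f and that supply voltage scales roughly linearly with frequency (so power grows roughly with the cube of frequency); the capacitance and voltage constants are arbitrary illustrative values.

```python
# Sketch of the power/frequency trade-off behind multi-core designs.
# Assumes dynamic power P = C * V^2 * f, with voltage scaling roughly
# linearly with frequency, so P grows roughly as f**3.

def dynamic_power(freq_ghz, c=1.0, v_per_ghz=0.4):
    """Relative dynamic power for one core at the given frequency."""
    voltage = v_per_ghz * freq_ghz          # simplistic V ~ f assumption
    return c * voltage**2 * freq_ghz

one_fast_core    = dynamic_power(3.0)       # one core at 3.0 GHz
two_slower_cores = 2 * dynamic_power(2.0)   # two cores at 2.0 GHz

# Two cores at 2 GHz give more total cycles (4 GHz worth) than one
# core at 3 GHz, yet draw less power under this model.
print(f"one 3 GHz core:  {one_fast_core:.2f} (relative units)")
print(f"two 2 GHz cores: {two_slower_cores:.2f} (relative units)")
```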
Multi-Threading
Processor designers have found that since most microprocessors spend a significant
amount of time idly waiting for memory, software parallelism can be leveraged to
hide memory latency. Since memory stalls typically take on the order of 100
processor cycles, a processor pipeline is idle for a significant amount of time.
Table 1 shows the amount of time spent waiting for memory in some typical applications on 2 GHz processors.
For example, for a workload such as a Web server, there are sufficient memory stalls that the average number of machine cycles per instruction is 1.5 to 2.5, leaving the pipeline waiting for memory up to 50% of the time.
In Figure 3, we can see that less than 50% of the processor’s pipeline is actually
being used to process instructions; the remainder is spent waiting for memory.
By providing additional sets of registers per processor pipeline, multiple software jobs can be multiplexed onto the pipeline, a technique known as simultaneous multi-threading (SMT). Threads are switched onto the pipeline when one blocks or waits on memory, thus allowing the pipeline to be utilized potentially to its maximum.
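A minimal sketch of this effect, assuming an idealized model in which each thread is ready (not stalled) for a fixed fraction of its cycles and threads interleave perfectly on one pipeline; the busy fraction of 0.5 is loosely based on the 1.5 to 2.5 cycles-per-instruction figure above and is illustrative.

```python
# Idealized model of hiding memory latency with hardware threads.
# Each thread is "ready" (not stalled on memory) for busy_fraction of
# the time; with n threads interleaved on one pipeline, utilization is
# capped by min(1, n * busy_fraction) under perfect scheduling.

def pipeline_utilization(n_threads, busy_fraction):
    """Upper bound on pipeline utilization with n interleaved threads."""
    return min(1.0, n_threads * busy_fraction)

# A workload averaging 2.0 cycles per instruction with 1.0 compute
# cycle spends half its time stalled, so busy_fraction = 0.5.
for n in (1, 2, 4):
    u = pipeline_utilization(n, busy_fraction=0.5)
    print(f"{n} thread(s): pipeline ~{u:.0%} utilized")
```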
Figure 4 shows an example with four threads per core. In each core, when a memory
stall occurs, the pipeline switches to another thread, making good use of the pipeline
while the previous memory stall is fulfilled. The tradeoff is latency for bandwidth;
with enough threads, we can completely hide memory latency, provided there is
enough memory bandwidth for the added requests. Successful SMT systems typically
allow for very high memory bandwidth from DRAM, as part of their balanced
architecture.
SMT has a high return on performance in relation to additional transistor count.
For example, a 50% performance gain may be realized by adding just 10% more
transistors with an SMT approach, in contrast to making the pipeline more complex,
which typically affords a 10% performance gain for a 100% increase in transistors.
Also, implementing multi-core alone doesn’t yield optimal performance—the best
design is typically a balance of multi-core and SMT.
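To make the comparison concrete, the sketch below computes performance gain per unit of added transistor budget for the two approaches; the 50%/10% and 10%/100% figures come directly from the text above.

```python
# Performance return per unit of added transistor budget, using the
# figures quoted above: SMT gives ~50% more performance for ~10% more
# transistors, while a more complex pipeline gives ~10% more
# performance for ~100% more transistors.

def gain_per_transistor(perf_gain, transistor_increase):
    """Performance gain per 1% of added transistors."""
    return perf_gain / transistor_increase

smt      = gain_per_transistor(perf_gain=50, transistor_increase=10)
pipeline = gain_per_transistor(perf_gain=10, transistor_increase=100)

print(f"SMT:              {smt:.1f}% gain per 1% transistors")     # 5.0
print(f"complex pipeline: {pipeline:.1f}% gain per 1% transistors")  # 0.1
```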
Key Results: Best Energy-Efficient Performance Processor Transistors
• Intel Second Generation Strained Silicon Technology increases transistor
performance 10 to 15 percent without increasing leakage.
• Compared to 90 nm transistor technology, Intel’s enhanced energy-efficient
performance 65 nm transistors provide over 20% improvement in transistor
switching speed and over 30% reduction in transistor switching power.
Discussion: This fundamental relationship between power and frequency can be
effectively used to multiply the number of cores from two to four, and then eight and
more, to deliver continuous increases in performance without increasing power
usage. To do this, though, many advancements must be made that are only achievable by a company like Intel.
These include:
• Continuous advances in silicon process technology, from 65 nm to 45 nm and on to 32 nm, to increase transistor density.
In addition, Intel is committed to continuing to deliver superior energy-efficient
performance transistors.
• Enhancing the performance of each core and optimizing it for multi-core through
the introduction of new advanced micro-architectures about every two years.
• Improving the memory subsystem and optimizing data access in ways that ensure
data can be used as fast as possible among all cores. This minimizes latency and
improves efficiency and speed.
• Optimizing the interconnect fabric that connects the cores to improve performance
between cores and memory units.
Scope for future work (if any):

Network-on-chip (NoC):
Network-on-chip (NoC) has emerged as a new paradigm for designing multicore systems. NoC will help in designing future multicore systems in which large numbers of Intellectual Property (IP) cores are connected to the communication fabric (a router-based network) through network interfaces. The network is used for packet-switched on-chip communication among cores and supports a high degree of reusability and scalability. In this work, a scalable network based on the Mesh-of-Tree (MoT) topology is presented. The MoT interconnection network has the advantages of a small diameter, a large bisection width, and a convenient recursive structure. These characteristics make it more powerful than other interconnection networks such as meshes and binary trees. A generic NoC simulator is designed to evaluate the performance of different topologies, in terms of network throughput, latency, and power, under different traffic situations.
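As a rough comparison of the topology metrics mentioned above, the sketch below evaluates standard diameter and bisection-width formulas for a 2D mesh and a binary tree; the MoT figures (diameter roughly 4*log2(n) and bisection width roughly n for an n x n MoT) are assumptions based on its recursive row/column tree structure, not results from this paper's simulator.

```python
# Rough comparison of interconnect topology metrics for N = n*n cores.
# Mesh and binary tree formulas are standard; the Mesh-of-Tree (MoT)
# values are assumed from its recursive structure.

import math

def metrics(n):
    N = n * n                                  # total cores
    mesh = (2 * (n - 1), n)                    # (diameter, bisection width)
    tree = (2 * math.ceil(math.log2(N)), 1)    # binary tree over N leaves
    mot  = (4 * math.ceil(math.log2(n)), n)    # assumed MoT figures
    return N, mesh, tree, mot

for n in (4, 8, 16):
    N, mesh, tree, mot = metrics(n)
    print(f"N={N:3d}  mesh={mesh}  binary tree={tree}  MoT={mot}")
```

The combination of a logarithmic diameter (like a tree) with a linear bisection width (like a mesh) is what makes MoT attractive here.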

80-core processor
We can build an 80-core processor delivering a performance of 1 teraflop.
It would utilize an input power of 78.35 W at a clock speed of 3.13 GHz.
When the cores are not needed, the processor would draw only 6.5 W, making it power saving.
This would represent the near future of the CPU industry.
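Using only the figures quoted above, a quick calculation of the implied energy efficiency (flops divided by watts; no additional numbers are assumed):

```python
# Energy efficiency implied by the figures above:
# 1 teraflop of performance at 78.35 W of input power, 6.5 W idle.

peak_flops   = 1e12        # 1 teraflop
active_power = 78.35       # watts while all cores are active
idle_power   = 6.5         # watts when cores are not needed

efficiency = peak_flops / active_power
print(f"~{efficiency / 1e9:.1f} GFLOPS per watt at full load")     # ~12.8
print(f"idle power is {idle_power / active_power:.1%} of active")  # ~8.3%
```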
Conclusion: The proximity of multiple CPU cores on the same die allows the cache
coherency circuitry to operate at a much higher clock rate than is possible if the
signals have to travel off-chip. Combining equivalent CPUs on a single die significantly improves the performance of cache snoop (bus snooping)
operations. Put simply, this means that signals between different CPUs travel shorter
distances, and therefore those signals degrade less. These higher quality signals
allow more data to be sent in a given time period since individual signals can be
shorter and do not need to be repeated as often.
Acknowledgements:
The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mention of the people who made it possible, whose constant guidance and encouragement crowned all our efforts with success.
We consider it our privilege to express our gratitude and respect to all those who guided, inspired, and helped us in the completion of this project; the credit for the project belongs to those listed below.
We are deeply indebted to Prof. Dayanand J. for having consented to be our project guide and for providing invaluable suggestions during the course of the project work.
We are deeply thankful to Prof. S. Arvind, Head of the Department of Computer Science and Engineering, GNDEC, for providing us the facilities necessary to complete the project successfully.
We would like to express our deep sense of gratitude to our principal, Dr. V. D. Mytri, for his continuous effort in creating a competitive environment and encouraging us to bring out the best in ourselves.
Lakhvinder Singh
Harmeet Kaur