Title of the research paper: Performance Analysis of Multicore Systems
Research Area: Multicore Systems
Authors: Lakhvinder Singh & Harmeet Kaur
Faculty mentor: Dayanand J.
Name of the Institution: GURU NANAK DEV ENGG COLLEGE, BIDAR
Abstract: One constant in computing is that the world's hunger for faster performance is never satisfied. Every new performance advance in processors leads to another level of greater performance demands from businesses and consumers. Today these performance demands are not just for speed, but also for smaller, more powerful mobile devices, longer battery life, quieter desktop PCs, and, in the enterprise, better price/performance per watt and lower cooling costs. People want improvements in productivity, security, multitasking, data protection, game performance, and many other capabilities. There is also a growing demand for more convenient form factors for the home, the data center, and on the go. Through advances in silicon technology, micro-architecture, software, and platform technologies, Intel is on a fast-paced trajectory to continuously deliver new generations of multi-core processors with the superior performance and energy efficiency necessary to meet these demands for years to come. In mid-2006, we reached new levels of energy-efficient performance with our Intel® Core™2 Duo processors and Dual-Core Intel® Xeon® processor 5100 series, both produced with our latest 65-nanometer (nm) silicon technology and micro-architecture. Now we are delivering the world's first mainstream quad-core processors for both desktops and mainstream servers: Intel® Core™2 Quad processors, Intel® Core™2 Extreme quad-core processors, and others. This paper explains the advantages and challenges of multi-core processing and the direction in which Intel is taking multi-core processors in the future. We discuss many of the benefits you will see as we continue to increase processor performance, energy efficiency, and capabilities.
Background: For years, Intel customers came to expect a doubling of performance every 18-24 months, in accordance with Moore's Law. Most of these performance gains came from dramatic increases in frequency (from 5 MHz to 3 GHz in the years from 1983 to 2002) and through process technology advancements. Improvements also came from increases in instructions per cycle (IPC). By 2002, however, increasing power densities and the resultant heat began to reveal the limitations of relying predominantly on frequency as a way of improving performance. So, while Moore's Law, frequency increases, and IPC improvements continue to play an important role in performance increases, new thinking is also required. The best example of this new thinking is the multi-core processor. By putting multiple execution cores into a single processor (as well as continuing to increase clock frequency), Intel is able to provide even greater multiples of processing power. Using multi-core processors, Intel can dramatically increase a computer's capabilities and computing resources, providing better responsiveness, improving multithreaded throughput, and delivering the advantages of parallel computing to properly threaded mainstream applications. While manufacturing technology continues to improve, reducing the size of individual gates, the physical limits of semiconductor-based microelectronics have become a major design concern. These physical limitations can cause significant heat dissipation and data synchronization problems. The demand for more capable microprocessors drives CPU designers to use various methods of increasing performance. Instruction-level parallelism (ILP) methods such as superscalar pipelining are suitable for many applications, but are inefficient for others that tend to contain difficult-to-predict code. Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP.
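The power wall described above follows from the standard CMOS dynamic-power relation, roughly P ≈ C·V²·f, where the supply voltage V must generally rise along with the frequency f. The short sketch below is a minimal illustration of this trade-off; all values are normalized assumptions, not measured Intel figures. It shows why two cores at reduced frequency and voltage can deliver more throughput than one fast core for about the same power.

```python
def dynamic_power(c_eff, voltage, freq):
    """Approximate CMOS dynamic power: P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq

# Normalized illustrative values (assumptions, not measured figures).
C = 1.0          # effective switched capacitance per core
V, F = 1.0, 1.0  # voltage and frequency of the baseline single core

single = dynamic_power(C, V, F)

# Two cores, each at 80% frequency; the lower frequency permits
# roughly 80% of the supply voltage as well.
dual = 2 * dynamic_power(C, 0.8 * V, 0.8 * F)

print(f"single-core power: {single:.3f}")              # 1.000
print(f"dual-core power:   {dual:.3f}")                # 1.024
print(f"dual-core peak throughput: {2 * 0.8:.2f}x")    # 1.60x
```

Because power scales with V²·f, halving the work per core cuts its power far more than proportionally, which is the relationship the multi-core approach exploits.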
A combination of increased available die area due to refined manufacturing processes and the demand for increased TLP is the logic behind the creation of multi-core CPUs.
Problem Statement: How can the performance of multi-core systems be increased?
Methodology: The performance of a processor can be increased by raising its clock speed and bus speed. Increasing processor speed also requires a large cache memory, and transistors are the underlying resource for all of these improvements. According to Moore's Law, "the number of transistors that can be integrated on a single chip keeps increasing exponentially," and a processor design is considered better when it achieves its speed with the minimum number of transistors.
A fundamental theorem of the multi-core processor: "The multi-core processor takes advantage of a fundamental relationship between power and frequency." By incorporating multiple cores, each core can run at a lower frequency, dividing among them the power normally given to a single core.
Multi-Threading: Processor designers have found that, since most microprocessors spend a significant amount of time idly waiting for memory, software parallelism can be leveraged to hide memory latency. Since memory stalls typically take on the order of 100 processor cycles, a processor pipeline sits idle for a significant amount of time. Table 1 shows the amount of time spent waiting for memory in some typical applications on 2 GHz processors. For example, for a workload such as a Web server, there are enough memory stalls that the average cost is 1.5-2.5 machine cycles per instruction, leaving the pipeline waiting for memory up to 50% of the time. In Figure 3, we can see that less than 50% of the processor's pipeline is actually being used to process instructions; the remainder is spent waiting for memory. By providing additional sets of registers per processor pipeline, multiple software jobs can be multiplexed onto the pipeline, a technique known as simultaneous multithreading (SMT).
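The latency-hiding idea behind SMT can be sketched in software. The toy Python script below is only an illustration, not a model of real SMT hardware: each `time.sleep` stands in for a memory stall (the 50 ms duration is an arbitrary assumption). Run serially, every stall is paid in full; run as overlapping threads, the stalls hide behind one another, which is exactly the effect multiplexing jobs onto an idle pipeline achieves.

```python
import threading
import time

STALL = 0.05  # seconds; an arbitrary stand-in for a memory stall

def worker():
    # Simulate a job that mostly waits, like a pipeline stalled on memory.
    time.sleep(STALL)

# Serial execution: the four stalls are paid one after another.
t0 = time.perf_counter()
for _ in range(4):
    worker()
serial = time.perf_counter() - t0

# Overlapped execution: four "threads" wait concurrently,
# so each stall is hidden behind the others.
t0 = time.perf_counter()
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
overlapped = time.perf_counter() - t0

print(f"serial:     {serial:.3f}s")      # about 4 stalls back-to-back
print(f"overlapped: {overlapped:.3f}s")  # about 1 stall in total
```

As the text notes, this trades latency for bandwidth: the overlap only helps if the memory system can service the extra concurrent requests.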
Threads are switched onto the pipeline when another thread blocks or waits on memory, allowing the pipeline to be utilized to nearly its maximum. Figure 4 shows an example with four threads per core. In each core, when a memory stall occurs, the pipeline switches to another thread, making good use of the pipeline while the previous memory stall is fulfilled. The tradeoff is latency for bandwidth: with enough threads, we can completely hide memory latency, provided there is enough memory bandwidth for the added requests. Successful SMT systems typically allow for very high memory bandwidth from DRAM as part of their balanced architecture. SMT has a high return on performance in relation to additional transistor count. For example, a 50% performance gain may be realized by adding just 10% more transistors with an SMT approach, in contrast to making the pipeline more complex, which typically affords a 10% performance gain for a 100% increase in transistors. Also, implementing multi-core alone does not yield optimal performance; the best design is typically a balance of multi-core and SMT.
Key Results: Best energy-efficient performance processor transistors:
• Intel's second-generation strained silicon technology increases transistor performance 10 to 15 percent without increasing leakage.
• Compared to 90 nm transistor technology, Intel's enhanced energy-efficient 65 nm transistors provide over 20% improvement in transistor switching speed and over 30% reduction in transistor switching power.
Discussion: This fundamental relationship between power and frequency can be used to multiply the number of cores from two to four, and then to eight and more, delivering continuous increases in performance without increasing power usage. Doing so, however, requires many advancements that are only achievable by a company like Intel.
These include:
• Continuous advances in silicon process technology, from 65 nm to 45 nm and on to 32 nm, to increase transistor density. In addition, Intel is committed to continuing to deliver superior energy-efficient performance transistors.
• Enhancing the performance of each core and optimizing it for multi-core through the introduction of new advanced micro-architectures about every two years.
• Improving the memory subsystem and optimizing data access in ways that ensure data can be shared as fast as possible among all cores. This minimizes latency and improves efficiency and speed.
• Optimizing the interconnect fabric that connects the cores to improve performance between cores and memory units.
Scope for future work (if any): Network-on-chip (NoC): Network-on-chip (NoC) has emerged as a new paradigm for designing multi-core systems. NoC will help in designing future multi-core systems in which large numbers of Intellectual Property (IP) cores are connected to the communication fabric (a router-based network) through network interfaces. The network is used for packet-switched on-chip communication among cores, and it supports a high degree of reusability and scalability. In this work, a scalable network based on the Mesh-of-Tree (MoT) topology has been presented. The MoT interconnection network has the advantage of a small diameter as well as a large bisection width, and it has a nice recursive structure. These characteristics make it more powerful than other interconnection networks such as meshes and binary trees. A generic NoC simulator is designed for performance evaluation of different topologies under different traffic situations, in terms of network throughput, latency, and power.
80-core processor: We can build an 80-core processor with a performance of 1 teraflop. It would utilize an input power of 78.35 W, and its clock speed would be 3.13 GHz. When the cores are not needed, the processor would need only 6.5 watts, thus saving power.
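The 80-core figures above imply an energy-efficiency number worth stating explicitly. The short calculation below uses only the values quoted in the text (1 teraflop, 78.35 W, 80 cores, 6.5 W idle) to derive performance per watt and per-core throughput.

```python
# Figures quoted in the text for the 80-core processor.
flops_total = 1e12    # 1 teraflop
power_w = 78.35       # input power in watts
cores = 80
idle_power_w = 6.5    # power draw when the cores are not needed

gflops_per_watt = flops_total / power_w / 1e9   # overall energy efficiency
gflops_per_core = flops_total / cores / 1e9     # throughput per core

print(f"efficiency: {gflops_per_watt:.2f} GFLOPS/W")          # 12.76 GFLOPS/W
print(f"per core:   {gflops_per_core:.1f} GFLOPS")            # 12.5 GFLOPS
print(f"idle power fraction: {idle_power_w / power_w:.1%}")   # 8.3%
```

So each core would contribute about 12.5 GFLOPS, and idling the cores drops the chip to roughly 8% of its active power, which is the power-saving behavior the text describes.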
This would serve as the near future of the CPU industry.
Conclusion: The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip. Combining equivalent CPUs on a single die significantly improves the performance of cache snoop (bus snooping) operations. Put simply, signals between different CPUs travel shorter distances and therefore degrade less. These higher-quality signals allow more data to be sent in a given time period, since individual signals can be shorter and do not need to be repeated as often.
References:
[1] "...Interconnections", DATE'2000, IEEE Press, 2000, pp. 250-256.
[2] S. Kumar et al., "A Network on Chip Architecture and Design Methodology", IEEE Computer Society Annual Symposium on VLSI, April 2002, pp. 105-112.
[3] SIEMENS, "OMI 324: PI Bus - ver. 0.3d", Munich: Siemens AG, 1994, 35 p.
[4] IBM CoreConnect Bus Architecture, http://www.ibm.com/chips/products/coreconnect/.
[5] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm", IEEE Computer, pp. 70-78, January 2002.
[6] W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks", Proceedings of the 38th Design Automation Conference, ACM/IEEE, Las Vegas, Nevada, USA, pp. 684-689, June 2001.
[7] "Interconnection Network Architectures", http://www.wellesley.edu/cs/courses/cs331/notes/notes-networks.pdf, pp. 26-49, January 2001.
[8] C. Zeferino and A. Susin, "SoCIN: A Parametric and Scalable Network on Chip", Proc. of the 16th Symposium on Integrated Circuits and Systems Design (Sao Paulo, Brazil), IEEE Computer Society Press, Los Alamitos, Calif., pp. 169-174, February 2003.
[9] M. Horowitz and B. Dally, "How Scaling Will Change Processor Architecture", Proc. Int'l Solid State Circuits Conf. (ISSCC), pp. 132-133, Feb. 2004.
[10] S. Kundu and S. Chattopadhyay, "Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture", ACM Great Lakes Symposium on VLSI, Florida, USA, 2008.
[11] P. P. Pande, C. Grecu, M. Jones, A. Ivanov and R. Saleh, "Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures", IEEE Transactions on Computers, Vol. 54, No. 8, August 2005.
[12] P. P. Sotiriadis and A. Chandrakasan, "A Bus Energy Model for Deep Submicron Technology", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 10, No. 3, pp. 341-350, 2002.
[13] www.intel.com
[14] www.intel.com/software/enterprise
Acknowledgements: The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mentioning the people who made it possible, whose constant guidance and encouragement crowned all our efforts with success. We consider it our privilege to express our gratitude and respect to all those who guided, inspired, and helped us in the completion of this project. We are deeply indebted to Prof. Dayanand J. for having consented to be our project guide and for providing invaluable suggestions during the course of the project work. We are deeply thankful to Prof. S. Arvind, Head of the Department of Computer Science and Engineering, GNDEC, for providing us the necessary facilities to complete the project successfully. We would like to express our deep sense of gratitude to our principal, Dr. V. D. Mytri, for his continuous effort in creating a competitive environment and encouraging us to bring out the best in ourselves.
Lakhvinder Singh
Harmeet Kaur