Programming Models and Architectures for ManyCore Systems: Challenges and Opportunities for the next 10 years.
Roberto Vaccaro & Lorenzo Verdoscia
Institute for High Performance Computing and Networking, National Research Council – Italy
roberto.vaccaro@na.icar.cnr.it
Workshop December 19, Napoli - Italy, R. Vaccaro & L. Verdoscia, CNR Bioinformatics

Introduction
■ The computational and storage needs of workloads in several areas, such as life science, are growing exponentially.
■ Overcoming heterogeneity and computing barriers.
– The scientist should be allowed to look at the data
• easily,
• wherever it may be,
• with sufficient processing power for any desired algorithm to process it.

■ In life science the scientist's requirements concern a range of different scales, from the local parallel component processor to the global architectural level of the cross-organizational grid.
■ Integrated solutions capable of addressing the problems at the different architectural levels are needed.

Grid of Clusters (Wide Area Network)
Cluster (Local Area Network)
Commodity Machine (System Level Network)
Microprocessor (Network on Chip)
■ ManyCore Chip
■ Photonic networks for intra-chip, inter-chip, and box interconnects (*)
(*) T. Agerwala, M. Gupta, “Systems research challenges: A scale-out perspective”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 173-180
■ An ensemble of N nodes, each comprising p computing elements
■ The p elements are tightly bound through shared memory (e.g., SMP, DSM)
■ The N nodes are loosely coupled, i.e., distributed memory
■ p is greater than N
■ The distinction is which layer gives us the most power through parallelism

■ Grids are built over wide-area networks and across organisational boundaries.
■ Network latency is not expected to improve (further).
The synchronous approach to distributed programming that currently prevails (using RPC primitives, for example) will have to be replaced with an ASYNCHRONOUS PROGRAMMING APPROACH that is more
- delay-tolerant
- failure-resilient

■ A first step in that direction:
- peer-to-peer (P2P) architectures
- service-oriented architectures (SOA)
capable of supporting reuse of both functionality and data.
■ Using P2P architectures and protocols it is possible to
- realize distributed systems without any centralized control or hierarchical organisation,
- achieve scalable and reliable location and exchange of scientific data and software in a decentralised manner.
■ Service-Oriented Architecture (SOA) and the web-service infrastructures that assist in its implementation facilitate reuse of functionality.
(*) G. Kandaswamy et al., “Building Web Services for Scientific Grid Applications”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 249-260
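The difference between a blocking RPC-style call and the delay-tolerant, failure-resilient asynchronous style argued for above can be sketched in Python. This is only an illustration under stated assumptions: `fetch_dataset`, the site names, the simulated latencies, and the timeout-and-retry policy are hypothetical and do not come from the slides.

```python
import asyncio

async def fetch_dataset(site: str, delay: float, fail: bool = False) -> str:
    """Hypothetical remote service call; `delay` simulates wide-area latency."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{site} unreachable")
    return f"data from {site}"

async def call_with_retry(site, delay, fail=False, timeout=0.2, retries=2):
    """Bound each attempt with a timeout and retry; return None if all attempts fail."""
    for _ in range(retries):
        try:
            return await asyncio.wait_for(fetch_dataset(site, delay, fail), timeout)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # delay-tolerant: abandon this attempt and try again
    return None  # failure-resilient: the caller decides how to handle a miss

async def gather_all():
    # All requests are in flight at once, so the total time is governed by
    # the slowest call rather than by the sum of all latencies.
    return await asyncio.gather(
        call_with_retry("site-a", 0.01),
        call_with_retry("site-b", 0.02),
        call_with_retry("site-c", 0.01, fail=True),  # always raises
        call_with_retry("site-d", 5.0),              # always times out
    )

results = asyncio.run(gather_all())
```

A slow or failed site yields `None` for that entry instead of blocking or aborting the whole computation, which is the property a synchronous RPC chain lacks.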
■ The fundamental primitive of an SOA infrastructure is the ability to locate and invoke a service across machine and organisational boundaries, in both a synchronous and an asynchronous manner.
■ Computational scientists will be able to flexibly orchestrate SOA services into computational workflows.

■ Appropriate programming language abstractions for science have to be provided.
■ Fortran and the Message Passing Interface (MPI) are no longer appropriate for the architecture described above.
■ By using abstract machines it is possible to mix compilation and interpretation, as well as to integrate code written in different languages seamlessly into an application or service.

A viable approach
■ Define a Multilevel Integrated Programming Model
■ Explore the management of concurrency in processor design on a range of different scales:
- from instructions to programs
- from microgrids to global grids
■ Evaluate the possibility and the means of implementing an integrated H/W and S/W system capable of giving the right answer in terms of:
- Inter/intra-processor latency.
- A more delay-tolerant and failure-resilient programming approach.
- Capability of data and functionality reuse at the global architecture level (distributed, cross-organisational).
- Capability to take advantage of parallel and distributed resources.

By Little’s law, the amount of concurrency needed to hide the latency of memory accesses will continue to increase as the gap between memory and processor speed grows.
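Little's law can be made concrete with a small calculation: the number of memory accesses that must be in flight equals the access rate multiplied by the latency. The sketch below is an illustration, not a measurement. The 200-cycle DRAM latency is the figure quoted later in the deck, the issue rate of one access every two cycles is an assumption, and the 20% and 6% annual improvement rates are the ones this slide quotes.

```python
def required_concurrency(latency_cycles: float, accesses_per_cycle: float) -> float:
    """Little's law: items in flight = arrival rate x time spent in the system."""
    return latency_cycles * accesses_per_cycle

def relative_gap(years: int, cpu_rate: float, mem_rate: float) -> float:
    """Growth of the processor/memory speed ratio after `years` of compound improvement."""
    return ((1 + cpu_rate) / (1 + mem_rate)) ** years

# Assumed example: 200-cycle memory latency, one access issued every 2 cycles.
in_flight_now = required_concurrency(200, 0.5)

# Processor speed improving 20% per year, memory latency 6% per year.
gap_10y = relative_gap(10, 0.20, 0.06)

# If latency measured in processor cycles grows with the gap,
# the concurrency needed to hide it grows by the same factor.
in_flight_10y = in_flight_now * gap_10y
```

Under these assumptions the gap still grows by roughly a factor of 3.5 over ten years, so the required concurrency grows by the same factor even at the slower 20% rate.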
Since memory latency is improving at a rate of only roughly 6% each year, the gap is projected to continue growing even as the increase in processor speed slows from the historic rate of about 60% each year to about 20% each year.

Computer hardware industry
In 2005 the computer hardware industry made a historic change of direction.
● The major microprocessor companies all announced that future products would be single-chip multiprocessors
● Future performance improvements would rely on
○ software-specified parallelism rather than
○ additional software-transparent parallelism extracted automatically by the microarchitecture

■ It is significant that a multibillion-dollar industry has bet its future on solving the general-purpose parallel computing problem, even though so many have previously attempted, and failed, to provide a satisfactory approach.
■ To tackle the parallel processing problem, innovative solutions are urgently needed, which in turn require extensive co-development of hardware and software.

■ Advances in integrated circuit technology pose new challenges about how to implement a high-performance application at low power dissipation on processors made of hundreds of cores running at 200 MHz, rather than on one traditional processor running at 20 GHz.
■ The high-performance and embedded industries are converging.

Multicore or Manycore?
■ Multicore will obviously help multiprogrammed workloads, which contain a mix of independent sequential tasks, but how will individual tasks become faster?
■ Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance.
■ Multicore is unlikely to be the ideal answer, and sneaking up on the problem of parallelism via multicore solutions is likely to fail.

■ We desperately need a new solution for parallel hardware and software.
■ Compatibility with old binaries and C programs is valuable to industry, and some researchers are trying to help multicore product plans succeed.
■ We have been thinking bolder thoughts. Our aim is to realize thousands of processors on a chip for new applications, and we welcome new programming models and new architectures if they simplify the efficient programming of such highly parallel systems.
■ Rather than multicore, we are focused on “manycore”.

■ Between February 2005 and December 2006 a group of researchers at the University of California at Berkeley, from many backgrounds (circuit design, computer architecture, massively parallel computing, computer-aided design, embedded h/w and s/w, programming languages, compilers, scientific programming, and numerical analysis), met to discuss parallelism from these many angles.
■ The result of borrowing the good ideas regarding parallelism from different disciplines is the report
“The Landscape of Parallel Computing Research: A View from Berkeley”
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A.
Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick
Electrical Engineering and Computer Sciences, University of California at Berkeley
Technical Report No. UCB/EECS-2006-183
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
December 18, 2006

The Landscape
■ Seven critical questions used to frame the landscape of parallel computing research:
1. What are the applications?
2. What are common kernels of the applications?
3. What are the hardware building blocks?
4. How to connect them?
5. How to describe applications and kernels?
6. How to program the hardware?
7. How to measure success?
■ This report does not have the answers:
- on some questions, non-conventional and provocative perspectives are offered,
- on others, seemingly obvious but sometimes-neglected perspectives are stated.

Embedded versus High Performance Computing
They have more in common looking forward than they did in the past:
1. Both are concerned with power, whether it is battery life for cell phones or the cost of electricity and cooling in a data center.
2. Both are concerned with hardware utilization. Embedded systems are always sensitive to cost, but efficient use of hardware is also required when you spend $10M to $100M for high-end servers.
3. As the size of embedded software increases over time, the fraction of hand tuning must be limited and so the importance of software reuse must increase.
4. Since both embedded and high-end servers now connect to networks, both need to prevent unwanted accesses and viruses.
■ The biggest difference between the two targets is the traditional emphasis on real-time computing in embedded systems, where the computer and the program need to be just fast enough to meet the deadlines, and there is no benefit to running faster.
■ Running faster is usually valuable in server computing.
■ As server applications become more media-oriented, real time may become more important for server computing as well.

Information Society Technologies (IST)
Network of Excellence on High Performance Embedded Architectures and Compilers (HiPEAC)
Mateo Valero (UPC Barcelona), HiPEAC Coordinator, introducing the publication of the first HiPEAC research roadmap (*), wrote:
“From the document it is clear that there are many challenges ahead of us in the design of future high-performance embedded systems. Some of them are familiar such as the memory wall, the power problem, and the interconnection bottleneck. Others are new like the proper support for reconfigurable components, fast simulation techniques for multi-core systems, new programming paradigms for parallel programming.”
(*) K. De Bosschere, W. Luk, X. Martorell, N. Navarro, M. O’Boyle, D. Pnevmatikatos, A. Ramirez, P. Sainrat, A. Seznec, P. Stenström, and O. Temam. “High-Performance Embedded Architecture and Compilation Roadmap”, Transactions on HiPEAC I, Lecture Notes in Computer Science 4050, pp. 5-29, Springer-Verlag, 2007

Parallelism
For at least three decades the promise of parallelism has fascinated researchers.
■ In the past, parallel computing efforts have shown promise and gathered investment, but in the end, uniprocessor computing always prevailed.
■ This time, general-purpose computing is taking an irreversible step toward parallel architectures
● This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism
● This plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures

CW in Computer Architecture
Old & New Conventional Wisdom (CW) in Computer Architecture: guiding principles illustrating how everything is changing in computing
1. Old CW: Power is free, but transistors are expensive.
▪ New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power.
▪ New CW: For desktops and servers, static power due to leakage can be 40% of total power.
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
▪ New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates.
4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs.
▪ New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability, clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.
5. Old CW: Researchers demonstrate new architecture ideas by building chips.
▪ New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.
6. Old CW: Performance improvements yield both lower latency and higher bandwidth.
▪ New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency.
7. Old CW: Multiply is slow, but load and store is fast.
▪ New CW is the “Memory wall”: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems.
▪ New CW is the “ILP wall”: There are diminishing returns on finding more ILP.
9. Old CW: Uniprocessor performance doubles every 18 months.
▪ New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.
10. Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
▪ New CW: It will be a very long wait for a faster sequential computer.
11. Old CW: Increasing clock frequency is the primary method of improving processor performance.
▪ New CW: Increasing parallelism is the primary method of improving processor performance.
12.
Old CW: Less than linear scaling for a multiprocessor application is failure.
▪ New CW: Given the switch to parallel computing, any speedup via parallelism is a success.
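The doubling times in CW pair 9 can be translated into annual growth rates, and back, with compound-growth arithmetic. The sketch below is a worked calculation only. The helper names are hypothetical, and the 52% annual rate used as an example is the figure Hennessy and Patterson report for 1986 to 2002.

```python
import math

def annual_growth_for_doubling(years: float) -> float:
    """Annual growth rate that doubles performance in `years`."""
    return 2 ** (1 / years) - 1

def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a given compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

rate_18_months = annual_growth_for_doubling(1.5)  # old CW: doubling every 18 months
rate_5_years = annual_growth_for_doubling(5.0)    # new CW: doubling may take 5 years
doubling_at_52 = doubling_time_years(0.52)        # 52% per year, 1986 to 2002
```

Doubling every 18 months corresponds to about 59% per year, while doubling every 5 years corresponds to about 15% per year; a 52% annual rate doubles performance in roughly 20 months.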
Uniprocessor Performance (SPECint)
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
Sea change in chip design: multiple “cores” or processors per chip
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

The State of Hardware
■ The analysis based on the CW pairs paints a negative picture of the state of hardware.
■ There are compensating positives as well:
● Moore’s Law continues: it will soon be possible to put thousands of simple processors on a single, economical chip;
● Communication between these processors within a chip can have very low latency and very high bandwidth;
● Monolithic manycore microprocessors
- represent a very different design point from traditional multichip multiprocessors
- provide promise for the development of new architectures and programming models.

Applications and Dwarfs
■ Mining the parallelism experience of the high-performance computing community to see if there are lessons we can learn for a broader view of parallel computing. The hypothesis
● is not that traditional scientific computing is the future of parallel computing
● is that the body of knowledge created in building programs that run well on massively parallel computers may prove useful in parallelizing future applications
■ Many of the authors from other areas, such as embedded computing, were surprised at how well future applications in their domain mapped closely to problems in scientific computing.
■ The conventional way to guide and evaluate architecture innovation is to study a benchmark suite based on existing programs, such as EEMBC (Embedded Microprocessor Benchmark Consortium), SPEC (Standard Performance Evaluation Corporation), or SPLASH (Stanford Parallel Applications for Shared Memory).

■ It is currently unclear how best to express a parallel computation: a very big obstacle to innovation in parallel computing.
■ It seems unwise to let a set of existing source code drive an investigation into parallel computing.
■ There is a need to find a higher level of abstraction for reasoning about parallel application requirements.
■ The main aim is to delineate application requirements in a manner that is not overly specific to individual applications or to the optimizations used for certain hardware platforms.
■ It is then possible to draw broader conclusions about hardware requirements.
■ The approach is to define a number of “Dwarfs”, each of which captures a pattern of computation and communication common to a class of important applications.

■ Phil Colella identified seven numerical methods that he believed will be important for science and engineering for at least the next decade
■ Seven Dwarfs
● constitute classes where membership in a class is defined by similarity in computation and data movement
● are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of applications
Seven Dwarfs, their descriptions, corresponding NAS benchmarks, and example computers.

Extensions to the original Seven Dwarfs.

Recognition, Mining, Synthesis (RMS)
Intel “Era of Tera” Computation Categories
Intel’s RMS and how it maps down to functions that are more primitive. Of the five categories at the top of the figure, Computer Vision is classified as Recognition, Data Mining is Mining, and Rendering, Physical Simulation, and Financial Analytics are Synthesis. [Chen 2006]

Parallel Programming Models
Comparison of 10 current parallel programming models for 5 critical tasks, sorted from most explicit to most implicit.
High-performance computing applications [Pancake and Bergmark 1990] and embedded applications [Shah et al 2004a] suggest these tasks must be addressed one way or the other by a programming model:
1) dividing the application into parallel tasks;
2) mapping computational tasks to processing elements;
3) distribution of data to memory elements;
4) mapping of communication to the interconnection network; and
5) inter-task synchronization.

Limits of Performance of Dwarfs
Limits to performance of dwarfs, inspired by a suggestion by IBM that a packaging technology could offer virtually infinite memory bandwidth. While the memory wall limited performance for almost half the dwarfs, memory latency is a bigger problem than memory bandwidth.

[Figure: Transistor integration capacity]
[Figure: Pollack's Rule]
[Figure: Frequency and power consumption]
[Figure: Illustration of a many-core system]
[Figure: Amdahl's Law limits parallel speedup]
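Amdahl's Law, named on the slide above, bounds the speedup of a program whose serial part cannot be parallelized. The short sketch below evaluates the standard formula; the parallel fractions and core counts are illustrative assumptions, not values taken from the figure.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup on `cores` processors when `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Illustrative table: rows are parallel fractions, columns are core counts.
core_counts = (4, 64, 1024)
table = {
    p: [round(amdahl_speedup(p, n), 1) for n in core_counts]
    for p in (0.50, 0.90, 0.99)
}
for fraction, speedups in table.items():
    print(fraction, speedups)
```

With 90% of the work parallel, even 1024 cores stay below a 10x speedup, which is why the serial fraction matters so much for a manycore design.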
[Figure: Performance of large, medium, and small cores]
[Figure: Fine grain power management]
[Figure: Network power estimate]
[Figure: Three dimensional interconnect with stacking]
[Figure: Assembly of 3D memory]

Recommended points from Berkeley
■ The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
■ The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS per watt, MIPS per area of silicon, and MIPS per development dollar.
■ Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. A dwarf is an algorithmic method that captures a pattern of computation and communication.
■ “Autotuners” should play a larger role than conventional compilers in translating parallel programs.
■ To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
■ To be successful, programming models should be independent of the number of processors.
■ To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
■ Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.
■ Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.
■ To explore the design space rapidly, use system emulators based on FPGAs that are highly scalable and low cost.
Maybe they missed some key points, for example: whenever possible, computation should execute in an asynchronous manner.

Why Asynchronous
■ Low power consumption, … due to fine-grain clock gating and zero standby power consumption.
■ High operating speed, … operating speed is determined by actual local latencies rather than global worst-case latency.
■ Less emission of electro-magnetic noise, … the local clocks tend to tick at random points in time.
■ Robustness towards variations in supply voltage, temperature, and fabrication process parameters, … timing is based on matched delays (and can even be insensitive to circuit and wire delays).
■ Better composability and modularity, … because of the simple handshake interfaces and the local timing.
■ No clock distribution and clock skew problems, … there is no global signal that needs to be distributed with minimal phase skew across the circuit.

[Figure: Auto-tuners]
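The idea behind an autotuner can be sketched in a few lines: rather than relying on a compiler to pick one code variant, run several variants on the target machine and keep the fastest. The example below is a toy illustration under stated assumptions. The kernel `blocked_sum`, the candidate block sizes, and the timing policy are hypothetical and are not taken from the Berkeley report.

```python
import time

def blocked_sum(data, block):
    """Sum `data` in chunks of `block` elements; the result does not depend on `block`."""
    total = 0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total

def autotune(kernel, data, candidates, repeats=3):
    """Time each candidate parameter on this machine and return the fastest one."""
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings

data = list(range(100_000))
candidates = (16, 256, 4096)
best_block, timings = autotune(blocked_sum, data, candidates)
```

The selected block size may differ from machine to machine and from run to run, which is the point: the choice is made by measurement rather than by a fixed model of the hardware.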
Computational Model
■ Designing clever parallel hardware and then working out how to program it is a big mistake.
■ Designing parallel programming languages and then working out how to implement them is usually a mistake.
■ Developing the right computational model alongside languages and hardware is the key.

■ Think about systems, not just hardware or software.
■ There is lots of (possibly) relevant work, e.g.
- Dataflow (Single Assignment)
- Graph Rewriting (Functional Languages)
- Bulk Synchronous Parallelism (BSP)
- Transactional Memory
■ Don’t ignore previous work, and in particular don’t re-invent the wheel!

Language Effectiveness
[Figure: Language Effectiveness, 1970 to 2005, with C, C++, and Java marked; vertical axis from 0 to 40]
[Figure: Language Effectiveness compared with Moore's Law, 1970 to 2005; vertical axis from 1 to 10,000,000]

CISC Architecture
■ Huge effort has gone into improving the performance of a sequential instruction stream
■ Complexity has grown unmanageable
■ Even with 1 billion transistors on a chip, what more can be done?
Pipelining, Branch Prediction, Out-of-Order Execution, Prefetching, Renaming, Speculative Execution, Value Prediction

[Figure: TRIPS Prototype]
Cyclops-64 Architecture
[Figure: Cyclops-64 programming models and system software support. Recoverable labels: Application Programming API (Co-array Fortran, UPC+/-, EARTH-C +/-, OpenMP-XN, MPI); Advanced Execution/Programming Model (Percolation, Location Consistency, Load Balancing, Scheduling); Base Execution Model (Fine-Grain Multithreading, e.g. EARTH, CARE); Cyclops Thread Virtual Machine (Thread Management, Thread Creation & Termination, Thread Synchronization, Shared Memory Operations, Dynamic memory management, async function invocation, acquire / release, fibers, Put / get with sync); Tool chain (Kcc/gcc Compiler); Infrastructure and Tools (System Software, Simulation / Emulation, Analytical Modeling); Cyclops-64 ISA]
[Figure: Cyclops-64 chip and system. Recoverable labels: TU, FPU, SP, MEMORY BANK, 24x24 Crossbar Network, A-Switch, DMA, Off-Chip Memory, 4 GB/sec, 1 Gbit/s ethernet, IDE HDD, 50 MB/sec, Communication Ports for 3D Mesh Inter-Chip Network, Other Chips via 3D mesh, 24 PC cards in 1 shishkebab, 1 PetaFlops]

hHLDS
The homogeneous High Level Dataflow System (hHLDS) model
Firing rules in the classical model
Let $A = \{a_1, \dots, a_n\}$ be the set of actors and $L = \{l_1, \dots, l_n\}$ be the set of links.
A dataflow graph is a labelled directed graph $G = (N, E)$, where
$N = A \cup L$ is the set of nodes,
$E \subseteq (A \times L) \cup (L \times A)$ is the set of edges.
Firing of an actor: a token on each input link and no token on each output link.
The hHLDS model
Special actors in the classical model are characterized by having heterogeneous I/O conditions.
[Figure: special actors of the classical model: Merge, Switch, Gate, Decider]

Any actor has two input links and one output link, and consumes and produces only data tokens.
Firing of an actor: a token on each input link.
Effect: consumes all input tokens and can produce a token on its output link.
[Figure: example hHLDS graphs for a+b*c and for "If b≤c then a +"]

The hHLDS model
Comparison between the two models
input (a, c)
b := 1;
repeat
  if a > 1 then a := a \ 2
  else a := a * 5
  b := b * 3;
until b = c;
output (d)
[Figure: dataflow graphs of the program above in the two models, panels a) and b)]

Dataflow Computational Model
[Figure: dataflow computational model. Recoverable labels: memory, initial values, results, data]
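The hHLDS firing rule stated above (two input links, one output link, fire when a token is present on each input) can be sketched as a tiny data-driven interpreter. This is a simplified illustration, not the authors' implementation: the `Actor` class, the link names, and the scheduling loop are hypothetical, each link holds at most one token, and every firing produces an output token, whereas the slide says an actor "can" produce one.

```python
import operator

class Actor:
    """An hHLDS-style actor: two input links, one output link, data tokens only."""

    def __init__(self, op, left, right, out):
        self.op, self.left, self.right, self.out = op, left, right, out

    def can_fire(self, tokens):
        # Firing rule: a token is present on each input link.
        return self.left in tokens and self.right in tokens

    def fire(self, tokens):
        # Effect: consume both input tokens and place a token on the output link.
        a = tokens.pop(self.left)
        b = tokens.pop(self.right)
        tokens[self.out] = self.op(a, b)

def run(actors, tokens):
    """Fire enabled actors until none can fire; no program counter is involved."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire(tokens):
                actor.fire(tokens)
                fired = True
    return tokens

# Graph for a + b*c. The adder is listed first but can only fire second,
# because execution order is decided by token availability alone.
graph = [
    Actor(operator.add, "a", "bc", "result"),
    Actor(operator.mul, "b", "c", "bc"),
]
final = run(graph, {"a": 2, "b": 3, "c": 4})
```

After the run, the only remaining token sits on the `result` link, since every input token has been consumed along the way.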