Technically Superior But Unloved: A Multi-Faceted Perspective on Multi-core's Failure to Meet Expectations in Embedded Systems

by

Daniel Thomas Ledger

B.S. Electrical Engineering, Washington University in St. Louis, 1996
B.S. Computer Engineering, Washington University in St. Louis, 1997

SUBMITTED TO THE SYSTEM DESIGN AND MANAGEMENT PROGRAM IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ENGINEERING AND MANAGEMENT AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY

JUNE 2011

© 2011 Daniel Thomas Ledger. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: Daniel Thomas Ledger, Fellow, System Design and Management Program, May 6th, 2011

Certified By: Michael A.M. Davies, Senior Lecturer, Engineering Systems Division and the Sloan School of Management, Thesis Supervisor

Accepted By: Patrick Hale, Senior Lecturer, Engineering Systems Division, Director, System Design and Management Fellows Program

Technically Superior But Unloved: A Multi-Faceted Perspective on Multi-core's Failure to Meet Expectations in Embedded Systems

By Daniel Thomas Ledger

Submitted to the System Design and Management Program on May 6th, 2011 in Partial Fulfillment of the Requirements for the Degree of Master of Science in Engineering and Management

Abstract

A growing number of embedded multi-core processors from several vendors now offer several technical advantages over single-core architectures. However, despite these advantages, adoption of multi-core processors in embedded systems has fallen short of expectations and has not increased significantly in the last 3-4 years. There are several technical challenges associated with multi-core adoption that have been well studied and are commonly used to explain slow adoption.
This thesis first examines these technical challenges of multi-core adoption from an architectural perspective through the examination of several design scenarios. We first find that the degree of the technical challenge is highly related to the degree of architectural change required at the system level. When the nature of adoption requires higher degrees of architectural change, adoption is most difficult due to the destruction of existing product design and knowledge assets. However, where adopting firms can leverage existing architectural design patterns to minimize architectural change, adoption can be significantly easier. In addition to the architectural challenges, this thesis also explores several other factors that influence adoption related to management strategy, organization, ecosystem, and human cognitive and behavioral tendencies. Finally, this thesis presents a set of heuristics for potential adopters of multi-core technology to assess the suitability and risk of multi-core technology for their firm's products, and a second set of heuristics for supplier firms developing or selling multi-core processors to determine their likely success.

Thesis Supervisor: Michael A.M. Davies
Title: Senior Lecturer, Engineering Systems Division and Sloan School of Management

Acknowledgements

I would like to offer my gratitude to the colleagues who have contributed to this thesis and my degree at MIT. Thank you for your precious time, your ideas, and the great discussions; your insights have been invaluable in shaping this thesis.

To the community of students and professors at MIT, thank you for an incredible experience over the last 30 months. It's been a pleasure and an honor getting to know so many wonderful and talented people.

To my thesis advisor, Michael Davies, thank you for the time, support and encouragement over the last year. The knowledge and guidance you've provided as both a professor and a thesis advisor have been so valuable.
To Pat Hale and the SDM team, thank you for creating and running such an incredible program.

To my better half, Lauren, and our two young boys, Andrew and David: thank you for the love, patience, support, compassion, understanding and help over the last 30 months. It goes without saying that none of this would have been possible without you.

To my friends and extended family, thank you all for the love, support and tolerance.

To my employer, Analog Devices, thank you for the flexibility and support in allowing me to pursue this degree on a part-time basis.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Figures
1. Introduction and Motivation
2. Theory Creation Methodology
3. Architecture, Innovation and Dominant Designs
   Structure & Architecture
      Complexity and Complicatedness
      Decomposition & Modularity
      Hierarchy
      Design Patterns
   Dynamics of Technology Evolution and Innovation
      Product Knowledge and Assets
      Dominant Designs
      Technology Innovation
4. Embedded Systems
   Embedded Systems and Embedded Processors
   Embedded Operating Systems
   System and Processor Diversity
   More Limited Resources
   Platform "Stickiness"
   Product Life Cycle & Legacy
   Real Time Constraints
   Platform Certification
   Summary
5. Forms of Parallelism & Concurrency
   Granularity
   Types of parallelism
      Bit Level Parallelism
      Instruction Level Parallelism (ILP)
      Task Parallelism & Operating Systems
   Summary
6. Attributes of Multi-core Processors
   Number of cores
   Types of Cores - Homogeneous and Heterogeneous
   Resource Sharing
   Memory Architecture
      Shared Memory
      Distributed Memory
      Hybrid Variants
   Multi-Threading
   Asymmetric Multiprocessing (AMP) and Symmetric Multiprocessing (SMP)
   The Future of SMP
7. Embedded Multi-core Processors Adoption
   Benefits of Multi-core in Embedded Systems
      Performance
      Power Dissipation
      Size / Density / Cost
   Architectural Factors & Challenges
      Structure / Partitioning
      Dynamics / Interactions
      System Optimization and Debug
   Adoption Scenarios
      Current Multi-core Adoption Patterns
      Case 1: Existing Multiprocessor Hardware Design Pattern
      Case 2: Existing Software Design Pattern
      Case 3: No Existing Patterns with Legacy Code
      Case 4: No Existing Patterns with New Code
      Summary of Adoption Scenarios
8. System Level Factors and Challenges
   Factors and Dynamics within an Adopting Firm
   Factors and Dynamics Surrounding a Firm
   Human / Cognitive Factors and Dynamics
9. Adoption Heuristics
   Heuristics for Adopters of Multi-core Embedded Processors
      Nature of the processing
      Existence of Legacy Code
      Suppliers & Ecosystem
   Heuristics for Suppliers of Multi-core Embedded Processors
      Competencies
      Product Attributes
      Target Market
      Target Customers
      Architectural Compatibility
      Competitive Landscape
      Ecosystem
10. Conclusions
11. Appendix A: Libraries and Extensions Facilitating the Programming of Multi-core Processors
12. Bibliography

List of Figures

Figure 1: Thesis structure
Figure 2: Observation, Categorization, Formulation Pyramid
Figure 3: Containment hierarchy example - Linux file structure
Figure 4: Modular System Level Innovation, Radical Sub-System Level Innovation
Figure 5: 2008 Microcontroller Revenue by Market (Databeans, 2008)
Figure 6: OS market share for desktop computers
Figure 7: Most commonly used operating systems in embedded systems
Figure 8: 2008 Microprocessor Revenue by Processor Type (Databeans, 2008)
Figure 9: Threads and Processes - Possible Configurations
Figure 10: POSIX APIs
Figure 11: Intel Media Processor CE 3100 Block Diagram
Figure 12: Apple A5 dual-core ARM A9
Figure 13: Intel "Yorkfield" Quad-core MCM
Figure 14: Memory access times (core clock cycles) for different levels of memory
Figure 15: Memory Latency on ARM Cortex A9
Figure 16: Shared Memory Architecture
Figure 17: Shared Memory Architecture with Cache Coherency
Figure 18: Distributed Memory Architecture
Figure 19: Single Threading vs. Super-threading vs. Simultaneous Multithreading
Figure 20: AMP Configuration
Figure 21: SMP Configuration
Figure 22: Intel Processor Clock Speed by Year
Figure 23: Power Consumption Comparison of single and dual core implementations of the Freescale MPC8641
Figure 24: Dynamic Current over Frequency for the ADSP-BF533 Embedded Processor
Figure 25: Multi-core architecture challenges
Figure 26: Cisco Telepresence System
Figure 27: Performance scaling by number of threads and percentage of parallel code
Figure 28: Multi-core performance vs. number of cores
Figure 29: Performance impact as a function of cores (2-3 threads per core)
Figure 30: Thread scaling with exaggerated synchronization overhead
Figure 31: Case 1, heterogeneous architecture
Figure 32: Case 1, homogeneous architecture
Figure 33: Case 1, no new resource sharing
Figure 34: ADSP-14060 Quad processor DSP
Figure 35: Case 2, new resource sharing
Figure 36: C6474 Block Diagram
Figure 37: Case 2, Symmetric Multiprocessing
Figure 38: Migration from single-core to dual-core with SMP & POSIX
Figure 39: Case 3, Migration to homogeneous and heterogeneous scenarios
Figure 40: Case 4, New development to homogeneous SMP, homogeneous AMP or heterogeneous MC
Figure 41: Multi-core Adoption Scenarios
Figure 42: Layers of Adoption Factors & Dynamics
Figure 43: WW Respondents Working with/Programming Multi-core and/or Multiprocessor Designs
Figure 44: Xilinx Zynq-7000 Extensible Processing Platform
Figure 45: Nvidia Tegra 2 Processor
Figure 46: Performance and Power Gains with Hardware Acceleration (Frazer, 2008)
Figure 47: Cost per gate by process node (Kumar, 2008)
Figure 48: SPI's High End SP16HP-G220 Processor Block Diagram
Figure 49: Tilera's TILE64 Processor Block Diagram
Figure 50: "Valley of Death" for a revolutionary technology (Golda, et al., 2007)
Figure 51: "Valley of Death" for an evolutionary technology (Golda, et al., 2007)

1. Introduction and Motivation

A growing number of embedded multi-core processors from several vendors now offer several technical advantages over single-core architectures. Multi-core processors offer increased computational density: a quad-core processor has 4x the theoretical computational power of a single-core version of that device. Multi-core processors may use less power to accomplish a similar task, since having two cores running at a slower clock speed can be more power efficient than a single core running at a higher clock speed.¹ Multi-core processors are smaller and typically less expensive than multiple single-core devices; it's possible, for example, to migrate an existing design that used multiple discrete single-core devices onto a single multi-core device. For some applications, multi-core provides increased reliability by reducing the number of discrete parts.

Despite the numerous advantages that multi-core architectures offer, developing a product using a multi-core processor architecture is challenging. Over the years, there has been a great deal of research aimed at studying the technical challenges of multi-core and the concomitant challenge of concurrent programming, and at proposing new ways to approach them. In the course of researching this thesis, I came across several papers, articles, blog posts, and forum threads describing the difficulties associated with concurrent programming and multi-core architectures.

Adoption of multi-core processors in embedded systems has not increased significantly in the last 3-4 years. In the embedded space, multi-core processor usage increased only 6% across systems that use multiple processors between 2007 and 2010 (UBM/EE Times Group, 2010).
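The power claim above follows from the standard dynamic-power relation for CMOS logic, roughly P ≈ C·V²·f, combined with the fact that a lower clock frequency usually permits a lower supply voltage. The sketch below uses purely illustrative operating points (not taken from this thesis or from any specific processor) to show two cores at half the clock delivering the same aggregate cycle rate for less power:

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """Approximate dynamic power (watts) of a CMOS core: P = C * V^2 * f."""
    return c_eff * voltage**2 * freq_hz

# Hypothetical operating points, for comparison only.
single_core = dynamic_power(1e-9, 1.2, 1e9)        # one core: 1 GHz at 1.2 V
dual_core = 2 * dynamic_power(1e-9, 1.0, 500e6)    # two cores: 500 MHz at 1.0 V

# Same total cycles per second (2 x 500 MHz = 1 GHz), but less total power,
# because dynamic power falls with the square of the supply voltage.
assert dual_core < single_core
```

With these illustrative numbers the single core dissipates about 1.44 W while the pair dissipates about 1.0 W; the advantage disappears, of course, if the supply voltage cannot be lowered along with the clock.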
This thesis first explores these technical challenges of multi-core adoption from an architectural perspective through the examination of several design scenarios. We first find that the degree of the technical challenge is highly related to the degree of architectural change required at the system level. When the nature of adoption requires higher degrees of architectural change, adoption is most difficult due to the destruction of existing product design and knowledge assets. However, where adopting firms can leverage existing architectural design patterns to minimize architectural change, adoption can be significantly easier and hence faster. This thesis also explores several other factors that affect adoption mechanisms at the organizational, managerial, value chain and cognitive levels.

¹ See Figure 23.

This thesis is arranged in five parts as shown in Figure 1. Chapter 2 provides a high-level framework that is used in developing theories about multi-core adoption. Chapter 3 then introduces several concepts related to system architecture and technology innovation that will be used throughout the thesis. Chapters 4, 5 and 6 provide contextual information on the nature of embedded systems, embedded processors, parallel architectures, and multi-core processors. Chapters 7 and 8 explore multi-core adoption patterns, categories and causal mechanisms at the architectural level, as well as those related to management strategy, organization, ecosystem/value chain, and human behavior and cognition. Finally, Chapter 9 proposes two sets of heuristics. The first set of heuristics predicts the likelihood of success of a firm adopting a multi-core processor in a product design on the demand side. The second set of heuristics characterizes successful multi-core product offerings on the supply side.

Figure 1: Thesis structure

2.
Theory Creation Methodology

Multi-core processors offer several technical advantages over single-core processors, yet despite these advantages, adoption in the embedded space has been slow and has fallen short of expectations. The goal of this thesis is to establish a set of theories to explain this anomaly and to use these theories to predict successful adoption patterns for multi-core processors. We will do this by first categorizing and studying the various causal mechanisms that lead to adoption, and then developing a set of heuristics which can be used to predict adoption patterns for adopters and prescribe successful strategies for suppliers.

There is a great deal of literature and commentary around the challenges of multi-core development, and it is tempting to conclude that multi-core adoption is happening slowly just because it is hard. However, multi-core has become established in several types of embedded applications like wireless infrastructure and networking equipment. And despite slow adoption in general, multi-core processors are being rapidly adopted in some specific segments such as smart phones and tablets. These phenomena are also anomalies that warrant robust explanations: a well-grounded theory which explains the underlying causal mechanisms that have led to these observations about the adoption of multi-core in embedded systems.

Clayton Christensen notes that several management books today present management theories which prescribe a series of actions because those actions have led to certain results for some firms in the past. However, these texts often fail to present how and why the actions lead to the desirable results. They highlight a correlation between an action and a result without understanding and presenting the causal mechanism that connects the two. Attempting to repeat an action that correlates with desirable results, and expecting the same results, can be a very disappointing exercise (Christensen, et al., 2003).
This tendency to rely on correlation as a comfortable substitute for causality is deeply ingrained in our thinking and behavior. Our minds establish mental models of our complex surroundings based primarily on observable causal relationships in our environments. Unbeknownst to us, our minds will seamlessly default to causal relationships as a means to explain phenomena and predict future occurrences (Sterman, 2000).

There has been a great deal of research on this subject. The field of system dynamics in particular focuses heavily on the limitations of the mental models we create to explain the world around us. John D. Sterman cites several important pieces of research in his book, Business Dynamics. He cites work by Axelrod, Hall, Dörner, Plous and Brehmer that suggests the following limitations. The concept of feedback is often absent from our mental models/cognitive maps (Axelrod). We tend to think in "single strand causal series" and find it difficult to comprehend systems with feedback loops and multiple causal pathways (Dörner). Furthermore, we tend to think that each effect comes from a single cause (Plous). Finally, people have difficulty understanding the relationships between phenomena when random error, nonlinearity and negative correlation are involved (Brehmer).

Sterman also notes that we have a very short-term memory for cause and effect; when events are separated by more than a few minutes, it's often very difficult for us to associate them. We struggle with complex systems and the dynamics of those systems, and we will, to the best of our abilities, attempt to use correlation to explain the behaviors of systems simply because that's how our minds are designed to work.

When it comes to forming theories, another important tendency we have as humans (that we're also typically unaware of) is our tendency to filter information based on preexisting beliefs.
An existing established paradigm or belief may suppress our ability to perceive data that is inconsistent with that paradigm, which limits our ability to see new paradigms emerging (Kuhn, 1970). Once we believe the world is a certain way, we cannot easily see evidence that suggests differently.

So it's only natural that we as humans are often satisfied with a correlation between events to explain causation, because our minds make it feel so convincing. Yet as we deal with increasingly complex systems, these cognitive limitations can lead us into some very misguided beliefs that we may later have a difficult time parting with.

With respect to multi-core, there is currently a correlation between the fact that developing products with multi-core processors is technically challenging and the fact that adoption rates are generally low. However, this correlation doesn't explain why multi-core isn't being adopted, because in many cases it has been adopted and in other cases it is being rapidly adopted.

A good theory is a statement which explains how and why certain actions lead to certain results; such theories help explain what is happening in the present and also what will likely happen in the future (Christensen, et al., 2003). Using Christensen's framework as presented in Figure 2 below, we will first attempt to categorize the anomalies and identify causal mechanisms, not only as they pertain to the technology itself but also to the surrounding layers: the management strategy and organization of the adopting firm, the structure and dynamics of the ecosystem surrounding the technology, and the cognitive factors that contribute to the adoption process as well.
Figure 2: Observation, Categorization, Formulation Pyramid (Christensen, et al., 2003)

From these causal mechanisms, we will present a set of heuristics which can be used to predict the likelihood of a successful adoption of multi-core processors by adopting firms and also prescribe strategies that can predict success for suppliers of multi-core processors.

The process of categorization of the phenomena and identification of causal mechanisms that will be used later in this thesis relies on several concepts related to the structure and architecture of systems and the dynamics behind technology evolution and innovation, which will be explored in the next section. From there, we will explore several important topics related to the unique characteristics of embedded systems, forms of parallelism in processors and the key attributes of multi-core processors that will be used as we categorize the adoption patterns of multi-core processors in embedded systems.

3. Architecture, Innovation and Dominant Designs

The adoption of multi-core processors can require changes to an end-product design at both the component and the architectural level. There are several key concepts related to product architecture and the dynamics of innovation and adoption that will be used throughout this thesis, which we present in this section.

Structure & Architecture

Programming multi-core processors is centered on a paradigm of breaking problems into smaller pieces so the work can be distributed across computing elements and reliably processed concurrently. This involves determining how to take large pieces of complex software and partition them across multiple cores in a way that runs reliably and delivers more performance.
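This partitioning paradigm can be sketched in a few lines. The example below is illustrative only (the function names and the chunk-per-worker scheme are invented for this sketch, not taken from the thesis): a large job is split into independent pieces, the pieces are processed concurrently by a pool of workers, and the partial results are recombined.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for the real per-core workload.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    # Partition the problem into roughly one chunk per worker...
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...fan the chunks out across the workers, then recombine the results.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(process_chunk, chunks))

# The concurrent result must match the sequential one.
assert parallel_sum_of_squares(list(range(1000))) == sum(x * x for x in range(1000))
```

The hard part in practice is that real workloads rarely decompose into such independent pieces; shared state and synchronization reintroduce exactly the complexity this sketch avoids.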
We are dealing with software systems, which, unlike physical systems, are capable of virtually unbounded complexity; thus the management of complexity plays a key role in the process of using multi-core processors. Modularity and hierarchy also play an important role in managing complicatedness and complexity; they are valuable tools for decomposing larger systems into smaller pieces. As a result, we will first explore the relevant concepts of complexity, complicatedness, modularity and hierarchy. The end-products tend to be incremental in nature, and reuse from existing designs is very common. As design patterns borrowed from existing designs or from elsewhere in industry can facilitate adoption of multi-core, we will also explore the concept of design patterns.

Complexity and Complicatedness

Complexity is a quantifiable attribute of an architecture that describes the number of elements in the architecture, the degree of interconnectedness between those elements and, by some definitions, the level of dynamics present in the architecture. Complexity is an important attribute of any architecture that needs to be understood, developed or managed by a human, because our capabilities as human beings to manage complexity are both limited and, unfortunately, not evolving at a rate that keeps up with the complexity of the systems we are designing.

As we create larger systems using larger teams and interconnect these systems with an increasing number of linkages, our ability to comprehend, develop, predict and manage the behaviors of these systems is becoming increasingly limited. Thus we need to rely on tools and methodologies to assist in the design of systems whose complexity surpasses the capabilities of a single human being (Baldwin, et al., 2000). In the immortal words of Professor Ed Crawley, "Complexity is neither good nor bad, but must be managed."
(2009) Complexity is especially critical in the study of software and embedded systems because, unlike physical systems, which are bound by the laws of nature, software systems are practically unbounded and thus have the potential for almost unlimited complexity in comparison to the physical systems we create (Leveson, 1996). And not only do these systems themselves become more complex, but the human organizational system responsible for coordinating the development of these systems must also become more complex (Baldwin, et al., 2000). Complicatedness is a closely related system attribute that is sometimes used interchangeably with complexity but is distinct in both definition and importance in the scope of architecture. Complicatedness is a qualitative metric that refers to the difficulty humans have comprehending systems; this attribute is inherently more subjective. A key challenge for a system architect is to manage the evolution of a system's complexity in a manner in which a complex system doesn't appear to be complicated (Crawley, 2009).

Decomposition & Modularity

A module is a collection of elements in a system that have been grouped by a common intent and in a manner that minimizes interaction with other modules. Modular design is typically a top-down process whereby a larger system is decomposed into smaller modules. It starts with a high-level system design, which is then recursively decomposed into smaller modules until the complicatedness of a single module can be comprehended by a single individual, and its complexity can be isolated and hidden behind a simple interface through which it connects to adjacent modules (Baldwin, et al., 2000). Modularity is an important principle for designing, developing and maintaining complex systems because it both improves the comprehensibility and reduces the complicatedness of a system.
In the domain of software, decomposing systems into modules and providing clean abstraction layers via interfaces has become standard practice today because it helps us deal with the relatively unbounded levels of complexity. Modularity also provides a means for multiple people to work on different parts of the system simultaneously. People don't need to understand the whole system; rather, they need to understand how their module operates and how it must interface with the components around it. "If an artifact can be divided into separate parts, and the parts worked on by different people, the 'one person' limitation on complexity disappears. But this implies that techniques for dividing effort and knowledge are fundamental to the creation of highly complex manmade things." (Baldwin, et al., 2000) Fundamental to this capability is the concept of information hiding. Originally proposed by David Parnas (1972), information hiding is the practice of hiding certain functions and information within the module. As long as the module's interfaces are preserved, changes to the hidden functions and information don't impact other modules in the system. This is an essential attribute if multiple developers are to simultaneously work on multiple modules within the system. A fantastic modern example of a large-scale, highly modular product design is the Linux operating system. Alan MacCormack's research on Linux shows that more distributed design teams must rely on modular design because the level of communication between the engineers is significantly lower. By clearly defining modules and interfaces, portions of a distributed team can work within a module without needing to understand how the whole system works. This becomes particularly important as the size and complexity of the software increase.
Without a modular design, it would be extremely difficult for developers to learn how the whole system works, and coordinating the development would be close to impossible (MacCormack, et al., 2006). In order to decompose a system into modules, that system must first be decomposable. The decomposability of a system is the extent to which it can be iteratively decomposed in a manner in which high-frequency interactions occur within a sub-system and low-frequency interactions occur between sub-systems (Simon, 1962). The attribute of decomposability is central to our discussion on multi-core systems because, in the process of migrating from a single-core to a multi-core design, we must decompose the system into smaller pieces that can be distributed to these cores. If a system is decomposable, a key challenge is developing the right strategy to decompose it. For tangible objects that are more naturally bounded in their complexity, there may be intuitive points of modularization. However, for software systems, which may consist of several elements in a highly interconnected configuration, the boundaries may not always be clear (Baldwin, et al., 2000). There are several decomposition strategies that are useful for different types of systems, particularly in software, which is significantly less bounded than the physical domain. Systems can be decomposed by breaking down large system functions into small steps. This is called functional decomposition, which happens to be one of the oldest, most common and most widely taught software architecture methodologies (Bergland, 1981). There are also very different approaches, like decomposing a system based on interactions, which has been promoted as a possibly more optimal approach for multi-core processor architectures (Stein, 1996).
Decisions in the decomposition and modularization process need to be made very carefully, as the grouping of elements and definition of interfaces will likely exist for the entire lifespan of the product. As Bob Niebeck of Boeing noted about architecting aircraft, "as soon as you make something common, you're living with it for the life of the plane." (2009)

Hierarchy

The hierarchy of a structure typically describes how components in the structure are associated. An organizational hierarchy of a company, for example, is used to describe a ranking of individuals and who they are subordinate to. A containment hierarchy describes which components of a system contain other components. The Linux file system hierarchy shown in Figure 3 is an example of a containment hierarchy: certain directories contain other directories, which in turn contain more directories.

Figure 3: Containment hierarchy example - Linux file structure3

A compositional hierarchy recursively describes the compositional structure of a component in terms of the sub-components it consists of (Parsons, 1994). This is very similar to the hierarchic system concept that Herbert Simon proposed: one that is composed of interrelated sub-systems, which are in turn composed of smaller, interrelated sub-systems (Simon, 1962). The terms component and sub-system are synonymous with each other and, in many ways, synonymous with the concept of modules described earlier, as a module can be composed of several smaller modules. Computing systems are often organized as compositional hierarchies in both their hardware design and their software design. In a hardware design, there may be a Printed Circuit Board (PCB) containing several Integrated Circuits (ICs), which together form a system.
Each of these integrated circuits can itself be a complex sub-system, like a processor or a field programmable gate array (FPGA), that is in turn composed of several further interrelated sub-systems. In the software domain, large applications consist of smaller sub-systems. For example, a word processing application may consist of a sub-system to manage the display and user interface, one to manage spell checking, one to manage file storage and retrieval, etc.

Design Patterns

The concept of design patterns originated in Christopher Alexander's studies of cities and architecture. Alexander defined a design pattern as the "core of the solution" to a recurring problem, which allows the solution to be applied universally without "ever doing it the same way twice" (Alexander, 1977). He prescribes that the solutions present in these design patterns be presented in a general and abstract manner that allows them to be easily applied to problems. The use of design patterns has become popular in the last few decades amongst programmers for obvious reasons: many programmers are solving similar problems in very different contexts. The availability of a proven pattern that can be adapted to the problem at hand has clear merit. As we will see later in this thesis, architectural design patterns can play a key role in easing the difficulties associated with multi-core adoption and development.

Dynamics of Technology Evolution and Innovation

Firms developing products based around processors accumulate a great deal of knowledge and problem-solving skills that become key assets for ensuring deterministic product development schedules. Multi-core processors in particular require specific forms of knowledge and problem-solving skills that existing firms developing single-core and even multiprocessor designs may not possess.

3 http://www-uxsup.csx.cam.ac.uk/pub/doc/suse/suse9.3/suseIinux-useryuide en/cha.shell.html
The concept of knowledge and asset specificity to a product is the first topic that will be addressed in this section. Dominant designs exist in several forms within computing systems and can reduce the degree to which new knowledge assets need to be established when moving between technologies, like operating systems. This section will present the concept of dominant designs, which will be built upon later in the thesis. Finally, the nature of innovation is essential to this discussion, as different types of innovation have different implications for the firms adopting the innovation. The technology innovation framework developed by Henderson and Clark is particularly useful in the study of multi-core processor adoption and will be used throughout this thesis. It is the third concept presented in this section.

Product Knowledge and Assets

As a firm develops a product, it first builds and then leverages assets (knowledge and physical) around the product (Williamson, 1985). Product knowledge can be categorized as an understanding of the individual components of a system, referred to as component knowledge, and an understanding of how those components interact to create a desired function, referred to as architectural knowledge (Henderson, et al., 1990). In the context of a system containing hardware and software, an example of component knowledge could be an understanding of a certain software module, like a TCP/IP stack. Architectural knowledge could be the way to balance memory transactions on an external memory bus to maximize the performance of the TCP/IP code. In this sense, we have an interaction between the set of instructions executing the TCP/IP stack and the external memory interface of the processor. Another key type of knowledge associated with a product design is the set of strategies used for problem solving (Williamson, 1985).
Engineers develop knowledge from solving specific problems on previous projects that they can apply to new problems they encounter. This knowledge is beneficial when they encounter similar problems in the future. However, this knowledge can also be detrimental: when engineers encounter new problems, they may fall back on problem-solving strategies from older problems rather than considering all of the alternative (and potentially more suitable) problem-solving strategies for the new problem (Henderson, et al., 1990). This concept is particularly important to this thesis because concurrent programming represents a very different kind of problem, yet we see a pattern (to be discussed later in this document) in which existing solutions are being applied.

Dominant Designs

Several authors have described a common cyclical product innovation pattern that can be seen across many types of technologies, whereby a new technology is introduced that offers the promise of one or more benefits across certain technology parameters. Murmann and Frenken's meta-study on dominant design provides a broad overview of the various areas of research around dominant designs (Murmann, et al., 2006). In the early phases of a new technology, the industry experiments heavily with different product design concepts and develops knowledge. Eventually, a product architecture emerges that the industry gels around, which is then widely adopted and changes the nature of the competitive landscape. At this point, companies shift from rapid experimentation to cost reduction around the dominant design, a transition that fundamentally changes the nature of competition (Utterback, 1994). Dominant designs can be seen in several forms within computing systems. What is interesting is that we see dominant designs at various levels of the compositional hierarchy. For example, we have arguably seen the emergence of a dominant design in the single-core embedded processor space.
Despite the fact that processor architectures vary, the design concept between processors is fundamentally the same: a series of instructions is processed sequentially and a memory system holds program state information. The process through which instructions are developed has become highly standardized, and the interfaces through which these processors connect to other components in the larger system have also become highly standardized. Dominant designs have emerged for other system components as well. For example, DRAMs have become a price-driven commodity business built around a standard architecture4. We also see dominant designs emerging for software components. For example, the multi-threaded operating system dates back several years, and there are several types of multi-threaded operating systems in wide use today (Linux, Microsoft Windows, OS X and iOS, VxWorks, etc.), but the design concept of how threads work has stabilized, and processor architectures have evolved to accommodate this software design. The emergence of dominant designs in processor and operating system architecture has enabled programmers to migrate software designs between processors and operating systems without having to completely rewrite them from scratch to accommodate a radically different processor and/or operating system design. However, as we will see, multi-core processors represent a departure from these dominant designs.

Technology Innovation

The characterization of different forms of innovation as they apply to a product's architecture is particularly important with respect to embedded processor technology and the larger systems they're a part of. Henderson and Clark (1990) provide a very useful framework for characterizing innovation that can occur within a product or system. Incremental Innovation refers to innovation that improves components within a design but leaves the architecture intact.
This type of innovation fortifies a firm's component and architectural knowledge. We have seen this type of innovation for years in the processor industry in the form of increasing processor clock speeds. The interactions between components and the processor architecture remain the same, and thus existing knowledge is preserved, as well as other assets like software developed for that processor.

4 DRAMeXchange provides contract pricing for DRAMs from several suppliers. http://www.dramexchange.com/

Modular Innovation refers to innovation that changes a core design concept of a module but preserves the architecture. Modular innovation destroys component knowledge related to the component where the innovation has occurred, but it preserves architectural knowledge about how the components link together. An example would be a company migrating from a MIPS-based processor to an ARM-based processor. These processors have similar interfaces and likely support the same native language (i.e., C or C++) and operating system (Linux, VxWorks, etc.). The processor will execute software and interact with peripherals in a similar manner. However, an organization will now need to learn the ARM core. Most of the knowledge specific to the MIPS core they had become familiar with is no longer useful. Component knowledge goes beyond just the device in this case; it also covers the development tools used to program the device and the debugging/problem-solving techniques that may be specific to that product. Architectural Innovation occurs when the design concept of the components is preserved but the interaction between the components changes. Architectural innovation preserves component knowledge and destroys architectural knowledge. It is often caused by a change in a component that results in new interactions. Radical Innovation occurs when both the design concepts of the components and the linkages between them are overturned.
This form of innovation destroys both component and architectural knowledge. Radical innovation typically occurs before the formation of a new dominant design. A key limitation of Henderson and Clark's four forms of innovation is that they fail to incorporate the concept of compositional hierarchy (Murmann, et al., 2006). For example, radical innovation within the scope of a sub-system may manifest itself as a modular innovation at the system level, as shown below in Figure 4, if the sub-system interfaces are preserved and the nature of interactions at the system level doesn't change. This is a particularly important point within the scope of this thesis because a processor typically represents a sub-system within an embedded system. A move to multi-core can be considered a radical innovation within the scope of that sub-system. However, if the sub-system's external system interfaces are preserved, this radical innovation at the sub-system level is presented as a modular innovation at the system level.

Figure 4: Modular System Level Innovation, Radical Sub-System Level Innovation [modular innovation at the system level; radical innovation at the sub-system level (within A); note: external interfaces are preserved]

In this section, we have examined several important concepts and frameworks that will be applied throughout this thesis.

4. Embedded Systems

Multi-core processors have been broadly adopted in desktops, laptops and servers over the last 3-5 years. The current portfolio of processors from Intel and AMD that target these devices is almost all multi-core today5. All major operating systems for these devices (Windows 7, Apple OS X and Linux) support multi-core processors, and applications seem to run as reliably as they did on single-core machines, in many cases with increased system performance. So why hasn't this transition happened in a similar fashion within embedded systems?
This section provides important contextual information, outlining the key elements of embedded systems, such as processors, operating systems and programming languages, and their key attributes, such as diversity, relevant knowledge and design constraints.

Embedded Systems and Embedded Processors

PCs, laptops and servers are considered general-purpose computers, meaning they are designed to run a broad class of applications. The function of the computer is largely defined by the software the user runs on it. An embedded system, on the other hand, is generally classified as a system that contains a processor and is designed to deliver a specific set of dedicated functions. Embedded systems can be extremely simple, like the controller for a microwave, which is typically powered by a simple 8-bit microcontroller. They can also be highly sophisticated, such as the multichannel wireless processing systems within a cellular base station. An embedded system may also be part of a compositional hierarchy and thus a component within a larger system. For example, modern automobiles contain several embedded systems to control various elements of the vehicle: from cruise control to lane departure warning systems to the timing of the engine itself. Embedded systems are used in all of the major electronics segments, including consumer, communications, automotive, industrial, instrumentation, healthcare, military, and aerospace. Even general-purpose computers contain smaller embedded systems to control things like the power supply or the DVD drive, for example.
5 Intel processors: http://www.intel.com/products/desktop/processors/index.htm AMD processors: http://www.amd.com/us/products/Pages/processors.aspx

Figure 5: 2008 Microcontroller Revenue by Market (Databeans, 2008) [pie chart of the $13.7 billion USD 2008 microcontroller market by segment]

The first key attribute that distinguishes embedded systems from general-purpose computers is that they are extremely diverse. Embedded systems are typically powered by a class of processors known as embedded processors. Embedded processors vary widely in their capabilities based on the applications they serve and are thus equally diverse. There are several classes of embedded processors: microcontrollers (MCUs), digital signal processors (DSPs), microprocessors, Systems-on-Chip (SoCs) and more.

Microcontroller (MCU): A processor with a rich variety of on-chip peripherals that are optimized for specific system functions.

Microprocessor (MPU): In contrast to a microcontroller, a microprocessor is a more powerful processor typically designed for general-purpose functions.

Digital Signal Processor (DSP): A DSP is a type of microcontroller that is optimized for real-time and computationally intensive applications. DSPs are used in audio and video processing, wireless communications and system control applications.

System-on-Chip (SoC): An SoC is typically a collection of processor cores and dedicated hardware optimized for different tasks.

A large majority of the processors shipped worldwide each year are embedded processors. In 2010, 9.01 billion embedded processor units were shipped according to VDC Research (2011). In 2009, 308.3 million PC units were shipped according to Gartner research (2010) and 6.6 million server units according to IDC (2010). While some classes of servers and PCs utilize multiple processors, the embedded market is still at least an order of magnitude larger in terms of unit shipments.
In contrast to the heterogeneity of embedded systems, the general-purpose processors that power our PCs, laptops and servers are comparatively homogeneous. Migrating an application from a laptop with an Intel processor running Windows to a laptop with an AMD processor running Windows is a trivial process. While there may be subtle performance differences, the application will run without needing to be re-architected or recompiled. In the embedded systems domain, however, migrating from one processor to another is not a trivial process, because the processor architectures, instruction sets and capabilities vary to a much greater degree. Some amount of software work is almost always required to move between processors. If migrating between two processors with the same instruction set from the same vendor, the changes may be smaller, but if switching vendors and instruction sets, the changes can be significant.

Embedded Operating Systems

The majority of embedded systems run an operating system, just like general-purpose computers. An operating system is a piece of software that helps to streamline application development by providing a set of common system functions and hardware management that applications can be built upon. Operating systems can provide a wide variety of functionality via an abstraction layer, and programmers can quickly leverage functions such as task management, memory management, system resources, user interface components, file systems, networking, security, device management, and more, rather than implementing them from scratch. In the general-purpose space, Microsoft Windows variants, Apple's OS X and Linux variants power the majority of desktop and server applications.
Figure 6: OS market share for desktop computers6 [pie chart showing shares for Windows, Mac, iOS, Java ME, Linux and Android]

6 From netmarketshare.com: http://netmarketshare.com/operating-system-market-share.aspx?qprid=8

In the embedded space, about 70% of embedded systems use an operating system (UBM/EE Times Group, 2010). Just like embedded systems and embedded processors, there is much greater diversity in the breadth and nature of these embedded operating systems, again due to the diversity of the problems embedded systems are designed to solve. For processors managing a diverse set of tasks like networking, user interface applications or system control, a larger operating system with more building blocks for these types of functions -- like Linux or Windows CE -- may be more appropriate. A device performing a fixed function like audio or video processing can potentially use a very small operating system like FreeRTOS or uC/OS-II. Figure 7 below shows the most commonly used embedded operating systems.

Figure 7: Most commonly used operating systems in embedded systems7

With such a diverse set of operating systems available, and considering that 30% of embedded systems don't use an operating system at all, migrating embedded system designs between embedded operating systems can be time consuming.

System and Processor Diversity

The first important attribute is diversity. Embedded systems themselves are highly diverse, and so are the processors and the operating systems that power them.
Unlike the general-purpose computing space, which has been dominated by a few processor architectures (Intel and AMD) and operating systems (Windows, Linux, OS X), there are a number of companies today developing many variants of embedded processors, servicing both broad and narrow market segments that demand great diversity in functionality, connectivity and performance. This diversity means that migrating between embedded processors and embedded operating systems often involves software rework.

7 (UBM/EE Times Group, 2010)

More Limited Resources

General-purpose programmers developing x86 applications to run on Windows, Linux and Mac platforms enjoy a number of conveniences that programmers of embedded systems do not share. They rarely need to take memory limitations into account because modern PCs and servers have so much of it. Furthermore, virtual memory provides an abstraction layer that allows all applications to allocate and access massive amounts of memory without any knowledge of the actual hardware configuration. Embedded programmers, on the other hand, typically have much smaller memories to work with and, in many cases, need to manage this memory more carefully and manually, taking the actual hardware configuration into account. On-chip memory can be a dominant percentage of the die area of a product and has a direct impact on the cost of the chip. More on-chip memory means a larger die size, which increases the manufacturing costs for the suppliers. By constraining their applications to a small size, embedded systems programmers can also fit them into less expensive processors, thus bringing down their own system costs. The same tends to be true for clock speeds. A processor supplier may yield a small percentage of processors that run at a higher frequency, which they sell at a premium. Embedded systems engineers can save cost not only by using lower-memory variants of processors but also by using slower variants as well.
Platform "Stickiness"

By optimizing for specific processors, firms' embedded product designs can become more coupled to their processors. Essentially, embedded systems programmers face a key tradeoff in their designs between unit cost and design cost. By keeping their software in a high-level language like C, and in a portable, modular structure, they can more easily move between processor platforms, reducing design costs. However, by optimizing their application for a platform, they may be able to fit into a less expensive processor, thereby lowering unit cost. A common example of this is hand-optimizing pieces of code in the native assembly language of the processor. The result is that the software becomes more tightly coupled to the processor platform, and thus the platform becomes sticky within the firm. Code reuse is an extremely common practice in embedded system design. In 2010, 78% of embedded projects reused code that was developed in-house (excluding commercial off-the-shelf and open source software) and only 14% of projects reported no reuse at all (UBM/EE Times Group, 2010). If a new project is to reuse existing software, and that software was optimized for a specific processor, there will be additional incentive to keep the processor platform constant in the new design. Tacit knowledge also contributes to the stickiness of a processor. Because processors vary so widely in their architecture and, more importantly, their design and debug tools, firms must learn how the processor architecture functions, how the development and debug tools work and how to solve problems. This tacit knowledge is built up over time. At the beginning of the learning curve, the firm is wrestling with new types of problems, which may result in slipped schedules and sub-optimal system performance in the first product built around a new processor.
However, once the tacit knowledge is established, it becomes an important asset: the firm has confidence that it can reliably and predictably develop products around a processor platform. This has a great deal of value when the time-to-market of the end product is important. There are several emerging modern programming languages that have been specifically designed for the programming challenges of multi-core and multiprocessing systems (Patterson, 2010). In the embedded world, however, most projects are still written in C. The C programming language was developed at Bell Labs in 1971 (Ritchie, 2003), and in 2011, 62% of embedded system projects still use C as the primary language, a proportion that has been more or less constant for the last 5 years (UBM/EE Times, 2011). While adoption of these newer languages could help make embedded systems engineers more productive when developing multi-core designs, they're not on the radar in the embedded space (UBM/EE Times, 2011).

Product Life Cycle & Legacy

In several embedded market segments, the products in which embedded systems are used may have very long life cycles. It is not uncommon in industrial and military applications for a product to be on the market for a decade, with the original model still enjoying sales ten years after its launch. An organization must typically preserve several knowledge assets related to these products while they're in production (and for several years following). Firms in market segments with longer product life cycles must incorporate the risk of longer-term obsolescence into their product selection criteria. They need to ask not only whether the processor will still be supported and sold in 5 to 10 years, but also whether the supplier is likely to survive for 5 to 10 years. Since the firm will need to maintain its tacit knowledge for its existing products, there is an additional incentive to develop new products that leverage the tacit knowledge it needs to maintain anyway.
Real Time Constraints

A real-time constraint or capability means that a system needs to be capable of responding to certain types of events within a predefined amount of time. Hard real time describes real-time requirements that, if violated, may result in a system-level failure. For example, an embedded system controlling airbag deployment in a car needs to deploy the airbag within a certain amount of time after the collision is detected for the airbag to provide protection. Soft real time describes real-time constraints that, if violated, may result in decreased system performance but not failure. For example, a system that decodes a video stream needs to decode each frame in a certain amount of time. If a single frame isn't decoded in time, the video may glitch, but the system will continue to run. Around 75% of embedded systems have some form of real-time constraint (UBM/EE Times Group, 2010), and many systems have hard real-time constraints. Designing systems with real-time constraints, and particularly hard real-time constraints, can be very challenging. If a processor is running an operating system and handling several different tasks, it may need to rapidly switch from one task to another to respond to an event associated with a real-time constraint. An operating system is classified as a real-time operating system (RTOS) if it can support fast task switching to respond to system events in a deterministic amount of time. If a system has been tuned to meet certain soft and hard real-time constraints, this may be yet another impetus to stick with the existing software/hardware design if possible.

Platform Certification

In cases where the end product will communicate over a network or be used in a safety-critical application like automotive or healthcare, there may be a certification process the product must go through before it can be commercially sold.
Mobile phones, for example, must go through certification to prove that they comply with wireless standards and won't negatively impact the wireless networks they'll be part of. Changing software and processor platforms often requires recertification, which costs time and money.

Summary

To summarize, embedded systems are extremely diverse, and so are the processors and operating systems that power them. There is also a great deal of inertia around processor platforms, development tools, operating systems, real time performance and more. This inertia emerges from the diversity of the technology, the switching costs associated with changing processors and tools, certification status, the level of reuse and the product life spans. This inertia can even inhibit the adoption of processors that may only require incremental changes to existing software. For more radical technologies like multi-core processors, this inertia is even more powerful, particularly because multi-core design requires very new types of knowledge. Programming parallel applications requires very different skills, particularly when the parallelism is realized at high levels in the structure of the application.

5. Forms of Parallelism & Concurrency

To understand why multi-core programming requires a very different type of knowledge, it is important to understand how parallelism, inherent in multi-core processors, is typically implemented in processors and to what extent programmers need to manage and architect their applications around this parallelism.

Granularity

The level at which parallelism is implemented within a system is often described using the qualitative measure of granularity. Granularity in the sense of parallelism can be thought of as the size of the task utilizing a processing element (Bhujade, 1995). It can also be thought of as the amount of work done between synchronizing events between parallel processing resources (Tabirca, 2003).
Coarse granularity refers to the allocation of large amounts of work to a processing element, while fine granularity refers to the allocation of small units of work to a processing element. An example of coarse parallelism is the allocation of entire applications to different processing elements: on a dual-core processor, we could run a web server on the first core and a data collection and analysis application on the second. An example of fine-grain parallelism is splitting the left and right channels of an audio processing algorithm across two computation units within a processor core. Fine-grain parallelism typically occurs within low-level modules of an application and can be managed at the component or module level. Coarse-grain parallelism, however, occurs at higher levels and must be addressed at the architectural level. Multi-core programming is fundamentally coarse-grain parallelism, which is why, as we'll see, it has more to do with architectural optimization than modular optimization.

Types of Parallelism

There are two fundamental forms of parallelism. Data parallelism is the capability to operate on multiple pieces or streams of data in parallel. Task or instruction parallelism refers to the capability to run multiple independent instructions simultaneously. Parallelism can also be implemented at various levels in the processor design hierarchy. Bit-level parallelism, for example, is very low level, often transparent to the programmer, and affects memory reads and writes. Thread or task parallelism, on the other hand, exists at higher levels within the software architecture and tends to be more heavily managed by the programmer or the operating system.

Bit Level Parallelism

For a processor to perform an operation on a word that is larger than the native word length of the machine and its memory bus, the processor needs to perform multiple accesses to memory to retrieve the individual components of that word.
In the case of a 16-bit processor performing a 32-bit operation, for example, the processor would first fetch the lower 16 bits of the word and then the upper 16 bits. Bit-level parallelism means that the width of the data buses is increased to reduce the number of cycles required to fetch words larger than the native word length. This was the dominant form of parallelism found in general purpose processors until about 1986, when 32-bit bus widths became mainstream (Culler, et al., 1999). However, bit-level parallelism is still commonly exploited in the embedded space, where 8- and 16-bit processors are widely sold.

Figure 8: 2008 Microprocessor Revenue by Processor Type (Databeans, 2008)

Bit-level parallelism is handled automatically by most processors. On many modern processor architectures, a programmer typically doesn't need to instruct the processor to fetch two 16-bit words; they can perform a 32-bit read and the processor will automatically perform two sequential reads of 16-bit values over a 16-bit memory bus.

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) means that a processor is able to execute multiple instructions in a single cycle. ILP can be implemented in a serial fashion, whereby instructions are broken into smaller pieces of work that can be executed at a faster rate. This technique is known as "pipelining" and is commonplace in many modern processors.

A common analogy for a pipelined architecture is a factory assembly line. Imagine a single worker performing ten tasks, each requiring an equal amount of time. Now imagine hiring nine more workers and lining them up so that each worker handles one of those tasks. It still takes ten steps to complete one item of work, but the throughput is now ten times greater.
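The assembly-line analogy can be made concrete with a small timing model. Assuming one time unit per stage, an ideal 10-stage pipeline still takes 10 units to finish the first item but completes one additional item every unit thereafter (the helper names here are illustrative, not from any particular architecture manual):

```c
/* Total time units for an ideal pipeline to process n items:
   `stages` units to fill the pipe for the first item, then one
   item completes every unit thereafter. */
long pipeline_time(long n_items, long stages)
{
    if (n_items <= 0)
        return 0;
    return stages + (n_items - 1);
}

/* A single worker doing every stage serially needs stages * n_items. */
long serial_time(long n_items, long stages)
{
    return stages * n_items;
}
```

For 100 items and 10 stages, the serial worker needs 1000 units while the pipeline needs 109, approaching the tenfold throughput gain described above.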
ILP can also be implemented in a parallel fashion, whereby a CPU may execute more than one instruction at the same time. Flynn's taxonomy provides a useful means of categorizing the four types of parallel operation at the instruction level (Flynn, 1972).

SISD (Single Instruction, Single Data) - a SISD processor executes a single sequence of instructions and operates on a single data stream. Most embedded processors that target simpler control applications use SISD architectures.

SIMD (Single Instruction, Multiple Data) - a SIMD processor executes a single sequence of instructions on more than one data stream. SIMD is very useful in applications that contain multiple independent datasets that need to be processed in an identical fashion. Audio and image processing lend themselves well to SIMD, for example. When processing stereo audio, there are two data streams: the left and right channels. A SIMD-capable processor can execute one instruction and operate on independent data from both the left and right channels. For a video application, a SIMD processor could operate on several pixels simultaneously. In fact, the Intel MMX instruction set extensions are essentially a SIMD engine with 8 sets of computational units (Intel, 2000).

MIMD (Multiple Instruction, Multiple Data) - a MIMD processor is capable of executing more than one instruction per cycle and operating on more than one data stream. A common example of a MIMD implementation is a superscalar processor, which is still considered a single CPU but is able to dispatch instructions to multiple computation blocks. In the author's experience, flexible superscalar architectures can often be programmed fairly efficiently from high-level languages if the programmer understands the architecture, compiler directives and intrinsic functions well enough to maximize performance. Both pipelining and superscalar execution became popular in the 1990s (Culler, et al., 1999).
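The stereo-audio SIMD case above can be sketched in scalar C. With the left and right samples interleaved in a single array, every iteration performs the same operation on independent data, which is exactly the pattern an auto-vectorizing compiler can map onto SIMD lanes. The function name and buffer layout are illustrative assumptions, not any vendor's intrinsics:

```c
/* Applies a gain to an interleaved stereo buffer (L, R, L, R, ...).
   Each iteration applies the identical operation to an independent
   sample - the pattern a SIMD unit executes with one instruction
   across multiple data elements. */
void apply_gain(float *samples, int n_samples, float gain)
{
    for (int i = 0; i < n_samples; i++)
        samples[i] *= gain;
}
```

Because left and right channels sit side by side in memory, a SIMD load can fetch one sample from each channel in a single access, echoing the interleaving point made for the SHARC below.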
The Pentium processor was Intel's first processor to support both pipelining and superscalar execution.

MISD (Multiple Instruction, Single Data) - a MISD processor is capable of executing multiple instructions on a single stream of data. There aren't any mainstream MISD processor architectures on the market; however, systems can be configured in a manner that reflects this architecture. For example, a safety-critical application may have two processors operating on a single stream of data. The output from each processor is compared, and if the results differ, a system error (software or hardware) has occurred in one of the processors and the output data is invalid. In some cases, firms will use two different processor architectures to operate on the same data, to protect against a situation in which the same silicon anomaly causes both processors to calculate matching but erroneous results.

For the programmer, instruction level parallelism within the processor is useful for fine-grained parallelism in the application and can be managed by directing the compiler via compiler directives. Programmers who understand the architecture may be able to achieve higher performance. The manner in which data is organized, for example, may allow for more aggressive use of instruction level parallelism. A SIMD processor like the Analog Devices SHARC can operate on two data streams if the data is interleaved in a single array. However, if the two data streams are held in two distinct arrays, an architectural decision that may be more intuitive and clean to the programmer, the processor is unable to utilize both computational units, because the memory system cannot load operands from non-interleaved memory into the core in a single cycle.
Also, most compilers provide extensions via #pragma statements that allow the programmer to give the compiler additional information about the nature of the data so that it can safely parallelize certain regions of code. Instruction and bit-level parallelism are typically confined within low-level software modules; they may improve the performance of those modules but don't typically impact the system architecture.

Task Parallelism & Operating Systems

Task parallelism refers to the distribution of coarse-grain tasks at the operating system level across the processor cores in the system. In contrast to instruction and bit-level parallelism, task parallelism can exist at both the modular level (finer-grain parallelism) and the architectural level (coarser-grain parallelism). A single programmer may be able to manage instruction and bit-level parallelism, but on larger systems, coarse-grain task-level parallelism may affect several programmers.

There are three very important elements related to task-level parallelism.

Dominant Design for Operating Systems

The first element is that a dominant design has emerged around the way task-level parallelism is implemented within operating systems. The design paradigm consists of processes and threads. A process is a set of instructions that makes up a single instance of a program; a multitasking operating system can support several processes. A process is composed of one or more threads. A thread is similar to a process in that it is a set of instructions, but threads running within a process share the same memory space. A process may consist of one thread that manages user interface interactions and a second thread that manages data processing. The user interface thread can store state information related to the user interface in this shared memory space, which the data processing thread can then read.
Different thread and process configurations are shown in Figure 9 below.

Figure 9: Threads and Processes - Possible Configurations8

Applications built upon a collection of threads are referred to as multi-threaded programs; monolithic applications that aren't broken down into threads the OS can manage are referred to as single-threaded programs.

Reciprocal Support in Processors

This design has become prevalent enough in both embedded and general purpose computing that functions to facilitate it, such as hyper-threading and symmetric multiprocessing (SMP), have been implemented in processors. This is the second key element related to task-level parallelism.

8 Image from: http://www.cs.cf.ac.uk/Dave/C/node29.html

SMP, which is covered in more detail in the next section, allows the threads that make up a process to be distributed across multiple cores, either manually by the programmer or automatically by the operating system. This is very powerful because it means that programmers can follow the design practices they used for multi-threaded applications on single-core processors, while the operating system decides at run time which core should execute a given thread. Hence, applications don't necessarily need to be completely repartitioned to run on a multi-core device.

Writing multithreaded applications for multi-core processors does introduce several new challenges for programmers. Code becomes more difficult to predict than a traditional single-threaded application because a preemptive operating system may interrupt threads at arbitrary points and switch to other threads.

"A folk definition of insanity is to do the same thing over and over again and to expect the results to be different. By this definition, we in fact require that programmers of multi-threaded systems be insane. Were they sane, they could not understand their programs."
- Edward Lee (2006)

Software modules that use synchronization locks are not composable, meaning that two correct pieces of code cannot always be combined to form a single, larger piece of correct code (Lee, 2006). The implication is that as we combine software using locks into larger pieces of software, new failure modes will emerge. This can be particularly troublesome if some of these pieces of software originated outside the organization and source code isn't available. Common failure modes with locks are conditions called deadlocks and livelocks.

A deadlock can occur, for example, when a thread on one core has locked one resource and is attempting to lock a second, while a thread on another core has locked the second resource and is attempting to lock the first. At this point, neither thread can proceed because each is waiting for the other to release a resource.

    /* Thread 1 */                      /* Thread 2 */
    void update1()                      void update2()
    {                                   {
        acquire(A);                         acquire(B);
        acquire(B); /* Thread 1             acquire(A); /* Thread 2
                       waits here */                       waits here */
        variable1++;                        variable1++;
        release(B);                         release(A);
        release(A);                         release(B);
    }                                   }

Listing 1: Example of a scenario that will lead to deadlock9

A livelock, on the other hand, occurs when two threads enter an unending loop of acquiring and releasing. Livelocks can occur when attempting to write code to prevent deadlocks (Gove, 2011).

9 (Gove, 2011)

Application Portability Across Operating Systems and Architectures

The third element is that this design paradigm makes it easier to port applications between operating systems that are built around it. Furthermore, the POSIX (Portable Operating System Interface) standard has emerged, providing a common threading API. This further improves the portability of applications across not only single-core architectures but, in some cases, multi-core architectures.
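A common discipline for avoiding the deadlock in Listing 1 is to impose a single global lock order: if both threads always acquire A before B, the circular wait can never form. A minimal sketch using POSIX mutexes follows; the function names are mine and error handling is omitted for brevity:

```c
#include <pthread.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;
static long variable1 = 0;

/* Both updaters take A before B, so the circular wait of
   Listing 1 cannot occur. */
static void locked_increment(void)
{
    pthread_mutex_lock(&A);
    pthread_mutex_lock(&B);
    variable1++;
    pthread_mutex_unlock(&B);
    pthread_mutex_unlock(&A);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        locked_increment();
    return NULL;
}

/* Runs two concurrent updaters to completion and returns the
   final count; with consistent lock ordering this always
   terminates and always yields 200000. */
long run_lock_order_demo(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return variable1;
}
```

Lock ordering is only one mitigation; as the livelock discussion above suggests, such workarounds can themselves introduce new failure modes if applied carelessly.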
The POSIX standard was defined in 1988 and aimed to improve the portability of applications across platforms by defining a set of common Application Programming Interfaces (APIs) (The Open Group, 2006). Its intent is to make porting a POSIX-compliant application from one POSIX-compliant operating system to another extremely easy. Several of the top embedded operating systems shown in Figure 7 on page 28 are POSIX compliant, including QNX Neutrino10, VxWorks11, and Nucleus12. While Linux is not fully POSIX-compliant, it is highly compliant, and POSIX-compliant applications can typically be ported to Linux with little or no modification (Locke, 2006). The POSIX API covers several different types of functions, as shown in Figure 10.

10 http://www.qnx.com/news/pr 959 L.html
11 http://get.posixcertified.ieee.org/cert prodlist.tpl
12 http://www.mentor.com/company/news/nucleusposix07

Figure 10: POSIX APIs13 (the standards in the POSIX family, POSIX.1 through POSIX.17, with their ISO/IEEE document numbers and the areas each covers, such as base interfaces, real-time extensions, threads, security, system administration and networking)

A subset of the POSIX standard, 1003.1c, addresses the standardization of threads and is called POSIX threads, or pthreads. The pthreads library contains around 60 functions for thread management. The API includes functions for creating and destroying threads, as well as functions for passing data between threads and synchronizing them (Gove, 2011).
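The create/join portion of the pthreads API mentioned above can be sketched in a few lines. The helper names square_worker and square_in_thread are illustrative; only pthread_create and pthread_join are from the standard API:

```c
#include <pthread.h>
#include <stdint.h>

/* Thread body: squares the integer smuggled in through the
   void* argument and returns it the same way. */
static void *square_worker(void *arg)
{
    intptr_t n = (intptr_t)arg;
    return (void *)(n * n);
}

/* Creates a thread, passes it a value, and collects its result,
   exercising pthread_create and pthread_join. */
long square_in_thread(long n)
{
    pthread_t tid;
    void *result = NULL;
    pthread_create(&tid, NULL, square_worker, (void *)(intptr_t)n);
    pthread_join(tid, &result);
    return (long)(intptr_t)result;
}
```

Because only standard pthreads calls are used, this same code compiles unchanged on any POSIX-compliant operating system, which is precisely the portability argument made above.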
As noted earlier, pthreads can be used on single-core and multi-core processors and have become a "standard commodity" in multi-core application development (Freescale Semiconductor, 2009).

Summary

To summarize, instruction and bit-level parallelism are common on single-core processors and are usually implemented at low levels in the system, so the improvements are modular in nature. Task-level parallelism is implemented at both the modular level (fine-grain parallelism) and the architectural level (coarse-grain parallelism). Task-level parallelism is implemented in a common fashion across most operating systems today, which allows greater portability of applications between operating systems built around this design. Furthermore, when properly supported by the operating system and the hardware, applications can be ported between single-core and multi-core devices without fundamentally rearchitecting the software. While this solution doesn't apply universally, as we'll see in subsequent sections, it is a critical component in the adoption patterns.

13 http://www.comptechdoc.org/os/linux/programming/c/linux pgcintro.html

6. Attributes of Multi-core Processors

In the last section, we examined forms of parallelism in processor architectures and how task parallelism has been standardized across operating systems. The architecture of a multi-core processor has significant impacts on the challenges associated with adoption. As noted earlier, reciprocal functions in the processor like SMP can ease the challenges of adoption; however, SMP is not universally applicable and has fundamental limits. This section will examine several important architectural attributes of multi-core processors that differentiate them from single-core devices and from each other, such as the type and number of cores, how resources are shared, the architecture of those shared resources, and how multiprocessing takes place.
Each of these attributes directly affects the challenges associated with adoption.

Number of Cores

By definition, a processor that contains at least two cores is a multi-core processor. Multi-core processors with on the order of tens, hundreds or even thousands of cores in a single package are known as many-core processors.

Types of Cores - Homogeneous and Heterogeneous

A multi-core processor is said to be heterogeneous if the architectures and instruction sets of its cores differ, and homogeneous if it contains multiple identical processor cores. Homogeneous processors may contain 2, 4, 8 or even 100 processor cores with the same instruction set architecture and can be applied to general purpose computing applications or embedded applications that need higher levels of performance than a single-core device can provide. Heterogeneous processors can often provide a more optimized platform in that different types of processor cores are optimized for different functions within the system. Many modern System-on-Chip (SoC) devices utilize a heterogeneous collection of cores and/or hardware accelerators, each optimized for different system tasks.

The Intel media processor SoC shown in Figure 11 is designed to power DVD players, TVs and cable set-top boxes, and it contains several different types of processors optimized for the different functions the device needs to perform. For example, a dedicated DSP is used for audio processing, while dedicated display and graphics processors are used for various video functions. The Pentium M processor likely controls the entire system and elements like the user interface and networking stacks.
Figure 11: Intel Media Processor CE 3100 Block Diagram14

Several existing embedded systems already use a heterogeneous collection of processors, typically because certain tasks within an embedded system run significantly more efficiently on one type of processor than another. In a heterogeneous processor architecture, each processor will typically run its own unique application, and the degree of modularity at the system level is typically high: there is generally far less interaction between the processors than between software modules running on a single processor (Levy, 2008).

Resource Sharing

Processor cores can be collocated on a single, monolithic silicon die; these are referred to as chip multiprocessors (CMP). In a CMP configuration, the cores may share several resources, including I/O peripherals, memory, and more.

Figure 12: Apple A5 dual-core ARM A915

14 http://download.intel.com/design/celect/downloads/ce3100-product-brief.pdf
15 http://www.abdulrehman.net/wp-content/uploads/2011/03/Apple-A4-and-A5.jpg

Multiple processor cores can also be on different silicon dies but integrated into the same package as a multi-chip module (MCM), as Intel did with the "Yorkfield" processor shown in Figure 13 below.

Figure 13: Intel "Yorkfield" Quad-core MCM16

Resource sharing can have a significant impact on application performance on a multi-core processor, and it can also affect scalability to a greater number of cores. For example, if all cores share a common external memory bus, the bandwidth available to each core decreases as the number of cores increases.

Memory Architecture

The memory architecture of a multi-core processor can vary widely and can have a significant impact on application performance if not understood and exploited properly. A key characteristic of memory is latency: the time required for a processor to perform a read from the memory system.
A memory system with low latency can be read and written faster than a memory system with high latency. As memory is located further from a processor core, the latency increases because it physically takes more time for the control and data signals to travel between the core and the memory system and for a value to be returned. When memory is located off chip, it typically has a much higher latency than on-chip memory. On larger general purpose processors, external memory accesses can require hundreds of processor clock cycles. For example, an AMD Athlon 64 at 2.0 GHz with DDR-400 memory has a memory latency of 50ns17. At 2 GHz, the cycle time of the processor core is 0.5ns, which means the core waits around 100 cycles for a read to complete. An example we will examine below utilizes an ARM Cortex-A9, which has an external latency closer to 20 cycles.

External memory latency is particularly important for multi-core processors because these devices typically have one external memory bus that the cores share. As noted above, shared resources can degrade performance. For example, if two or three cores try to access external memory simultaneously, some cores will need to wait several hundred clock cycles for the access to complete.

16 http://hothardware.com/articleimages/Item1289/small Intel-08200S-0955S-yorkfield.jpg
17 http://www.anandtech.com/show/1610/6

The memory system is typically organized in a hierarchical structure. There is a small amount of on-chip memory close to the processor that can be accessed at the clock speed of the processor, referred to as L1 memory. Some processors also have a larger amount of on-chip memory located further from the core with a longer latency, called L2 memory. Most processors also support external memory, which is sometimes referred to as L3 memory.
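The Athlon 64 arithmetic above (50ns at 2.0 GHz is roughly 100 cycles) generalizes to a one-line conversion between absolute latency and stall cycles; the helper name is mine:

```c
/* Converts an absolute memory latency into core stall cycles at a
   given clock rate: 1/clock_ghz gives the cycle time in ns, so
   50 ns at 2.0 GHz (0.5 ns/cycle) is 100 cycles. */
long latency_cycles(double latency_ns, double clock_ghz)
{
    double cycle_ns = 1.0 / clock_ghz;          /* cycle time in ns */
    return (long)(latency_ns / cycle_ns + 0.5); /* round to nearest */
}
```

The same calculation shows why the problem worsens as clock rates rise: the same 50ns of DRAM latency costs twice as many stall cycles at 4 GHz as at 2 GHz.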
Some embedded processor families designed for a broad variety of application tasks, like the ARM, MIPS and PowerPC families, use L1 and L2 as cache memories. Some embedded processors that are more optimized for real-time and signal processing applications may offer the flexibility of using L1 memory as either cache or random access memory. The Blackfin processor from Analog Devices, for example, allows L1 to be configured either as cache or RAM18. This option exists because it isn't always necessary to use external memory in an embedded system; applications can be small enough to fit within the internal RAM of the device.

Faster layers of memory closer to the processor core serve as cache for slower layers further from the core. If the core fetches a value from external memory, the value can also be stored in L1 cache. When the processor needs to operate on that value again, it can access it in L1 cache, which can be significantly faster than L3. The diagram in Figure 14 shows the memory hierarchy for a system with L1 and L2 caches and external L3 memory, along with example latencies.

Figure 14: Memory access times (core clock cycles) for different levels of memory19 (L1 cache: 1 cycle; L2 cache: 7 cycles; off-chip L3 memory: 20 cycles)

Consider an ARM Cortex-A9 with 32KB of L1 cache, 512KB of L2 cache and external SDRAM operating at 100MHz (10ns cycle time). L1 cache can operate at the clock speed of the processor. As can be seen in Figure 15 below, data sets smaller than 32KB can fit in L1 cache and can be accessed with a latency of 10ns once they've been cached. As the data set grows beyond the size of L1 cache, the L2 cache is used, which results in larger latencies.

18 The Blackfin datasheet contains information about configuring L1 memory as SRAM or cache: http://www.analog.com/static/imported-files/datasheets/ADSP-BF531_BF532_BF533.pdf
19 Adapted from a diagram in (Gove, 2011)
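The payoff of the hierarchy can be sketched with a standard average-access-time calculation, using the example latencies of 1, 7 and 20 cycles from Figure 14. The hit rates are illustrative assumptions, not measured values:

```c
/* Average memory access cost (in core cycles) for a two-level
   cache in front of external memory, modeled as a hit-rate
   weighted sum of the Figure 14 latencies. The hit rates passed
   in are assumptions for illustration. */
double average_access_cycles(double l1_hit, double l2_hit)
{
    const double L1 = 1.0, L2 = 7.0, L3 = 20.0;
    return l1_hit * L1
         + (1.0 - l1_hit) * (l2_hit * L2 + (1.0 - l2_hit) * L3);
}
```

With a 90% L1 hit rate and an 80% L2 hit rate on the remainder, the average access costs about 1.86 cycles, a small fraction of the 20-cycle external latency, which is why working sets that fit in cache perform so much better in Figure 15.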
At a certain point, the dataset is larger than both caches; the benefits provided by the caches are no longer perceptible and most accesses to external memory incur the full 200ns latency.

Figure 15: Memory Latency on ARM Cortex A920 (additional latency in cycles versus data set size, 4KB to 4096KB)

Shared Memory

A common memory architecture for multi-core processors is the shared memory architecture, which follows the hierarchical paradigm presented above. In this configuration, several processor cores uniformly access the same memory via the same memory addresses. This approach provides several advantages. It simplifies the programming model because each core can operate on the same dataset by placing it in a shared region of memory, rather than creating redundant copies of the dataset in each core's local memory. Shared memory is also a means for tasks running on different processor cores to communicate with each other. For programmers migrating from a single-core design, tasks now spread across different processor cores can share memory just as they did in the single-core design.

20 http://www.ruhr-uni-bochum.de/integriertesysteme/emuco/files/System Level Benchmarking Analysis of the Cortex A9 MPCore.pdf

Figure 16: Shared Memory Architecture

In many multi-core systems, the processors also have their own local cache memories, as seen in Figure 17. If Core 0 has retrieved a value from memory, it will be cached in its local cache memory. If Core 0 then modifies this value, the value stored in cache no longer matches the value stored in memory. If Core 0 were the only processor in the system, the cache could perform a write-back operation at some point in the future to synchronize the memories. However, in a multi-core system, Core 1 may access the 'old' value in memory on the cycle after Core 0 modifies the value in its local cache.
The solution to this problem is a mechanism that allows the cache memories to synchronize. Cache coherency is an architectural feature that allows processor cores with local cache memories to maintain a coherent view of memory.

Figure 17: Shared Memory Architecture with Cache Coherency

Cache coherency improves the performance of a shared memory system by adding the benefits of caching. However, the challenge of keeping the cache memories synchronized grows rapidly as the number of cores increases. Intel refers to this as the "coherency wall": an increase in resource sharing can have adverse performance effects as the number of cores is increased (Matson, 2010). Today, most embedded processors with a shared memory architecture don't support coherency across more than four cores. The ARM Cortex-A9, as well as the forthcoming ARM Cortex-A15, for example, can support memory coherency across 4 cores21.

21 ARM Cortex-A9 specs: http://www.arm.com/products/processors/cortex-a/cortex-a9.php
ARM Cortex-A15 specs: http://www.arm.com/products/processors/cortex-a/cortex-a15.php

Distributed Memory

The distributed memory architecture is one in which each processor has a private local memory and the cores are linked via an on-chip network. A key benefit of these architectures is scalability, in part because the coherency wall doesn't apply: cache memory doesn't need to be synchronized between all of the cores. However, these architectures can be more difficult to program than shared memory architectures for a few reasons. Memory isn't implicitly shared, which means that data placement must be managed explicitly. Communication between cores also becomes more explicit. With a shared memory model, one core can write to a shared location and several other cores can read that value. In a distributed model, the first core needs to communicate explicitly with each of the other cores.
Figure 18: Distributed Memory Architecture

Hybrid Variants

The shared memory and distributed memory architectures are two extremes, and there are hybrid implementations that borrow from each. In some forms of distributed memory architecture, the cores can access shared memory over the on-chip network. Tilera, for example, utilizes a distributed cache architecture built on a mesh-based network that supports cache coherency between groups of cores22.

Multi-Threading

As the number of processor cycles required to complete external memory accesses has increased, processor cores can spend more time simply waiting for data reads to complete. On a typical single-core processor running a preemptive operating system, processes and threads share time slots on a single CPU. When a process or thread performs an external memory read, the entire system must wait for that operation to complete.

22 http://www.tilera.com/sites/default/files/productbriefs/TILEPro64 Processor PB019 v4.pdf

Vertical multi-threading is an architectural technique to make more productive use of cycles spent waiting for a long-latency memory operation to complete. A processor with multi-threading capabilities has two or more sets of registers to store thread state; while one thread is stalled on a long-latency memory read, another thread can execute. An architecture that supports Simultaneous Multi-threading (SMT) can execute multiple threads concurrently. The figure below shows the difference between single-threaded, vertical multi-threaded and simultaneous multi-threaded execution.
Figure 19: Single Threading vs. Super-threading vs. Simultaneous Multi-threading [23]

A processor architecture which supports SMT presents itself to the operating system as a multi-core architecture, even when there is only one physical core. A dual-core processor that can support 4 threads per core will appear to the operating system as an 8-core processor.

Asymmetric Multiprocessing (AMP) and Symmetric Multiprocessing (SMP)

An asymmetric multiprocessing (AMP or ASMP) configuration is one in which cores run their own operating systems and applications and are controlled independently, much like single-processor systems, as shown in Figure 20 below. The programmer must explicitly manage the sharing of resources and the communications between the cores (Leroux, et al., 2006). AMP is a good fit for existing, monolithic legacy code, where a legacy application can be placed on one core with an operating system running on a second core.

[23] http://www.fujitsu.com/global/services/computing/server/sparcenterprise/technology/performance/processor.html

Figure 20: AMP Configuration [24]

In contrast, a symmetric multiprocessing (SMP) configuration is one in which a single instance of an operating system runs across several cores and the operating system controls all system resources, such as memory and I/O. In this configuration, processes and threads can be assigned to different cores dynamically, based on the loading levels of the different cores, or statically by the programmer.
Figure 21: SMP Configuration [25]

[24] Adapted from a diagram in (Leroux, et al., 2006)
[25] Adapted from a diagram in (Leroux, et al., 2006)

Essentially, this means that a POSIX-compliant multithreaded application can be ported to a POSIX-compliant SMP operating system and take advantage of multiple cores, often without requiring a fundamental rearchitecture of the system, as will be explored in the next chapter of this thesis. Several challenges still exist for programmers, but in many cases this allows them to preserve most of their existing knowledge and architecture and ease into multi-core development. Several mainstream embedded operating systems support SMP, including Linux [26], VxWorks [27], QNX [28], Enea [29] and more.

There are several types of embedded multi-core processors today that support SMP. Several processors are built on the ARM11, ARM Cortex A9 and forthcoming ARM Cortex A15 cores. Included in the list of multi-core Cortex A9 processors are the Apple A5, which powers the iPad 2 [30]; the NVidia Tegra 2, which powers the Motorola Xoom tablet running Android v3/Honeycomb [31]; the Samsung Exynos [32]; the Texas Instruments OMAP4 [33]; and several more. In a second group, there are standard MCU designs that are not based on ARM cores but do support SMP; examples are the PowerPC-based Freescale MPC745x family built on the e600 core. Finally, there is an emerging class of many-core processors, such as Tilera's TilePro family of 32-100 core processors, which support SMP as well [34].

Multi-core processors represent a significant departure from single-core processor architectures, and there is a significant amount of knowledge, tooling, and product built around single-core architectures.
SMP is special because it essentially allows the entire software industry to retain its current knowledge and software assets while realizing some level of increased computational performance and battery life from multi-core architectures.

The Future of SMP

The processor industry has been speaking about a "day of reckoning" in which programmers would need to abandon the sequential programming practices that have become so ingrained over the last several decades and deal directly with parallelism. This was the topic of a famous 2005 article by Herb Sutter entitled "The Free Lunch is Over" (Sutter, 2005) and was the heavily discussed topic on the opening day of the 2011 Multi-core Expo (EE Times, 2011). For the last four decades, the processor industry has delivered increasingly faster processors year over year. Software developers have been able to rely on newer, faster processors to increase the performance of their products by virtue of the fact that their existing code would typically run faster on faster processors.

[26] http://www.ibm.com/developerworks/library/l-linux-smp/
[27] http://www.windriver.com/products/platforms/smp/
[28] http://www.qnx.com/news/pr 1962 1.html
[29] http://www.eetimes.com/electronics-news/4136532/Enea-debuts-multicore-OS-combining-AMP-SMP-kernel-support?cid=NL Embedded
[30] http://www.apple.com/ipad/specs/
[31] http://www.anandtech.com/show/4117/motorola-xoom-nvidia-tegra-2-the-honeycomb-platform
[32] http://armdevices.net/2011/02/11/samsung-orion-exynos-4210-arm-cortex-a9-in-production-next-month/
[33] http://focus.ti.com/pr/docs/preldetail.tsp?sectionId=594&prelId=sc09021
[34] http://www.tilera.com/products/processors/TILEPRO64
Figure 22: Intel Processor Clock Speed by Year [35]

Between the operating system and the architectural features of modern processors, programmers today operate within a beautifully abstracted environment in which their processes have the illusion of massive amounts of memory, uninterrupted access to a processor, and protected, shared memory across all threads. SMP has essentially allowed chunks of the embedded systems industry to begin adopting multi-core processors while continuing to work in the threaded software paradigm. In essence, the portions of the industry that have been developing applications atop operating systems that support SMP have been able to treat multi-core almost as an incremental innovation. SMP is allowing the widespread programming practices used on single-core processors to persist on multi-core machines (Lee, 2006). While SMP, and the operating systems which support it, offers an effective abstraction layer allowing multithreaded programs to migrate to multi-core devices, it is known to have limitations. Many suggest that SMP doesn't scale beyond 4-8 cores (Gwennap, 2010). What happens next?

[35] http://smoothspan.wordpress.com/2007/09/06/a-picture-of-the-multicore-crisis/

7. Embedded Multi-core Processor Adoption

The preceding sections have provided a foundation by characterizing the dynamics of adoption, the key characteristics of embedded systems, the general forms of parallelism and concurrency, and the key attributes of multi-core processors. This chapter explores the patterns of adoption observed for multi-core processors in embedded systems that are primarily driven by technical and architectural considerations; the following chapter builds on this by considering broader system-level factors and challenges affecting adoption.
The first part of this chapter explores the important factors that influence the adoption of multi-core processors in embedded systems. Adoption depends on the balance between their benefits, such as higher performance, lower power or both, and the challenges, such as architectural issues, generality, code re-use, interactions amongst sub-systems, and optimization and debugging. The second part of this chapter builds on this by identifying four key adoption scenarios and the likely outcome in each case.

Benefits of Multi-core in Embedded Systems

Performance

Embedded multi-core adoption is widely observed in performance-driven applications where the problem could not be solved as economically with other technologies (Gentile, 2011). Examples of high-performance embedded applications that may require the use of multi-core processors include wireless base stations, test and automation equipment, high-performance video, medical and high-end imaging devices, and high-performance computing (Texas Instruments, 2011).

There are also several existing application areas that have historically used multiple processors in conjunction to solve problems; phased-array radar, cellular base stations and networking equipment are a few such examples. In application areas like these, which have historically required partitioning problems across multiple processors, there are typically a few 'wizards' in the organization who are well versed in architecting and programming multiprocessor and multi-core systems, as well as in the problem-solving techniques needed to effectively develop and deploy high-performance applications. The problem is that such wizards represent a tiny minority of all embedded programmers. Only 17% of engineers surveyed in VDC Research's 2011 Embedded Market study had more than 4 years of experience with multiprocessor or multi-core design (2011), and several of these likely fall into the design-pattern case above.
Power Dissipation

Utilizing multiple cores at a lower clock speed but with higher aggregate computational capability can actually dissipate less power than a single-core system running at a higher clock frequency. For example, the power dissipation values for the single- and dual-core variants of the Freescale MPC8641 processor are shown below in Figure 23. The dual-core configuration running at 1.0 GHz with a core voltage of 0.9 V dissipates less power than a single-core configuration running at 1.5 GHz with a 1.1 V core voltage. In this example, the dual core provides a 33% increase in theoretical performance yet requires around 16 W instead of 20 W.

Figure 23: Power consumption comparison of single- and dual-core implementations of the Freescale MPC8641 [36]

There are two types of power dissipation in processors: static and dynamic. Static power dissipation originates from current 'leaking' through gates when they're not switching. Dynamic power originates from gates switching in the device. Dynamic power dissipation in a processor is a function of switching frequency, the amount of capacitance on the gates, and the voltage. The equation below is used to estimate dynamic power dissipation, where A is the 'activity factor' (the percentage of gates that are switching), C is the capacitance that the gates are driving, V is the voltage and F is the frequency.

Pdyn = A x C x V^2 x F

[36] (Freescale Semiconductor, 2009)

As frequency increases, dynamic power increases linearly. However, higher clock speeds may require a higher core voltage. Figure 24 below shows an example of the dynamic current used by the Blackfin processor at different clock frequencies and core voltages. As clock speed increases, the core voltage needs to increase, and as a result power grows much faster than linearly with clock speed, since frequency and the square of voltage multiply together.
Figure 24: Dynamic current over frequency (50-600 MHz) and core voltage (0.80-1.45 V) for the ADSP-BF533 embedded processor [37]

If frequency is reduced, the core voltage can typically also be reduced, which leads to the power savings shown in Figure 23.

Size / Density / Cost

When compared to a multiprocessor system, multi-core processors often cost less and take up less PCB real estate than the equivalent multiprocessor system. They cost less because they require less material: a single package and die in the case of a multi-core device, versus multiple packages and multiple dies in a multiprocessor system. For example, the single-core ADSP-BF533 embedded processor has a budgetary price of $14.46 in the BGA package, while the dual-core ADSP-BF561 embedded processor has a budgetary price of $20.40 in the BGA package. Two ADSP-BF533 devices will cost $28.92, versus $20.40 for a dual-core device [38][39]. Multi-core devices can also be smaller than the equivalent multiprocessor system.
For example, the TI C6474 tri-core DSP has a 23 mm x 23 mm package, while the older single-core version, the C6455, has a 24 mm x 24 mm package [40]. Granted, the C6474 is fabricated at a 65 nm process node while the C6455 is fabricated at 90 nm, which means the same circuitry will be almost half the size [41]; however, the PCB area consumed by three C6455 devices is 1728 mm², while the PCB area consumed by a single C6474 device is 529 mm² [42]. Three cores in a single package occupy 30.6% of the area of three single-core devices, and this isn't factoring in the supporting components, such as decoupling capacitors, required by each individual device.

[37] http://www.analog.com/static/imported-files/data sheets/ADSP-BF531 BF532 BF533.pdf
[38] http://www.analog.com/en/processors-dsp/blackfin/ADSP-BF533/processors/product.html
[39] http://www.analog.com/en/processors-dsp/blackfin/ADSP-BF561/processors/product.html
[40] http://www.eetimes.com/electronics-products/embedded-tools/4108275/Multi-core-DSP-offers-3-GHz-performance-at-1-GHz-price

Architectural Factors & Challenges

Despite the benefits, migrating to a multi-core processor is typically not a trivial task. There are several technical issues related both to the structure of the application (the manner in which the application is partitioned across the cores) and to the dynamics that emerge from new kinds of interactions that are possible on multi-core devices that share hardware resources and run software concurrently.

It's worth noting that when looking at hardware/software systems, several innovations appear to be incremental or modular within the scope of one part of the system. However, when taking into account the full hardware/software system, innovation is almost always architectural or radical to some degree. It's almost impossible to change software without changing the nature of its interaction with hardware, however minuscule the change might be, and vice versa.
For example, an incremental innovation in hardware, like an increased processor clock speed, can change the timing of the entire software application, and the software may need to be retuned to run optimally.

The following section will explore architectural challenges related to multi-core adoption, as shown in Figure 25. It will first examine the challenges related to the structure of the system, which, from a software perspective, is really how the application is partitioned across the cores. From there, it will explore challenges related to the dynamics of the system, which emerge from the new types of interaction between software components, as well as between software and hardware, that become possible in multi-core systems.

[41] 65² = 4225; 90² = 8100 (die area scales roughly with the square of the process node)
[42] 23 mm x 23 mm x 1 = 529 mm²; 24 mm x 24 mm x 3 = 1728 mm²

Figure 25: Multi-core architecture challenges (Structure / Partitioning; Dynamics / Interactions; System Optimization & Debug)

Structure / Partitioning

One of the biggest questions programmers confront is how to decompose and rearchitect an application so that it can be effectively spread across multiple processor cores and deliver increased benefit over existing or alternative implementations. For applications that don't have a great deal of obvious data or task parallelism (or both), determining how to segment an algorithm becomes increasingly difficult because the partitioning boundaries aren't always readily recognizable. Furthermore, if the wrong decomposition strategy is used, it's possible that the application could actually deliver lower performance than a single-core implementation. The problem gets increasingly difficult as the amount of reuse and legacy code increases because, as we'll see, legacy code tends to be more integral in nature.
"More cores don't necessarily yield more processing power; as a matter of fact, adding cores may impair performance if the resulting device is not properly balanced." - Pekka Varis, CTO of Multi-core and Media Infrastructure at Texas Instruments [43]

Decisions made about decomposition and architecture will typically become increasingly expensive to change later in the design, as described in the section entitled Decomposition & Modularity on page 16. What makes this step particularly difficult for multi-core processors is the increased difficulty of predicting behavior and performance, a topic that will be discussed later in this section.

[43] http://www.sys-con.com/node/1802536

The difficulty of selecting the right decomposition strategy can vary significantly by application. There exist classes of applications for which the decomposition of the problem into pieces that can run concurrently is more obvious. The first class of applications can be characterized as having a high degree of data parallelism within the scope of the application task; these are commonly referred to as "embarrassingly parallel" applications. Certain applications with high degrees of data parallelism can be cleanly partitioned across multiple cores, particularly when the amount of functional interaction required between the cores is low compared to the functional interactions within the cores. In an image processing application, for example, a 4-core system could be configured so that each core processes one quarter of the image. The same technique can be applied to video processing.
The first generation of Cisco's Telepresence system (shown below in Figure 26) utilized a large number of Blackfin processor cores in a parallel configuration to encode and decode three streams of 1080p H.264 video with extremely low latency [44].

Figure 26: Cisco Telepresence System [45]

For these types of applications, Amdahl's Law, named after computer architect Gene Amdahl, provides a theoretical limit to the performance gain that can be achieved. Amdahl's Law states that the potential speed-up of an application as the number of cores increases is ultimately limited by the sequential code in the application, which cannot be parallelized. As more cores are added, the execution time of the parallel sections of the code becomes negligible, but the execution time of the sequential code remains constant. The equation below estimates how much an application can be sped up as a function of P, the fraction of time that the processor is executing parallel code, and N, the number of cores (Amdahl, 1967).

Speedup = 1 / ((1 - P) + P/N)

Equation 1: Application speedup as a function of cores and parallelizable code (Amdahl's Law)

For example, if an image processing application consisted of 70% parallel code and was run on a 4-core machine, the expected speedup would be 2.11x:

Speedup = 1 / ((1 - 0.7) + 0.7/4) = 2.11

Figure 27 below shows the relative increase in performance over a completely serial algorithm.

[44] More information on the usage of the Blackfin processor in this design can be found in the ADI press release: http://www.analog.com/en/processors-dsp/blackfin/content/cisco telepresence vision becomes reality/fca.html?language dropdown=en
[45] Image retrieved from: http://newsroom.cisco.com/dlls/2009/prod 022009.html
Figure 27: Speedup relative to the serial case vs. number of threads (Gove, 2011)

There is a second class of applications characterized by a set of independent tasks operating simultaneously. These applications scale well with the number of cores, even if each task is sequential in its own nature, because each task can run in its own thread on another core and no task needs to 'wait' until another completes before processing its data. Examples would be a web server, in which several threads each process arriving requests, or a network processor, in which the processing of arriving packets can be assigned to available cores (Sutter, 2005). The scaling of this type of application as the number of processors increases is related to Gustafson's Law, named after computer scientist John Gustafson (Gustafson, 1988).

Generality vs. Performance Tradeoffs

Generality is a characteristic of software which describes how easily it can be used across platforms without changes. Clearly, in the domain of multi-core, where platform changes can require significant software changes, the concept of generality is an important one. Unfortunately, increasing the generality of software also negatively impacts both its performance and the productivity of those developing it (McKenney, 2010). For applications which are pushing the performance curve in the embedded space, a highly modular and abstracted architecture may not be an option.

Legacy Code / Re-Use

Determining the right partitioning strategy for a multi-core application is difficult to begin with. The problem gets increasingly difficult as the amount of reuse and legacy code increases.
Alex Bachmutsky, a chief architect at Nokia Siemens Networks, recently commented: "one of our problems is how to parallelize existing programs with eight or 10 million lines of code--you can rewrite everything but it will cost a lot of money." (2011)

Modular software can require less work to migrate to a multi-core architecture when its modular decomposition lends itself well to concurrent execution, which isn't always the case (Rajan, et al., 2010). Existing embedded applications, however, tend to be more integral in nature for several reasons.

First, code developed by small software teams tends to be integral because of the high level of communication between team members. Alan MacCormack's research on Linux and Mozilla suggests that software development occurring at a single location, where engineers can easily communicate, results in more integral designs. A single team can solve problems face-to-face and can easily coordinate changes to the architecture of the software in order to improve performance. These types of interactions naturally lead to tighter coupling between components in the system (MacCormack, et al., 2006). This research has implications for firms with legacy code that they wish to migrate to a multi-core processor. Given that most embedded projects have just a handful of software engineers (UBM/EE Times Group, 2010), it suggests that, more likely than not, these existing software designs are integral in nature because they're developed by a small number of engineers, a majority of whom are likely co-located. If so, these organizations face more difficult challenges in performing the migration.

Second, software applications have a tendency to "decay" over time, due both to changes made to the system within which the software operates and to incremental feature changes.
This decay manifests itself as a breakdown of modularity, increased effort to make changes, increased fault potential, and an increase in the span of required changes (Eick, et al., 2001). Finally, many embedded applications are written under performance and real-time processing constraints (UBM/EE Times, 2011), which may require a less modular design (McKenney, 2010).

Integral applications can be the most difficult to repartition because they are not clearly partitioned to begin with. Embedded software developers may need to perform a substantial amount of work decomposing and rearchitecting the software. In some cases, it may be easier to start over rather than trying to rework the existing legacy code.

A majority of embedded software still targets single-core processors (UBM/EE Times Group, 2010), which implies that a lot of existing legacy code was originally written for a single-core processor. As we will discuss in the next section, software architected for a single core may encounter new failure modes in a multi-core system, and this code may need to be reworked to run reliably on a multi-core architecture. If the code was sourced externally, the firm may have access only to the compiled object code and not the source code, which makes porting and debugging more difficult, particularly when the firm that developed the code is unable or unwilling to support or update it. If there is existing code that has not been reworked for a multi-core processor, large-scale locks may need to be placed around these sections of code to protect against these new failure modes.

A prime example of such a situation can be seen in the history of the Linux operating system. When SMP support was first added to the Linux kernel, several pieces of existing code were subject to new and unfamiliar types of failure modes not previously possible on a single-core processor.
To protect against these, a Big Kernel Lock (BKL) was implemented as a crude spin lock to ensure that only one core was running in kernel mode, thus halting any concurrent operations in other cores that could lead to these new concurrent failure modes. Subsequently, several modules within the Linux kernel have been modified to support finer-grained locking, and the BKL is retained to support old code (Bovet, et al., 2006).

Dynamics / Interactions

This section will explore some of the new challenges related to the types of interactions that are possible in multi-core systems.

Data Races

A data race is one of the most common bugs found in applications running on multi-core processors (Gove, 2011). Data races occur when tasks running on two different cores attempt to modify the same piece of data in memory at the same time. In a threaded application running on a single core, this type of failure isn't possible because only one thread is accessing memory at any one time. On a multi-core system, however, two threads or tasks can execute simultaneously, and thus they can simultaneously access memory and other system resources.

Shared Resources

Consider a situation in which tasks running on the various cores of a multi-core processor all require access to external memory, but the memory interface and/or bus structure of the chip has limited bandwidth. As more cores are added, the tasks running on those cores will need to wait longer for external memory accesses to complete, and thus overall system performance won't scale linearly with the number of cores. Studies have shown that memory bandwidth and memory management schemes are perhaps the strongest limiting factors in the performance of multi-core systems. The data shown in Figure 28 is from a study showing how multi-core performance begins to degrade after 8 cores; by 16 cores, performance is on par with a dual-core device (Sandia Labs, 2009).
Figure 28: Multi-core performance vs. number of cores [47]

The example below demonstrates a similar phenomenon. Three processor cores are running independent threads that share a common memory bus and cache. As the number of cores increases, the performance of each core decreases through the use of shared resources. In this case, cache thrashing and memory pipelining slow down the execution rates of the threads on each core (Uhrig, 2009).

Figure 29: Performance impact as a function of cores, 2-3 threads per core, showing single-thread performance vs. overall performance gain [48]

System developers not only need to understand how to partition their application across several cores, but they also need to think about memory performance. Bus and memory bottlenecks like the ones described above can have adverse impacts on application performance if not understood and accommodated in the overall system design.

[47] (Sandia Labs, 2009)
[48] (Uhrig, 2009)

Synchronization and Inter-Core Communication

The amount of interaction between code modules running on different cores increases the challenges of partitioning the application for several reasons. First, the cost of interactions between cores in multi-core systems is typically higher because of the performance of shared memory and the requirements for synchronization (McKenney, 2010). As the number of threads working together on the same problem increases, the number of synchronization events between those threads will likely increase as well. The cost of synchronization can begin to eliminate the performance gains from additional threads, as shown in Figure 30 (Gove, 2011).
Figure 30: Thread scaling with exaggerated synchronization overhead [49]

Cache Thrashing

Another problem is cache thrashing. Cache thrashing can occur when the data a core is operating on is larger than its cache, so data needs to be regularly loaded from slower L2 or external memory, as shown in Figure 14 on page 44. If the operating system migrates a thread from one core to another, the thread loses its data locality: the instructions and data related to that thread were cached locally in the other core (QNX, 2010). Marcus Levy of the Multicore Association recently noted, "Some of the telecoms OEMs are really struggling because they are finding in the shift from using two single-core chips to one multi-core chip performance is going down. That's because they now have to share resources like caches." [50]

[49] (Gove, 2011)
[50] http://www.eetimes.com/electronics-news/4076675/Groups-debut-multi-core-API-benchmarks

There are several other technical challenges that are emergent in nature, but this section has provided a summary of the key ones.

System Optimization and Debug

Because multi-core systems can be less predictable than single-core designs, system optimization is often an essential step of a multi-core design. Furthermore, system debug is a notoriously painful process in embedded system design, and for multi-core processors it can be even more difficult because the tools aren't as mature, interoperability between the tools is weak, and the types of bugs that emerge from interactions can be especially tenacious and difficult to track down.

Predictability

The static analysis techniques used in single-core designs, particularly for safety-critical systems with hard real-time constraints, are significantly more difficult to apply to multi-core devices and, in some cases, are infeasible. The difficulty originates from the new forms of interaction across shared resources.
As several cores access a shared resource, it's extremely difficult to capture all potential sequences of accesses to that resource. Each access can change the state of the resource, and different sequences of accesses may take different amounts of time to complete (Cullmann, et al., 2010). The unpredictable nature of multi-core design is challenging at the system level because unanticipated interactions can degrade system performance to the point where a single-core device offers more performance than several cores in a multi-core device, as we saw in the previous two sections. For embedded systems with real-time requirements, moreover, the inability to guarantee that certain operations will always complete within a set amount of time may be unacceptable.

System Tuning and Debug

System tuning and debug are already time-consuming tasks in an embedded design and only get more difficult with multi-core. As noted above, there is a certain level of unpredictability inherent in multi-core design due to the complexity of possible interactions over shared resources. Tuning strategies also tend to be coupled to the architecture. For example, consider a situation in which two cores are working together on a processing task. If it takes 100 cycles to synchronize with the other core (a function of the processor architecture and the synchronization mechanisms), it may be possible to pipeline the algorithm so that it performs useful computation during those 100 cycles and thus wastes no processing time. If this application were ported to another dual-core architecture with different memory performance and clock speed, so that the number of clock cycles needed for synchronization changed, the application would no longer be optimally pipelined and would spend time waiting for either the synchronization or the compute operation to complete.
Tracking down the sources of race conditions and deadlocks on multi-core processors can be very difficult because these events may be sporadic and, when they do occur, are dependent on the state of several processors, not just one. Different problem-solving techniques are often needed to effectively debug these new types of problems.

Diversity and Quality of Tools and Methodologies

The most time-consuming part of embedded design is testing and debugging the system. If embedded engineers could improve one thing about an embedded design, they would likely improve the debugging tools (UBM/EE-Times, 2011). As noted earlier, debugging single-core designs is challenging, and because there is a great deal of variety in the debug features of processors and development tools, a design team typically develops a set of problem-solving techniques around the functionality of those tools. There are several new challenges in developing tools for multi-core system development, and this is reflected in the state of tools presently available for embedded multi-core processors. In the general-purpose space, which has a small number of processor architectures and operating systems, tools have become fairly sophisticated. Windows developers have access to a broad range of tools, libraries and resources provided by Microsoft, Intel and several third parties for authoring Windows applications to run on multi-core processors. Microsoft's Visual Studio 2010 ships with a parallel performance tuning tool51 and Intel has their Parallel Composer that works with Visual Studio to provide compilers, libraries and debugger support for multi-core processors52. In the embedded space, different suppliers employ different strategies to provide tools and the landscape is far more fragmented.
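The race conditions discussed at the start of this subsection come from unsynchronized read-modify-write sequences. As a minimal, hypothetical sketch (in Python for brevity; an embedded design would typically use pthread mutexes or the equivalent RTOS primitives), the lock below makes the counter update atomic. Removing it makes the final count wrong only occasionally and in a way that depends on scheduling across cores, which is exactly why these bugs are so hard to reproduce and track down.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        # Without this lock, the read-increment-write below could
        # interleave across threads and silently lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert counter == 4 * 100_000
```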
Some suppliers leverage open-source tool chains like GCC and debug environments like GDB/Eclipse while other suppliers develop and maintain proprietary tool chains. This has a few important implications. Just like a processor architecture, tool chains need to be learned and understood. Different tools have different features that allow users to solve problems in different ways. Therefore, there is some degree of tacit knowledge that is built around not only the processor architecture but also the corresponding tool chain. Migrating to a new processor that is similar in architecture but different in tools support may require a similar amount of effort to learn both how to code for performance and how to debug system issues with different types of debug features and capabilities. In a recent panel discussion as part of the EE Times "Approaching Multi-core" virtual conference, several industry pundits commented on the state of multi-core tools. Because of the unpredictable nature of applications running on multi-core processors, simulation and modeling tools have become increasingly important to help guide partitioning decisions and predict performance. Tasneem Brutch53 recently noted that there are several new types of debugging, analysis, development, profiling and auto-tuning tools that are used for multi-core design and that almost every multi-core application requires a unique configuration of tools. There is also very little interoperability between these tools, which makes analysis difficult (2010). Tom Starnes54 noted that accurately simulating and modeling the performance of multi-core systems is very difficult, particularly because the interactions between the cores become very complicated.

51 http://www.microsoft.com/downloads/en/details.aspx?FamilyID=8FFC2984-AO5C-4377-8C699A8BOD2B5D15
52 http://www.informationweek.com/news/hardware/processors/217700703
In addition to simulating the cores themselves, simulator tools need to be able to accurately model cache, crossbars, memory, I/O and interrupts as well (2010).

53 Tasneem Brutch is a senior staff engineer at Samsung Research and also set up and chaired the Tools Infrastructure Working Group in the Multicore Association
54 Tom Starnes is a Processor Analyst at Objective Analysis

Adoption Scenarios

This section will examine four different adoption scenarios. We will explore the nature of the architectural change to determine architectural factors that can impede or expedite adoption of multi-core devices. The following section will then explore additional factors and mechanisms which affect adoption in the adopting organization, the surrounding ecosystem and beyond.

Current Multi-core Adoption Patterns

Multiprocessor designs are common in embedded systems. According to the EE Times 2010 Embedded Market Study, only 50% of projects in 2010 used a single processor. The average number of processors in a project in 2010 was 2.6, and this number hasn't changed dramatically in the last four years (UBM/EE Times Group, 2010). Because the functions of an embedded system are well understood at design time, various computing tasks can be placed on different types of embedded processors that may be suited for the task. Despite the benefits of multi-core devices discussed earlier, the transition to multi-core has been happening slowly across the industry. In 2010, 15% of designs using multiple cores used a multi-core device, up from 9% in 2007 (UBM/EE Times Group, 2010). In some areas, adoption is rapidly outpacing this industry adoption rate of 6%.

Case 1: Existing Multiprocessor Hardware Design Pattern

The first design scenario is porting a multiprocessor design to a multi-core processor that mirrors the existing multiprocessor system architecture.
In this scenario, the class of processors within the multi-core device, and the means by which they communicate with each other and the rest of the system, remain mostly intact. We essentially have an existing architectural pattern that is being migrated to the multi-core design. As we will see, this scenario represents one of the easier cases because most of the knowledge assets associated with the architecture can be preserved. However, the degree of difficulty is heavily impacted by the extent to which new types of interactions are possible and occurring. This scenario can commonly be seen in two forms. The first form is an existing system utilizing a heterogeneous collection of discrete processors that is migrated to a heterogeneous multi-core device, as shown in Figure 31 below.

Figure 31: Case 1, heterogeneous architecture

A common application example of this scenario can be seen in the DSP + microcontroller architectural pattern that, from the author's professional experience, has become a common system architecture over the last several decades. This pattern can be seen in several applications such as handsets, surveillance cameras and consumer multimedia products. In this system architecture, a microprocessor (typically an ARM) is running an operating system, managing communication and running applications. The DSP is typically running computationally intensive algorithms specific to the application. In a handset, the DSP is running the digital modulation and demodulation algorithms. In an IP camera, the DSP is running video encoding and analytics algorithms. In the last decade, a class of heterogeneous processors has emerged which contains both the DSP and the microprocessor, like the Texas Instruments OMAP family of devices55. In the second form, an existing system utilizing a homogeneous collection of discrete processors is migrated to a multi-core device, as shown in Figure 32 below.
Figure 32: Case 1, homogeneous architecture

Homogeneous arrays of discrete processors are commonly used in performance-driven applications where the processing is relatively homogeneous in nature and can be divided between a homogeneous collection of processors. A common application example of this configuration can be seen in wireless communications applications where several DSPs may be required for baseband processing. The Texas Instruments TMS320C6474 tri-core DSP is optimized for communications processing and utilizes three C64+ DSP cores. It is designed to be a multi-core migration path for multiprocessor designs built around the single-core TMS320C6455, which also utilizes the C64+ DSP core56.

55 Texas Instruments OMAP Processors: http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=6123&navigationId=11988&contentId=4638
56 TMS320C6455 to TMS320C6474 Migration Guide: http://focus.ti.com/lit/an/spraav8/spraav8.pdf

New Interactions

In some cases, multiple processor cores are integrated into a multi-core device but the manner in which they connect to each other and the surrounding devices is almost identical to the multiprocessor architecture.

Figure 33: Case 1, no new resource sharing

An example of this type of architecture can be seen in the ADSP-14060 DSP, which is a multichip module containing four ADSP-21060 processors. The processors are connected within the ADSP-14060 exactly as they would be in a multiprocessing configuration utilizing four ADSP-21060 devices. In the author's experience, migrating an existing software architecture built for the equivalent ADSP-21060 multiprocessor system to this multi-core device is typically a trivial process.
Figure 34: ADSP-14060 quad-processor DSP57

If the multi-core device does not introduce any new possible forms of interaction between the existing system software and the hardware, the architectural change required is minimal and the migration from a multiprocessor to a multi-core design can happen very quickly.

57 http://www.analog.com/static/imported-files/data sheets/ADD14060 1 4060L.pdf

Most multi-core processors, however, do utilize shared resources, and this can introduce some new types of interaction between the cores that weren't possible in the previous multiprocessor system architecture.

Figure 35: Case 1, new resource sharing

The tri-core TI C6474 is designed to replace three single-core C6455 DSPs. The three cores, however, now need to share several resources which wouldn't have been shared in a multiprocessor architecture. As shown in Figure 36 below, all three cores now share a common memory bus. The external memory bus on the tri-core device is slightly faster, running at 667MHz versus the memory bus on the single-core device, which ran at 533MHz; but this memory bus is now being shared by three cores. The L2 memory is also smaller per core: the single-core C6455 has 2MB of L2 cache, while the C6474 has 3MB of L2 cache utilized by three cores58.

Figure 36: C6474 block diagram59

An existing system design that was based on three C6455 DSPs may behave very differently once ported to a single C6474 device because of the increased sharing and contention. All three cores are sharing a common memory bus and each core, on average, has half the amount of L2 cache. If the existing applications running on the C6455s were not memory intensive, the rate at which the software interacts with these shared memory resources will be lower and may not significantly change the behavior of the system.
However, if the C6455s were running memory-intensive applications, the level of interaction with the shared resources will be high. Contention on the external memory bus and the smaller per-core L2 cache may have more profound effects on the system performance. The difficulty of adopting multi-core devices in this scenario is heavily impacted by the nature of the new interactions that are possible in the new multi-core architecture and the frequency with which these interactions occur. System performance may degrade as a result of these interactions, and the amount of architectural change required to mitigate these degradations will increase as the rate of interactions increases. Examining this transition through Kim & Henderson's framework on innovation, a great deal of existing component and architectural knowledge assets are preserved through the migration. At the processor level, this innovation could be characterized as 'incremental' if no new interactions emerged, or 'architectural' if new interactions emerged that required new knowledge to address. Shared resources allow new interactions between components of the system that previously did not directly interact, meaning that existing architectural knowledge may be destroyed and new architectural knowledge of these interactions may be required to successfully complete a design. In summary, migrating to a multi-core processor whose architecture mirrors the existing system architecture, as discussed in this first scenario, is one of the easier cases because oftentimes the firm's existing knowledge assets can be preserved.

58 TMS320C6455 to TMS320C6474 Migration Guide: http://focus.ti.com/lit/an/spraav8/spraav8.pdf
59 http://www.eetimes.com/electronics-products/embedded-tools/4108275/Multi-core-DSP-offers-3-GHz-performance-at-1-GHz-price
However, as new resource sharing requirements are placed on the software running on each core, the shift becomes more architectural in nature, which can require new architectural knowledge and problem-solving techniques.

Case 2: Existing Software Design Pattern

The second migration scenario is a transition from a single-core to a multi-core SMP architecture where existing applications are built on top of a POSIX-compliant multi-threaded operating system with SMP support.

Figure 37: Case 2, symmetric multiprocessing

The combination of POSIX compliance and SMP in an operating system provides an abstraction layer that, in theory, allows a series of applications developed against this software design pattern to be migrated to a multi-core processor without fundamentally changing the architecture, as shown in Figure 38. The operating system or the user can allocate threads and processes to different cores and the system runs in a similar manner to a multi-threaded application on a single-core processor.

Figure 38: Migration from single-core to dual-core with SMP & POSIX60

In practice, there are several subtle issues, noted earlier in the section on Dynamics / Interactions, that can dramatically impact the ease of migration and the number of architectural changes required for the system to run reliably and at a higher level of performance than the single-core implementation. However, these issues can be mitigated by keeping processes and threads pinned to individual cores. This is known as 'process affinity' and 'thread affinity' (Gove, 2011) (QNX, 2010).
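A minimal sketch of pinning in practice, assuming a Linux host (Python's os module wraps the Linux sched_setaffinity system call; a C design would call sched_setaffinity or pthread_setaffinity_np directly, and QNX and other RTOSes expose their own equivalents):

```python
import os

original = os.sched_getaffinity(0)   # remember the current CPU mask

# Pin the calling process to a single core; the scheduler will no
# longer migrate it, preserving the single-core execution model (and
# its cached working set) for this process.
core = min(original)
os.sched_setaffinity(0, {core})
assert os.sched_getaffinity(0) == {core}

# Restore the original mask to hand load balancing back to the OS.
os.sched_setaffinity(0, original)
assert os.sched_getaffinity(0) == original
```

Pinning trades away the operating system's load balancing, but it is a cheap way to rule out migration-induced cache and timing effects while a design is incrementally moved toward full SMP operation.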
While this prevents the operating system from allocating threads to different cores for load-balancing purposes, it does preserve the single-core execution model for each process. This can be used as a stepping-stone to getting an existing application running in full SMP mode on a multi-core processor. A segment where this type of transition has been happening rapidly is the smart phone and tablet markets. In 2011, 15% of smart phones are expected to ship with a multi-core processor, and this number is expected to rise to 45% by 2015 (Strategy Analytics, 2011). Many of the major tablet manufacturers are also moving to multi-core in 2011. The Apple iPad 2 utilizes the Apple A5 dual-core processor; LG, Samsung and Motorola are developing tablets around NVidia's dual-core Tegra 2; the Research In Motion (RIM) PlayBook utilizes the Texas Instruments dual-core (Cortex-A9) OMAP 4430 (EE-Times, 2011). The operating system on Apple's iPad 2 is iOS, which is an SMP, POSIX-compliant operating system61. RIM's PlayBook runs QNX, which is an SMP, POSIX-compliant operating system62. The Xoom from Motorola runs Android, which is built upon Linux, a mostly POSIX-compliant SMP OS63. ARM and ST-Ericsson have been working to improve SMP support in the Android kernel (ARM, 2010). Most of these tablet manufacturers haven't fully opened up multi-core programming to the application developer community and are taking an incremental approach. For example, Android applications are written against Dalvik, which is the process virtual machine in Android. Dalvik processes currently run on only one core of a multi-core system which, like process affinity, helps protect against certain software-software and software-hardware interactions that can hinder performance and reliability (Moynihan, 2011).

60 Adapted from a diagram in htim//www.iomagazineonIine.com/IQ/IO I7/LowRes.pdfs/Pgs34-37 1017xd
The lack of determinism in multithreaded applications can be compounded by resource sharing and cache thrashing scenarios. Developing applications with real-time requirements against the POSIX/SMP design pattern can be significantly more difficult in a multi-core situation because of the unpredictable nature of the new interactions (Cullmann, et al., 2010). Similarly, if the application is performance driven, pieces of the system may need to be rearchitected to fine-tune the synchronization mechanisms, cache utilization and shared resource utilization. In summary, following the multi-threaded, POSIX-compliant application design pattern may allow for the preservation of architecture and architectural knowledge. There are still several challenges, noted in the previous sections, that can be encountered when moving from a multi-threaded software architecture and which require architectural changes and new problem-solving techniques; however, through task- and process-affinity techniques, a developer can take incremental steps toward a full SMP system by manually pinning processes and threads to individual cores, reducing the number of possible interactions.

61 http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Multithreading/CreatingThreads/CreatingThreads.html
62 http://www.engadget.com/2010/09/27/rim-introduces-playbook-the-blackberry-tablet/
63 http://www.motorola.com/staticfiles/Consumers/xoom-android-tablet/us-en/overview.html

Case 3: No Existing Patterns with Legacy Code

The third design scenario is one in which no existing design pattern exists. In this situation, an existing single-core design is being migrated to either a homogeneous or heterogeneous multi-core processor.

Figure 39: Case 3, migration to homogeneous and heterogeneous scenarios

Unlike cases 1 and 2, there are no existing design patterns (hardware or software) that can be used to facilitate migration to this platform.
Furthermore, this scenario involves legacy code which, as noted earlier, may be more integral in nature and more difficult to partition. An example of this scenario can be seen in performance-driven applications where single-core processors no longer offer enough processing power and the only viable solution is to add more cores. The existing application needs to be re-partitioned and re-architected so it can be distributed across the cores. From here, there are several challenges with respect to the dynamics of the application, as noted above. If the existing single-core application is modular and the decomposition aligns with a viable partitioning strategy, the level of architectural change may be reduced. However, in most circumstances, this type of migration affects both component and architectural knowledge and thus is radical in nature. Adoption in these scenarios will be costly, adoption rates will be slow and penetration will lag, as discussed in the Legacy Code / Re-Use section above, and the cost and technical difficulty of adoption may negate any positive benefit that a multi-core architecture could offer.

Case 4: No Existing Patterns with New Code

Case 4 is very similar to Case 3, but here we are dealing with a new software architecture that can be architected to reflect the processor architecture it will run on.

Figure 40: Case 4, new development to homogeneous SMP, homogeneous AMP or heterogeneous multi-core

The existence of legacy code can impact the difficulty of adoption, particularly if this legacy code hasn't been developed to run in a multi-core or multithreaded environment. Furthermore, if the legacy code was sourced externally, it may be difficult to understand the types of interactions possible without access to source code, which may add increased risk to the project.
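When code is new, as in Case 4, the partitioning can be made a parameter of the architecture rather than hard-wired into it. The sketch below is illustrative only (the per-chunk kernel is a stand-in, not any real codec or filter, and a real embedded design would use a native thread pool or OS tasking API): the same source fans work out over however many cores the target offers.

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(chunk: list) -> int:
    # Stand-in for the real per-core workload (e.g. a filter stage).
    return sum(x * x for x in chunk)

def run(data: list, n_cores: int) -> int:
    """Split `data` into one chunk per core and fan the chunks out.

    Keeping the core count a parameter lets the same source target
    dual-core, quad-core or N-core parts without rearchitecting.
    """
    size = (len(data) + n_cores - 1) // n_cores
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        return sum(pool.map(kernel, chunks))

data = list(range(1_000))
# The result is independent of the partitioning.
assert run(data, 2) == run(data, 4) == sum(x * x for x in data)
```

This only works when the problem decomposes into independent chunks; the architectural decision to structure new code this way is exactly the freedom that Case 4 has and Case 3 lacks.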
Summary of Adoption Scenarios

If an existing architectural design pattern exists, either in the form of hardware and software as we explored in case 1, or in the form of software over an SMP abstraction layer as we explored in case 2, several elements of the architecture can be preserved in the migration from single-core to multi-core, which also means that existing knowledge assets and problem-solving techniques can be applied as well. There are factors in each scenario that we explored which can impact the degree of new architectural knowledge that is required for a successful transition (system functionality and increased performance on the desired parameter); however, the scale of the new architectural knowledge is small compared to cases 3 and 4. As such, we have observed that the transition to multi-core in these first two cases has been easier for industry, and adoption can happen broadly across one generation of products, as we saw in tablets. If no architectural pattern exists, it is possible that a new architecture will be required and, with it, new architectural knowledge and problem-solving skills. Modularity in legacy code that aligns with partitioning strategies can reduce the extent to which the architecture needs to change. Figure 41 below provides a summary of the four adoption scenarios covered in this section.

Figure 41: Multi-core adoption scenarios

8. System Level Factors and Challenges

In the previous section, we examined technical and architectural factors that can significantly impact the difficulty of migrating to a multi-core processor.
We explored several scenarios which varied in the degree of difficulty, ultimately determined by the availability of existing design patterns, the existence of new types of interactions that were not present in the existing design, the partitionability of the application, legacy code and code-sourcing strategies. This next section explores several non-technical factors that impact multi-core adoption. It examines the organizational and managerial factors and dynamics within a firm adopting a multi-core architecture. From there, it examines the factors and dynamics surrounding the firm, such as the suppliers and value chain. Finally, it examines factors and dynamics related to human cognition and behavior that affect adoption.

Figure 42: Layers of adoption factors & dynamics

Factors and Dynamics within an Adopting Firm

Beyond the technical and architectural elements, there are several other factors that may influence a firm looking to adopt an embedded multi-core processor.

Diversity of Platforms & Design Space Mobility

Processors have certain characteristics that can be used to differentiate them from other processors within a given product class. For example, the operating frequency and instruction-set architecture of an embedded processor are two key characteristics that can be used to delineate several subclasses of processors. Other defining characteristics of embedded processors include the memory size, the memory architecture (cache, shared memory, performance), and the I/O peripherals. Baldwin and Clark define an artifact's 'design space' as the total number of designs that can be constructed using all possible combinations along these dimensions (2000).
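Baldwin and Clark's definition is combinatorial: the size of the design space is the product of the number of options along each dimension. A toy illustration (the option counts below are hypothetical, chosen only to mirror the characteristics listed above):

```python
from math import prod

# Hypothetical option counts along the dimensions named above;
# these are illustrative, not vendor data.
dimensions = {
    "operating_frequency": 6,
    "instruction_set_architecture": 4,
    "memory_size": 5,
    "memory_architecture": 3,
    "io_peripherals": 8,
}

design_space_size = prod(dimensions.values())
assert design_space_size == 2_880   # 6 * 4 * 5 * 3 * 8
```

Each new dimension multiplies, rather than adds to, the number of possible designs, which is why the extra dimensions multi-core introduces enlarge the space so sharply.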
An important characteristic of single-core embedded processors is that, while they have a relatively large design space, the process of moving an application written in C from one processor to another generally allows the architecture of the software to be preserved, particularly when moving to a processor with more memory and performance. A certain amount of tuning is typically necessary, but applications don't necessarily need to be rewritten from scratch in this scenario. While the design space for single-core embedded processors is quite large, migrating between locations within the design space doesn't necessarily require rearchitecting the software. With multi-core processors, a few additional key dimensions are added to the design space: the number of cores, the types of cores and the means by which the cores communicate. A software architecture that has been optimized and partitioned for two cores, for example, will likely have a different architecture if the application were split across four cores. When all of the cores are identical, migrating to a multi-core device with a different type of processor but similar fundamental capabilities, like ARM to MIPS, may not require a great deal of rearchitecting. However, migrating to another multi-core device which uses a heterogeneous collection of cores with differing capabilities may require a fundamental rearchitecting of the system because the performance of the software may change more dramatically on a different ISA. Finally, different memory structures and latencies may mean that a certain partitioning and tuning strategy for an N-core architecture may run very differently on another N-core architecture with similar core types. Much of the system may need to be rearchitected when migrating to a multi-core device. And more importantly, it may need to be rearchitected again when moving to a device with a different number of cores.
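The repartitioning cost appears even in the simplest case of a static data split. A hypothetical sketch: the chunk boundaries a team tuned for a dual-core part (around which buffer placement, DMA descriptors and synchronization points were laid out) do not survive a move to a tri-core part such as the C6474 discussed earlier.

```python
def partition(n_items: int, n_cores: int) -> list:
    """Evenly split n_items across n_cores as (start, end) ranges."""
    base, extra = divmod(n_items, n_cores)
    bounds, start = [], 0
    for core in range(n_cores):
        end = start + base + (1 if core < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

# The mapping tuned for two cores...
assert partition(1_000, 2) == [(0, 500), (500, 1_000)]
# ...shares no interior boundary with the three-core mapping, so the
# buffer layout and sync points built around it must all be reworked.
assert partition(1_000, 3) == [(0, 334), (334, 667), (667, 1_000)]
```

And this is the easy case: real partitionings also encode core types, memory latencies and communication paths, none of which a simple index split captures.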
In the past, embedded developers could move around the existing design space by keeping their applications in C and typically making modular enhancements to their applications. With multi-core, migrating to new nodes in the design space will typically be more architectural and radical in nature.

Platforming Limitations

Consider an end-product platforming strategy based on single-core devices. A low-end product could be based on a slower processor running a subset of features; a more expensive product could be based on a faster processor with more software features. Because of the decreased mobility in the multi-core design space, scaling a platform by adding more cores is significantly less trivial. A low-end product based on a dual-core processor and a high-end product based on a four-core processor without using SMP will likely require two different software architectures. If SMP is used, it's possible that the performance increase gained from moving from two cores to four cores is not significant, due to the scaling limitations of SMP noted earlier.

Platform Flexibility, Coupling & Second Sourcing

Having a viable second source for components within a system design is a commonly used risk-mitigation strategy in the electronics industry. A product's success is dependent on the ability of the suppliers to deliver working components on time and at the agreed-upon price. If a supplier fails in any of these regards, it can jeopardize the entire product. Second sourcing is typically easier for simpler, more standard components like resistors and capacitors because there are several suppliers who make interchangeable products. However, second sourcing gets increasingly difficult for custom-designed and complex components that may be unique to a single supplier. Processors are particularly unique in this regard because there are several other forms of coupling that exist.
The software developed for that processor may also require a custom architecture if the processor's design is a departure from the more dominant design patterns, like a many-core processor. There are also several knowledge assets within the firm which are coupled to a processor. Engineers have learned the architecture and, more importantly, they've learned the design tools, the ecosystem partners and how to solve problems. Switching to another processor vendor who offers a similar processor architecture with different tools and a different third-party network can still incur a significant switching cost. Migrating between the myriad of new ARM multi-core SMP processors can be almost trivial if you're running an SMP OS like Linux, where most of the ARM tools and ecosystem are available regardless of vendor. However, migrating between other multi-core processors that don't have a similar pervasiveness is far more difficult. The cost of migrating an architecture and a firm's knowledge assets to such a platform will be high and, more importantly, the costs of then switching to a different architecture may be equally high.

Platform Evaluatability and Learnability

An adopting firm will likely want to assess the performance of a new processing platform before they commit to adoption. This process typically involves taking existing algorithms, porting them to the new processor platform and evaluating the execution speed and efficiency. The difficulty of this process for the adopting firm can significantly impact the likelihood of adoption. In 2007, the author was visiting a large video surveillance company that was evaluating digital signal processors for a next-generation video analytics platform that could intelligently identify objects and people. The firm had a large amount of existing C code that was implemented as a single-threaded application and was highly integral. The firm was evaluating a single-core DSP from one firm and a dual-core DSP from our firm.
The processors had similar computational capabilities; the key difference was that the competing firm offered these capabilities on a single core where we had two cores, each with half the computational capability, that, when summed, were equivalent to the single-core DSP. The firm wanted to evaluate the performance of their existing application on both processors, and the evaluation report that came back weeks later showed that the dual-core processor had achieved half the performance of the single-core processor because the firm evaluating the platforms was not willing to reengineer their code to run on two cores just to evaluate a device. It was far easier for them to use the single-core DSP from the competing firm because they were able to quickly assess the type of performance that could be achieved with this platform. Radical architectures that can still provide a gentle-slope evaluation, allowing adopting firms to make short-term progress and incrementally achieve full performance, can be a key to success, since adopting firms may not be willing to invest a great deal of time to climb a cliff before they can begin to assess the likely performance that the platform will offer.

Hiring Software Engineers Who Know Multi-core

When it comes to multi-core development, most embedded engineers are new to the domain. VDC Research's 2010 Embedded Engineering Survey found that slightly more than half of embedded engineers have some level of experience with multi-core or multiprocessor designs, as shown in Figure 43, and half of those with experience have less than two years of it. A majority of those are developing systems that have historically used multi-core or multiprocessor devices, like networking, mobile phones, consumer electronics, and telecom and data communications systems (VDC Research, 2011).
Figure 43: Worldwide respondents working with/programming multi-core and/or multiprocessor designs (more than 4 years: 17%; 1-2 years: 8%; 6-12 months: 8%; 6 months or less: 12%)

For a firm embarking on a new multi-core design, they need to decide whether to bring in outside help in the form of contractors or new hires who have the right experience level, or essentially try to learn as they go.

Non-Deterministic Development Cycles & Costs

Because the performance of multi-core systems is unpredictable (as discussed earlier in the Task Parallelism section) and the new types of bugs that emerge can be particularly tenacious, software development cycles around multi-core processors can be unpredictable themselves, particularly for firms which haven't developed the knowledge and problem-solving skills around the technology. A VxWorks whitepaper cites a VDC research report finding that 25% of multi-core projects miss deadlines and functional targets by at least 50%64. Multi-core technology may not be compatible with tight product deadlines, particularly when the development team is new to multi-core.

Technology Tradeoffs & Alternatives

If migrating to a multi-core processor is going to both destroy and require the build-up of several knowledge assets within a firm, they may take a longer and deeper look at alternative technologies like field-programmable gate arrays (FPGAs) and hardware acceleration.

64 http://www.mistralsolutions.com/newsletter/ianl 1/servefile.pdf

As Jeff Bier of BDTI noted: "If system companies are faced with significant switching costs due to their chip vendors switching processor architectures, it's likely they're going to take the opportunity to see what competing chip companies have to offer."65

There are new classes of FPGAs emerging, for example, with cost structures that will be extremely competitive. The Xilinx Zynq-7000 'Extensible Processing Platform' features a dual-core ARM Cortex-A9 surrounded by 235K logic gates, shown in Figure 44.
This logic can be configured to run the computationally intensive tasks while control code runs within an SMP OS on the ARM Cortex-A9s, with the device selling for less than $15 at high volume.66

Figure 44: Xilinx Zynq-7000 Extensible Processing Platform

Processors and FPGAs are designed to solve different types of problems. While there is a great deal of overlap in the kinds of problems that can be efficiently solved with both platforms, there will always be cases where a processor is a better choice, like running an operating system or executing control code, and cases where an FPGA is a better choice, like running high-performance, fixed-point computational tasks. So while FPGAs may look attractive, a multi-core processor may still be a far more optimal solution; and if it is, a firm's competitors may be taking that route, in which case the firm may have no choice but to explore multi-core as well.

65 http://www.bdti.com/InsideDSP/2010/09/02/JeffBierImpulseRespfonse
66 http://low-powerdesign.com/sleibson/2011/03/01/xilinx-zynq-epps-create-a-new-category-that-fits-in-amongsocs-fpgas-and-microcontrollers/

Another technology alternative is hardware accelerators. A hardware accelerator is an implementation of a common system-level function in digital logic within a processor design, rather than as software that runs on the processor's cores. Hardware accelerators save embedded engineers a great deal of time because they don't need to source the function from a third party or author it from scratch. Most hardware accelerators target specific functions that have become standardized in the industry. For example, the H.264 digital video compression standard has stabilized over the last decade, and many embedded processors targeting certain applications will typically have a hardware accelerator that performs H.264 video compression (encoding) or decompression (decoding) so the processor doesn't have to do this in software.
For example, some variants of Texas Instruments' DaVinci processor, which target video surveillance, include a hardware accelerator that performs H.264 video encoding.67 The NVIDIA Tegra 2 processor, shown in Figure 45 below and which powers the Motorola Xoom, includes several accelerators for video encoding and decoding, audio processing, 2D and 3D graphics, and image processing.

Figure 45: NVIDIA Tegra 2 Processor

Another benefit of hardware accelerators is that they can typically deliver a much higher level of performance and use significantly less power than the same functions written in software. An EE Times study in 2008 found that accelerators offered a 200x performance increase over a software implementation at the same power dissipation level, or a 90% power reduction at the same performance level (Frazer, 2008).

67 http://www.security-technologynewscom/article/h264-encoder-digital-media-processors.html

Figure 46: Performance and Power Gains with Hardware Acceleration (Frazer, 2008)

Hardware accelerators are becoming much more common on embedded processors for two key reasons. First, they have become less expensive to implement. Years ago, a processor vendor would need some very large customers to justify the resources to add an accelerator; today, a smaller subset of customers can justify adding one, because gates at more advanced process nodes are significantly less expensive, as seen in Figure 47.

Figure 47: Cost per gate by process node (Kumar, 2008)

Second, an entire industry consisting of more than 200 companies has emerged providing hardware IP blocks that suppliers can purchase and integrate into their processors (McCanny, 2011). Hardware accelerators have their share of challenges.
Suppliers who develop accelerators themselves will need to acquire a great deal of knowledge about the end application to ensure the accelerator is a viable substitute for software that their customers typically develop themselves. Furthermore, accelerators can introduce new forms of complexity into both the architecture and the tool chain. For example, system developers may need to debug accelerator behavior alongside software, but in many cases accelerators are not connected to the hardware debug resources on the chip, and tool chains typically can't provide a unified view (Brutch, 2010). Furthermore, accelerators present themselves as black boxes to the developers using them, making them more challenging from a system debug perspective, since developers don't have visibility into the design and inner workings. As a result, the challenges that multi-core approaches present may lead companies to consider alternative approaches to achieving the required performance, and to do so in particular where they might not otherwise have done so, thereby limiting the adoption of multi-core processors in embedded systems.

Organizational Structure and Product Architecture

Conway's Law, named after computer programmer Melvin Conway and based on a paper he wrote in 1968, states that an organization will produce designs that mirror its communication structures (Conway, 1968). Henderson and Clark suggest that the structure of a dominant design will typically be mirrored in organizational structures (1990). This dynamic can be very powerful and is often underestimated by suppliers providing technologies (such as new multi-core architectures) that offer great benefits but also require an architectural change by the firm adopting them. There have been several examples of new technologies which offered major technological benefits but for which companies weren't organizationally in a position to leverage the technology.
From the author's experience, a highly relevant example is the Blackfin Embedded Processor from Analog Devices. This processor was co-developed by Analog Devices and Intel and launched in 2000. The intent of the design was to displace both a DSP and a microcontroller in an existing system by combining the salient attributes of these processors into a single processor architecture, named the 'MicroSignal Architecture'.68 Despite the cost and power benefits of removing a DSP and MCU from a design and combining these functions into a single Blackfin, we found that firms' existing organizational structures were the main impediment to adoption. First, these firms typically had two different teams developing the software for the DSP and the MCU; and second, these teams had very different skill sets. The team programming the DSP typically consisted of electrical engineers and mathematicians, while the team writing the control code on the MCU consisted of computer scientists. And while these two teams interacted, they were two separate teams. Requiring that these two teams architect a single piece of software that maintained the throughput and real-time requirements the DSP engineers needed and also supported the large OS the MCU engineers needed to run was difficult at several levels. As a result, several firms have adopted Blackfin as a DSP or an MCU, but firms wishing to consolidate an existing DSP+MCU architecture were often not in a position to capitalize on the benefits. The cost and power benefits of the Blackfin didn't warrant the cost and risk associated with reorganizing around a single processor architecture and then trying to meet all of the conflicting DSP and control system requirements.

68 http://electronicdesign.com/article/digital/company-alliance-develops-micro-signal-dsp-archite.aspx
Factors and Dynamics Surrounding a Firm's Ecosystem

The ecosystem surrounding a processor is extremely important to embedded developers. Within the ecosystem are the development tools, 3rd parties providing libraries and operating systems for that platform, contractors who are able to assist with product development, training firms who can help organizations build skills around new architectures, and many more. There may also be an informal network of other engineers developing products using a processor platform who provide support to each other on public forums.69 A strong ecosystem is critical to success for processor vendors because, without a development tool chain, for example, there would be no way to program the device. Without a network of 3rd parties providing libraries, firms adopting the processor would be forced to implement common functions from scratch. If operating systems hadn't been ported to that processor, firms would be required either to port an open source operating system themselves, hire a commercial OS company to port its OS to that platform, or develop some form of proprietary kernel themselves. Not surprisingly, 46% of embedded system engineers said that the quality of the ecosystem surrounding a processor was the most important attribute when selecting a processor, scoring higher than the processor itself (43%) and the chip supplier (11%) (UBM/EE-Times, 2011). Because an ecosystem consists of several different components that all serve different roles when developing a product with a processor, ecosystems can also suffer from a weakest link. For example, a processor may have a rich set of options for compilers and debuggers, but if that processor is targeting a market for which certain algorithms are highly standard and its vendor doesn't have a library supplier in the ecosystem, it will likely be at a large disadvantage.
69 Most popular processor architectures have one or more public discussion forums that are not affiliated with the silicon vendors or IP providers. For example, here is an ARM discussion group centered around programming ARM processors with GCC/GNU tools: http://embdev.net/forum/arm-gcc

A company like ARM has a massive ecosystem surrounding its processors. As of April 2011, ARM's website listed 655 partners within its ecosystem providing libraries, operating systems, development tools, design services, training and more.70 As a processor architecture becomes more popular, it becomes easier to attract participants to the ecosystem, because these potential ecosystem participants see larger and more secure opportunities to justify the investment in developing resources for that processor. According to ARM's annual report, 6.1 billion chips containing an ARM processor shipped in 2010 (ARM, 2010). However, in the author's experience, a company launching a new processor architecture may have a harder time attracting ecosystem players because the future success of that processor isn't yet clear. The supplier, therefore, may need to invest directly in 3rd parties to seed the ecosystem, or do the heavy lifting in house of creating the components an ecosystem would normally deliver. Some suppliers bringing a new processor to market may be able to leverage an existing ecosystem. For example, a supplier developing an ARM-based processor has an extremely large ecosystem that it can leverage directly. In other cases, firms may develop new architectures designed in a manner that allows them to at least capitalize on existing design patterns. A processor supplier developing a radical processor architecture that can't leverage any existing ecosystem will likely need to expend far more effort developing an ecosystem from scratch, either within the company or by directing funding to outside 3rd parties.
This is consistent with research on revolutionary technology introductions (Afuah, 2001). In general, delivering revolutionary products requires much greater funding, as the development costs are higher since less existing technology is being leveraged, and the return on investment will typically be much further out in the future (Golda, et al., 2007). Consider two entrants into the many-core embedded space: Stream Processors, Inc. (SPI) and Tilera. Both companies were founded in 2004 by MIT professors (former and current, respectively) and offered radical new approaches with theoretical computation throughputs well beyond what existing embedded processors and DSPs could offer. In 2007, SPI touted a 10x increase in cost/performance over existing DSPs with their Storm-1 family of processors.71 In 2007, Tilera claimed a 40x advantage over a mainstream high-end TI DSP on certain benchmarks with the TILE64 processor.72

SPI's initial Storm-1 family of processors, shown in Figure 48, utilized one MIPS processor as a system processor and a second MIPS processor to control a series of 16 parallel computation units, each with 5 computation sub-systems. SPI also provided a C compiler, tool chain and some libraries aligned to their target markets of video surveillance, video conferencing and multifunction printers.73

Figure 48: SPI's High-End SP16HP-G220 Processor Block Diagram

Tilera's initial offering was the TILE64 processor, which consisted of 64 proprietary cores arranged in a mesh configuration, as shown in Figure 49. Tilera was focused on similar markets (digital multimedia and networking)74 but took a far more general purpose approach to the architecture.

70 ARM's Partner Network: http://www.arm.com/community/partners/alI partners.php
71 http://www.eetimes.com/electronics-products/processors/4091602/Startup-touts-stream-processing-architecture-for-DSPs
Tilera supported 'unmodified' Linux via a 64-way SMP implementation back in 2007, around the time of the product's launch, and claimed that applications could be built for the TILE64 using Tilera's tool chain.75 While SPI initially supported Linux on the control processor, it required customers to work at a lower level and program the processing elements directly and explicitly.

72 http://arstechnica.com/hardware/news/2007/8/MIT-startu-raises-multicorebrwithnew-64-coreCPU.ars/2
73 SPI's website as of April 2008 - http://replay.web.archive.or/20080420122343/http://wwwstreamprocessrs.com/
74 Tilera's website as of April 2008 - http://replay.web.archive.org/200804 1221l4 042/http://www.tilera.coml/
75 http://www.mm4m.net/library/64 Core Processors.pdf

Figure 49: Tilera's TILE64 Processor Block Diagram

Since then, Tilera has put three products into mass production76 and has established a formidable ecosystem around its architectures, including 23 3rd parties listed on its site as of April 2011.77 They promote several design wins on their site.78 As of October 2010, the latest Linux kernel (2.6.36) now supports Tilera's TILE architecture.79 SPI went out of business in 2009.

In an interview with Anant Agarwal, MIT professor and co-founder/CTO of Tilera Corporation, Mr. Agarwal commented that providing a processor platform that not only delivers high performance but is also general purpose has been an important component of Tilera's continued success. By providing a more general purpose architecture that supports SMP Linux, standard tools, and standard multi-core APIs and libraries like pthreads, OpenMP and TBB, potential adopters are able to leverage existing knowledge, software and problem-solving techniques, easing the process of platform evaluation and adoption.
The general purpose nature of the architecture also means that it can be used to solve a more diverse set of problems across more markets, and potentially larger markets. He noted that the processors developed by other, now defunct, many-core suppliers were not general purpose, which required those firms to focus on specific markets, and this created certain challenges. For one, they were more tightly coupled to these market segments, and it was difficult to diversify into other segments. Furthermore, because of the lack of standard tools, libraries and operating systems, their customers typically required them to develop large portions of the applications, which placed additional support burdens on these suppliers (Agarwal, 2011).

76 http://www.tilera.com/about tilera/testimonials
77 http://www.tilera.com/partners
78 http://www.tilera.com/about tilera/testimonials
79 http://www.tilera.com/about tilera/press-releases/linux-kernel-now-supports-tile-architecture

As noted above, the strength of an ecosystem is generally more important to engineers than the chip itself, which ties back to several of the challenges highlighted in the previous sections of this thesis. Companies which introduce new architectures that present a radical departure from dominant design patterns may be required to develop a dedicated ecosystem around that architecture, which can be quite expensive. Alternatively, if processor vendors with a radical product concept are able to leverage an existing ecosystem and design pattern, they may be able to reduce the investment in their own ecosystem and appear as a lower risk to adopters, as they're leveraging existing design patterns that may be well understood within the firms adopting the technology.

Established vs Start-up Behavior

Well-established firms may wait until a revolutionary technology gains a foothold before making substantial investments (Golda, et al., 2007).
We now see evidence of several major companies, like Marvell, TI and Samsung, developing multi-core processors based on the ARM Cortex-A9 and the follow-on Cortex-A15, as that platform has become fairly dominant. Companies like Freescale are developing SMP processors based on their own core technology but which will be able to support the existing SMP ecosystem. This technology presents itself as more evolutionary than revolutionary, since existing applications can be more easily ported and then further tuned for performance later. However, we haven't yet seen major movements into more radical approaches from major suppliers.

Supplier Mortality

Another key adoption dynamic is start-up mortality. Start-ups pursuing radical technologies may require a larger amount of funding because the technology development will be more expensive and it will take longer to get a return on investment. Golda and Philippi describe the 'valley of death', shown in Figure 50, as a period within the first several years of a firm developing a radical technology in which the rate of spending is high but the rate of earnings is low. However, if the radical technology is widely adopted, the return on investment over time is significant.

Figure 50: "Valley of Death" for a revolutionary technology (Golda, et al., 2007)

Alternatively, suppliers developing more incremental or evolutionary products require a smaller investment but also may not see as large a potential return over the long run.

Figure 51: "Valley of Death" for an evolutionary technology (Golda, et al., 2007) (chart annotations: smaller, faster return; smaller investment)

The development costs for a fabless startup developing a new processor architecture are quite high. Jim Turley of Microprocessor Report offered several insightful observations on the business climate for fabless startups. He estimates that a fabless startup developing a new processor will need $100M to survive.
Half of this is for product development, which will typically take about 2 years from the first design efforts to working silicon. The other half is spent on securing those first critical design wins: marketing, sales, support, etc. Once the processor is ready, it typically takes a year to secure the first design win, and those willing to take a risk on a brand new technology from a brand new startup are typically start-ups themselves. Revenue won't start arriving until these customers develop their products and bring them to market, which could take months to years depending on the market those customers serve (Turley, 2010). To add insult to injury, once the first core is ready, development needs to begin on the next generation product, which will likely need to deliver higher performance and lower power at a lower cost.

There is a long list of startups in this space that have failed or have been acquired for assets. In recent history, Stream Processors, Inc. (SPI), SiCortex, Mathstar, Ambric, Rapport and QuickSilver have all vanished or been acquired for assets.80,81,82,83 Older companies doing concurrent processing that are no longer in business include Ardent, Convex, Encore, Floating Point Systems, Inmos, Kendall Square Research, MasPar, nCUBE, Sequent, Tandem, and Thinking Machines (Patterson, 2010).

80 http://www.eetimes.com/design/signal-processing-dsp/4017733/Analysis-Why-massively-parallel-chipvendors-failed
81 http://www.biziournals.com/saniose/stories/2009/11/02/daily124.html
82 http://www.hpcwire.com/hpcwire/2009-05-27/powered down sicortex to sell off assets of company.html

As a manager selecting suppliers, the perceived health and growth prospects of a new startup are likely an essential selection criterion. A new processor startup may have a technology that offers unparalleled benefits, which could provide significant end-product differentiation to the firm adopting it.
However, if the supplier goes bankrupt in 2 years, the adopting firm would be in a far worse position, as it would likely need to redesign its entire product. This is particularly important for more radical multi-core processor designs, because they typically require a higher degree of platform coupling: it's possible that migrating to a processor from another firm would be almost as much work as migrating to the radical processor in the first place.

Human / Cognitive Factors and Dynamics

There are several important and relevant cognitive and behavioral factors which impact the way we think and make decisions. This section will explore some of these factors and how they may be impacting the perception and adoption of multi-core processors.

Thinking "Concurrently" Is Fundamentally Different

There are two important fundamental challenges associated with developing applications for multi-core processors that we will explore. The first challenge is that thinking concurrently doesn't come naturally to people. The second challenge is that solving problems with sequential software is a process that is deeply ingrained in most programmers. Since the concept of the "Turing Machine",84 modern mainstream computer architectures have embodied the same basic paradigm: processors operate on a sequence of instructions that are arranged to solve a given task and use a local memory system to store instructions and state information.

83 http://www.eetimescom/'electronics-news/4155235/QuickSilver-closes-operations-shops-IP
84 In 1936, Alan Turing, a British cryptologist, published a paper about a conceptual mechanical device that would be capable of computing any mathematical expression. This conceptual machine consisted of three parts. First, there is an endless piece of tape onto which information can be stored and retrieved; because the tape is endless, it can hold an infinite amount of information. Second, there is a read/write head that can transfer information to and from the tape.
Finally, there is a control mechanism which performs a sequential set of functions on the information (Turing, 1936).

Lynn Stein refers to this as a 'computational metaphor' through which we approach problems: "The computational metaphor is an image of how computing works - or what computing is made of - that serves as the foundation for our understanding of all things computational." (Stein, 1996) This pattern or 'metaphor' of sequential problem solving is not unique to the computer science profession and may actually be more deeply ingrained in our education system and history. Some research shows that various forms of sequential thinking are taught in primary school education, and this approach to problem solving is used in areas like biology and information science (Stein, 1996).

As people develop knowledge about a specific problem-solving technique for a certain class of problems, they may not be able to consider all of the alternatives when faced with a new kind of problem (Henderson, et al., 1990). As Henri Bergson once said, "The eye sees only what the mind is prepared to comprehend" (quoted in Sterman, Business Dynamics, p. 24). Furthermore, the manner in which a problem is represented will impact the breadth, creativity and overall quality of the solutions provided by the problem solvers. As Nancy Leveson states in her paper on intent specification, "Not only does the language in which we specify problems have an effect on our problem-solving ability, it also affects the errors we make while solving those problems." (Leveson, 2000)

This implies, first, that we have a method of problem solving that is highly ingrained not only in the software industry but almost at a societal level; and second, that because this knowledge is so deeply ingrained, we may have a difficult time imagining alternative solutions when faced with a different class of problems.
It is no surprise that in the area where multi-core adoption is happening fastest (general purpose computing), a model (SMP) has been created that allows the existing thread paradigm to persist in a multi-core environment.

"Humans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations." (Sutter, et al., 2005)

Psychology of Gains, Losses & Endowment

John Gourville cites the work of Nobel Prize winning psychologist Daniel Kahneman and psychologist Amos Tversky showing that humans, when considering product adoption, weigh what they are going to lose by adopting that product far more heavily than what they stand to gain, an idea known as loss aversion. These studies showed that the gains must outweigh the losses by two to three times for people to find an offering attractive. This leads to what Gourville calls the "Endowment Effect", whereby people place higher value on the products they already have than on those they don't yet have (2006). He also notes that most people are completely unaware of their tendency to do this.

This has some potentially powerful implications for the adoption of multi-core processors. A firm's knowledge assets are part of its endowment, and whether the individuals making platform decisions are aware of this or not, multi-core products which require the destruction of a large amount of knowledge assets may have a deeper impact on the selection process than perhaps previously considered. This may imply that a firm able to offer a multi-core technology that preserves the endowment, even if the technology is less powerful, may be more successful than firms whose more powerful technology would be more destructive to the endowment of knowledge assets.
Variety & Choice Overload

Research by Sheena Iyengar shows that people find a greater number of choices debilitating and that, once they've made a choice, they are generally less satisfied (Iyengar, 2010). Meta-research by Benjamin Scheibehenne on the various psychological studies around choice suggests that it may not be the number of choices but rather the amount of information involved in the decision, and the decision maker's familiarity with the objects being selected between, that matters. People who are less familiar with the options may experience a greater amount of frustration after selecting (Mogilner, et al., 2008). As noted earlier, there is less mobility within the design space of multi-core processors than single-core processors. Additionally, there may be several potential configurations that could solve the problem, and given the difficulty of predicting the performance of multi-core systems, it is often difficult to know in advance which will succeed. Choice overload may play a role in the adoption of multi-core because there are many different choices, adopters have little familiarity with each, they are unable to accurately predict their success up front, and these adopters may be coupled to their decisions for some time.

9. Adoption Heuristics

Now that we have established categories of phenomena and examined the causal mechanisms that drive the observed pattern of adoption for multi-core processors in embedded systems, this final section of the thesis will bring these together and use this well-grounded theory to define a set of heuristics for adopting companies in the embedded systems domain and their suppliers, or would-be suppliers, of multi-core processors.

Heuristics for Adopters of Multi-core Embedded Processors

First, we present a set of heuristics which firms looking to adopt a multi-core embedded processor in a design may use to predict short and long term success with a multi-core processor platform.
The key considerations include the nature of the processing, the existence of legacy code, the competencies of the firm, and the supporting ecosystem.

Nature of the Processing

If the nature of the processing is more general purpose, meaning that the applications that will run on the system aren't specifically known at the time of development, an SMP approach will provide flexibility along with some amount of power and performance benefit. If the nature of the processing is more fixed-function, the design may require a greater degree of optimization against that function to be competitive. If the processing has hard real-time constraints, delivering hard real time with an SMP approach may require additional skills and time. If the function to be implemented is best done in software (heavy on control and conditional code, floating point math, etc.), a processor platform is likely ideal. However, if the function could be more efficiently implemented in hardware, FPGAs or acceleration should be considered. If significant portions of the processing tasks are based on industry standard functions (e.g., H.264 decode) and the end-product differentiation strategy is not based on delivering premium performance of this function, hardware acceleration may be more economical than implementing the functions in software. If the processing tasks are not standard functions, it is more likely that these will need to be developed in-house or contracted out.

Existence of Legacy Code

If the nature of the processing is more general purpose and POSIX-compliant multi-threaded legacy code exists, an SMP operating system running on an SMP processor may be optimal. However, if legacy code came from a 3rd party, it may not run properly in an SMP configuration and it may not be partitionable, particularly if source code isn't available.
If the nature of the processing is more fixed-function and legacy code exists, that code will need to be decomposed and partitioned to run on multiple cores. If an existing hardware architectural design pattern, like the ones covered earlier, exists that will facilitate porting, it can significantly ease the migration process. However, if a design pattern doesn't exist, a more significant amount of work may be required. If the legacy code was written in a modular fashion, it is possible that the existing decomposition will allow for clean segmentation across processing resources. This of course isn't guaranteed; it's equally possible that the decomposition strategy does not lend itself to concurrency at all and that application performance could decrease. However, if the architecture of the legacy software can be characterized as integral, a certain amount of work will be required to decompose the legacy code into a modular structure that can be partitioned across cores. Of course, there is a similar risk that the wrong decomposition strategy is chosen and the multi-core implementation doesn't deliver any incremental benefit over the existing implementation. If the application lends itself to being split across processing elements (data or instruction parallelism), an optimal software architecture may be clearer. However, if the opportunities for parallelism aren't immediately obvious, more work may be required to get incremental benefits over an existing implementation. To summarize, if existing POSIX-compliant code is available, rearchitecture may not be required, but changes are likely needed to protect against the issues that arise with threads and multi-core.

Competencies

If the firm has experience segmenting applications across multiple processors, there are likely architects within the firm who understand how to predictably develop, debug and deploy concurrent applications.
However, if no developers have experience developing concurrent applications, the likelihood of making incorrect architectural decisions, or of suffering through long debug periods while they learn how to track down new types of emergent bugs, will increase. If the firm doesn't have this expertise, it can hire experts or consultants to help; they may help the firm avoid the common pitfalls that companies going through the process for the first time encounter. However, if the firm isn't in a position to hire or contract, there may be a steep learning curve associated with the project and development could take longer than expected. If the multi-core processor relies on standard tools, the learning curve may be shallower for developers new to the architecture. If new tools need to be learned, existing development and debug skills that are coupled to the old toolchain may not be transferable.

Suppliers & Ecosystem

If there are multiple multi-core processor vendors whose products could potentially fit, the firm will have the option of selecting suppliers who appear to be in a better position financially and who have a greater likelihood of being around in the future. If only a few firms' products can meet the performance requirements, and those firms have been around for fewer than 5 years and are still pushing their first generation of silicon, they may not be around for much longer. However, if there are multi-core processor solutions from established firms, or even from recent startups who have delivered multiple generations of products and have multiple high-profile design wins, these vendors may be safer long-term bets. If the processors being considered have an established ecosystem (tools, operating systems, libraries, contractors, etc.), these pieces may be leveraged to expedite the design process.
If the ecosystem is nascent or nonexistent, the firm will need to develop more foundational pieces of the design from scratch and may not have access to system partitioning, tuning, debug and optimization tools, which may increase development time. The ecosystem players may also be new to multi-core. For example, a supplier may take an existing processor core which has a rich ecosystem and offer a dual-core version. In this case, the existing ecosystem players may not have the level of maturity in concurrent design to deliver reliable components.

Heuristics for Suppliers of Multi-core Embedded Processors

For suppliers determining whether or not to launch a multi-core processor, the following heuristics may predict the viability of a multi-core processor and can be used as a prescription to help ensure success for new processors.

Product Attributes

If the processor offers revolutionary levels of performance along key trajectories, companies may be willing to invest more in adopting the product. However, if the processor provides only incremental performance over competing products or existing solutions, the investment adopting firms will be willing to make will be lower. If the processor is easy to program, supports C programmability and existing software design patterns like SMP and standard multi-core libraries/APIs like pthreads, OpenMP and MCAPI, it is more likely to be compatible with existing designs and it will also be perceived as easier to develop with. However, if the processor requires the use of proprietary tools and libraries, it will represent a much larger investment for the adopting firm, as they will need to establish a greater deal of tacit knowledge around the product. They may lose platform mobility in the process.
If the processor is easy to evaluate, and a potential adopter can see performance increases without spending a great deal of time learning the architecture and porting code, they may have confidence that they'll be able to incrementally improve performance from there to reach their targets. This reduces the cost and risk associated with adoption. However, if a potential adopter needs to rearchitect their application just to run on the device, it may not be worth their time to even evaluate it.

Target Market

If the processor is designed to target one or a small number of specific market segments, it will likely be a more specialized implementation that can't be applied to problems in other markets. The vendor may need to be prepared to develop most of the applications themselves for the target customers, and the support load may be more significant. Furthermore, if those markets don't pan out, it may be more difficult to apply the same technology to other markets. If the processor is designed to be more general purpose, it will compete with other solutions (hardware acceleration, FPGAs, etc.) optimized for specific segments. It may be differentiated through flexibility and programmability, in which case SMP and an ecosystem of support for industry-accepted operating systems like Linux and tools like GNU/GDB/Eclipse may help.

Target Customers

If the developers in the target market have a history of partitioning algorithms across cores, they may have the skills necessary to develop a successful product. However, if the developers have largely been developing applications for single-core processors, they may need to do a great deal of work (and learning) to get their applications ported to this processor architecture. If this product will be compatible with the existing organizational structures and skill sets within the target customers, it will be an easier sell.
However, if this processor fundamentally changes the way adopters develop, deploy and support products in the field, it may represent an incredibly costly investment. For example, if firms traditionally used in-house ASIC teams and this represents a programmable solution, then in addition to developers, there may also be an army of field support people trained on the ASIC-based solution who would need to be entirely retrained to support a product built around a fundamentally different technology. If the potential customers are risk averse, it may be difficult to find a firm willing to take a chance on a fabless startup, particularly among adopters with long product life cycles. If the performance benefit isn't high and the switching costs aren't low, it will be difficult to get the first design win.

Architectural Compatibility

If this processor will require customers in the targeted segments to rearchitect their software to run on multiple cores, adoption will represent a significant investment for them.

Competitive Landscape

If the processor requires architectural and organizational changes along with several new knowledge assets, the benefit will need to be incredibly high to offset the costs of adoption. If the processor architecture is compatible with the existing product architecture, organization and skills, a lower incremental benefit may still be enough impetus for potential customers to switch.

Ecosystem

If the supplier has established an ecosystem around the product to deliver tools, libraries and design support, this may take risk out of the platform for potential adopters. However, if adopters will need to do more of the heavy lifting themselves, it will increase the cost and risk associated with adoption. Furthermore, if the ecosystem is based on proprietary tools, adopters may need to spend more time learning the new tools.
10. Conclusions

In this thesis, we have reviewed several concepts related to technology architecture and the dynamics of technology adoption. We have explored several concepts related to embedded systems, parallel computing and multi-core processors. We have examined architectural design scenarios related to multi-core adoption and have shown that the existence of design patterns can facilitate the adoption of multi-core embedded processors. We examined several other adoption mechanisms related to the management strategy and organization of the adopting firm, the ecosystem of the product, and cognitive and behavioral factors. Finally, we provided a set of heuristics that attempt to predict the level of difficulty an adopter of embedded multi-core processors will likely face, and a similar set of heuristics on the supply side for a semiconductor supplier bringing a new multi-core processor to market.

11. Appendix A: Libraries and Extensions Facilitating the Programming of Multi-core Processors

Intel Threading Building Blocks (TBB)

Intel Threading Building Blocks (TBB) is a library developed by Intel specifically designed for expressing parallelism in multi-core applications [85]. It is supported on Linux, QNX and FreeBSD as well as Windows, OS X and Solaris [86]. TBB allows software developers to work at an abstraction level above the platform architecture and threads, and offers greater scalability and performance.

OpenMP

OpenMP is another multiprocessing standard/API that consists of compiler directives (#pragmas) which are used to aid the compiler in parallelizing regions of code via threads (Gove, 2011). While pthreads can be used for coarse- and fine-grain parallelism, OpenMP is better suited to finer-grain parallelism (Freescale Semiconductor, 2009).
MPI and MCAPI

MPI (Message Passing Interface) is a standard library used for synchronously and asynchronously passing messages between processors that was originally developed for older homogeneous distributed multiprocessor systems (Marwedel, 2011). MCAPI is a similar but more modern standard that targets multi-core and more tightly coupled multiprocessor systems. It is designed to provide a low-latency interface which utilizes the interconnect technologies on modern homogeneous and heterogeneous multi-core systems (The Multi-core Association, 2011).

85 http://threadingbuildingblocks.org/
86 http://threadingbuildingblocks.org/file.php?fid=86

12. Bibliography

Afuah Allan Dynamic Boundaries of the Firm: Are Firms Better off Being Vertically Integrated in the Face of Technology Change [Journal] // The Academy of Management Journal. - 2001. - pp. 1211-1228.
Alexander Christopher A Pattern Language [Book]. - Oxford : Oxford University Press, 1977.
AnandTech Nehalem - Everything You Need to Know about Intel's New Architecture [Online] // AnandTech. - 11 03, 2008. - 04 20, 2011. - http://www.anandtech.com/show/2594/10.
ARM Holdings ARM [Online] // ARM Annual Report - Non-Financial KPIs. - 2010. - 04 23, 2011. - http://www.arm.com/annualreportl0/downloadcentre/PDF/ARM%20AR%2Ooverview.pdf.
Baldwin Carliss Young and Clark Kim B. Design Rules: The Power of Modularity [Book]. - Cambridge : The MIT Press, 2000.
BDTI Analysis: Why massively parallel chip vendors failed [Online] // EE Times. - UBM, 1 21, 2009. - 04 29, 2011. - http://www.eetimes.com/design/signal-processingdsp/4017733/Analysis-Why-massively-parallel-chip-vendors-failed.
Bergland G.D. A Guided Tour of Program Design Methodologies [Journal] // IEEE. - 1981. - pp. 13-35.
Bhujade Moreshwar Parallel Computing [Book]. - New Delhi : New Age International Limited, Publishers, 1995.
Brutch Tasneem Software Development Tools for Multi-core Systems [Webcast]. - [s.l.] : EE Times, 2010. - Vol. September 24th.
Christensen Clayton M and Raynor Michael E Why Hard-Nosed Executives Should Care About Management Theory [Article] // Harvard Business Review. - September 1, 2003.
Conway Melvin E How Do Committees Invent? [Article] // Datamation Magazine. - April 1968.
Crawley E ESD.34 - Lecture 6, IAP 2009 [Conference]. - 2009. - p. 39.
Culler David E, Singh Jaswinder Pal and Gupta Anoop Parallel Computer Architecture: A Hardware/Software Approach [Book]. - San Francisco : Morgan Kaufmann Publishers, 1999.
Databeans 2008 Microcontrollers - Semiconductor Product Markets - Worldwide [Report]. - [s.l.] : Databeans, 2008.
Dietrich Sven-Thorsten A Brief History of Real-Time Linux, Linux World 2006. - Raleigh : [s.n.], 2006.
Eick Stephen G. [et al.] Does Code Decay? Assessing the Evidence from Change Management Data [Article] // IEEE Transactions on Software Engineering, Vol. 27, No. 1. - Jan/Feb 2001. - pp. 1-12.
Emcore Magazine Test, test and test again [Article] // Emcore Magazine. - 09 2010. - pp. 16-17.
Feng WuChun Making a Case for Efficient Supercomputing [Journal] // Queue. - 2003. - pp. 54-64.
Frazer Rodney Reducing Power in Embedded Systems by Adding Hardware Accelerators [Online] // EE Times. - 04 09, 2008. - 04 20, 2011. - http://www.eetimes.com/design/embedded/4007550/Reducing-Power-in-Embedded-Systems-byAdding-Hardware-Accelerators/.
Freescale Semiconductor Embedded Multi-core: An Introduction [Online] // freescale.com. - 2009. - 01 01, 2011. - www.freescale.com/files/32bit/doc/ref_manual/EMBMCRM.pdf.
Gartner Research Gartner Says Worldwide PC Shipments to Increase 19 Percent in 2010 with Growth Slowing in Second Half of the Year [Online] // Gartner Newsroom. - August 31, 2010. - 04 11, 2011. - http://www.gartner.com/it/page.jsp?id=1429313.
Gentile Rick Processor Applications Engineering Manager [Interview]. - 04 19, 2011.
Golda Janice and Philippi Chris Managing New Technology Risk in the Supply Chain [Article] // Intel Technology Journal, Volume 11, Issue 2. - 2007.
- pp. 95-104.
Gourville John Understanding the Psychology of New-Product Adoption [Journal] // Harvard Business Review. - 2006. - pp. 99-106.
Gove Darryl Multi-core Application Programming for Windows, Linux and Oracle Solaris [Book]. - Boston : Addison-Wesley Professional, 2011.
Henderson Rebecca and Clark Kim Architectural Innovation: The Reconfiguration of Existing Product Technologies [Article] // Administrative Science Quarterly. - 1990. - pp. 9-30.
Hovsmith Skip Getting started with multi-core programming: Part 1 [Online] // EE Times. - UBM, 7 7, 2008. - 04 21, 2011. - http://www.eetimes.com/design/embedded/4007623/Gettingstarted-with-multi-core-programming-Part-1.
IDC Worldwide Server Market Rebounds Sharply in Fourth Quarter as Demand for Blades and x86 Systems Leads the Way, According to IDC [Online] // IDC. - 02 24, 2010. - 04 11, 2011. - http://www.idc.com/getdoc.jsp?containerId=prUS22224510.
Intel Intel IA-64 Architecture Software Developer's Manual [Article]. - January 2000. - pp. 6-21. - http://www.cs.umbc.edu/portal/help/architecture/24531701.pdf.
Iyengar Sheena The Art of Choosing [Book]. - New York : Hachette Book Group, 2010.
Kuhn Thomas The Structure of Scientific Revolutions (2nd ed.) [Book]. - Chicago : University of Chicago Press, 1970.
Kumar Rakesh Fabless Semiconductor Implementation [Book]. - [s.l.] : McGraw-Hill, 2008.
Kundojjala Sravan Baseband Vendors Will Take One-Third of the Smartphone Multi-Core Apps Processor Market in 2011 [Online] // strategyanalytics.com. - 01 19, 2011. - 03 19, 2011. - http://blogs.strategyanalytics.com/HCT/category/Handset-Component-Technologies.aspx.
Lee Edward A The Problem with Threads [Report]. - Berkeley : University of California at Berkeley, 2006.
Leveson Nancy Intent Specifications: An Approach to Building Human-Centered Specifications [Article] // IEEE Transactions on Software Engineering. - 1 2000. - pp. 15-35.
Leveson Nancy Software Engineering: A Look Back and A Path to the Future [Online].
- December 14, 1996. - 04 04, 2011. - http://sunnyday.mit.edu/16.355/leveson.pdf.
Levy Marcus The Adoption of Multi-core [Online] // National Instruments. - 06 02, 2008. - 01 01, 2011. - http://ni.adobeconnect.com/p77017465/?launcher=false&fcsContent-true&pbMode=normal.
MacCormack Alan, Rusnak John and Baldwin Carliss Exploring the Structure of Complex Software Designs: An Empirical Study of Open Source and Proprietary Code [Article] // Management Science. - 2006.
McCanny Jim Trust but Verify: Increasing IP Incorporation in 2011 SoC Design [Online] // Chip Design Magazine. - Spring 2011. - 04 20, 2011. - http://chipdesignmag.com/display.php?articleId=4800.
McKenney Paul E Is Parallel Programming Hard, And, If So, What Can You Do About It? [Book]. - Beaverton : IBM Linux Technology Center, 2010.
Mogilner Cassie, Rudnick Tamar and Iyengar Sheena The Mere Categorization Effect: How the Presence of Categories Increases Choosers' Perceptions of Assortment Variety and Outcome Satisfaction [Journal] // Journal of Consumer Research. - 2008. - pp. 202-215.
Moore Gordon E Cramming More Components onto Integrated Circuits [Journal] // Electronics. - 1965.
Moynihan Finbarr Marketing Director, MediaTek [Interview]. - Boston : [s.n.], 04 20, 2011.
Murmann Johann Peter and Frenken Koen Toward a Systematic Framework for Research on Dominant Designs, Technological Innovations, and Industrial Change [Article] // Research Policy. - 2006. - Vol. 35. - pp. 925-952.
Niebeck Bob MIT ESD.36 Guest Lecturer - Fall 2009. - 10 2009.
Norman D.A. Things That Make Us Smart [Book]. - [s.l.] : Addison Wesley Publishing Company, 1993.
Parnas D.L. On the Criteria To Be Used in Decomposing Systems into Modules [Journal] // Communications of the ACM. - 1972. - pp. 1053-1058.
Parsons David Object Oriented Programming with C++ [Book]. - New York : Continuum, 1994.
Patterson David The Trouble with Multicore [Article] // IEEE Spectrum. - July 2010. - pp. 28-32, 52-53.
Rajan Hridesh, Kautz Steven M.
and Rowcliffe Wayne Concurrency by Modularity: Design Patterns, a Case in Point [Conference] // Onward!. - Reno : [s.n.], 2010. - pp. 790-805.
Raymond Eric The Cathedral and the Bazaar [Online] // Eric S. Raymond's Home Page. - 08 02, 2002. - 04 20, 2011. - http://www.catb.org/-esr/writings/homesteading/cathedral-bazaar/.
Ritchie Dennis M The Development of the C Language [Online] // Bell Labs. - 2003. - 04 11, 2011. - http://cm.bell-labs.com/cm/cs/who/dmr/chist.html.
Sandia Labs More chip cores can mean slower supercomputing, Sandia simulation shows [Online] // Sandia Labs. - 1 12, 2009. - 04 21, 2011. - https://share.sandia.gov/news/resources/newsreleases/more-chip-cores-can-mean-slowersupercomputing-sandia-simulation-shows/.
Simon C.A. and Simon H.A. In Search of Insight [Journal] // Cognitive Psychology. - 1990. - Vol. 22.
Simon Herbert A The Architecture of Complexity [Journal] // Proceedings of the American Philosophical Society. - 1962. - Vol. 106. - pp. 467-482.
Starnes Tom Software Development Tools for Multicore Systems, Approaching Multicore Conference [Webcast]. - [s.l.] : EE Times, 2010. - Vol. September 24th.
Stein Lynn Challenging the Computational Metaphor: Implications for How We Think [Report]. - 1996.
Sterman John D. Business Dynamics: Systems Thinking and Modeling for a Complex World [Book]. - Cambridge : McGraw Hill, 2000.
Sutter Herb and Larus James Software and the concurrency revolution [Article] // ACM Queue. - September 2005. - Vol. 3, No. 7.
Tabirca Sabin Introduction to Parallel Computing [Online] // Department of Computer Science, University College Cork. - 09 06, 2003. - 04 15, 2011. - http://www.cs.ucc.ie/-stabirca/AM6011/llnlpp/index.htm.
Turing A. M.
On Computable Numbers, with an Application to the Entscheidungsproblem [Article] // Proc. London Math. Soc. - May 27, 1936. - pp. 230-264.
Turley Jim Editorial: How to Blow $100 Million [Online]. - 09 23, 2010. - 01 20, 2011. - http://www.mdronline.com/editorial/edit24_34.html.
UBM/EE Times Group 2010 Embedded Market Study [Report]. - [s.l.] : UBM/EE Times Group, 2010.
UBM/EE Times 2011 Embedded Market Study [Online] // EE Times. - 04 08, 2011. - 04 08, 2011. - http://www.eetimes.com/electrical-engineers/education-training/webinars/4214387/2011Embedded-Market-Study.
Uhrig Sascha Evaluation of Different Multithreaded and Multicore Processor Configurations for SoPC [Conference] // SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation. - Heidelberg : Springer-Verlag, 2009. - pp. 68-77.
Utterback James Mastering the Dynamics of Innovation [Book]. - Cambridge : Harvard Business School Press, 1994.
VDC Research Embedded Engineers' Experience with Multicore and/or Multiprocessing Designs [Online] // VDC Research. - 02 15, 2011. - 04 06, 2011. - http://blog.vdcresearch.com/embeddedsw/2011/02/embedded-engineers-experience-withmulticore-andor-multiprocessing-designs.html.
VDC Research Executive Brief: 2010 Embedded Processors, Global Market Demand Analysis [Online] // VDC Research. - 02 2011. - 04 11, 2011. - http://www.vdcresearch.com/_Documents/tracks/tlv lbrief-2637.pdf.
VDC Research 2010 Service Year Track 2: Embedded System Engineering Survey Data, Vol. 5: Processor Architecture Executive Brief [Online] // VDC Research. - 09 2010. - 04 02, 2011. - http://www.vdcresearch.com/_Documents/tracks/t2v5brief-2627.pdf.
Williamson Oliver E The Economic Institutions of Capitalism [Book]. - New York : The Free Press, 1985.