Technically Superior But Unloved: A Multi-Faceted Perspective on Multi-core's Failure
to Meet Expectations in Embedded Systems
by
Daniel Thomas Ledger
B.S. Electrical Engineering, Washington University in St. Louis, 1996
B.S. Computer Engineering, Washington University in St. Louis, 1997
SUBMITTED TO THE SYSTEM DESIGN AND MANAGEMENT PROGRAM IN PARTIAL
FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN ENGINEERING AND MANAGEMENT
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
JUNE 2011
©2011 Daniel Thomas Ledger. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.
Signature of Author:
Daniel Thomas Ledger
Fellow, System Design and Management Program
May 6, 2011
Certified By:
Michael A.M. Davies
Senior Lecturer, Engineering Systems Division and the Sloan School of Management
Thesis Supervisor
Accepted By:
Patrick Hale
Senior Lecturer, Engineering Systems Division
Director, System Design and Management Fellows Program
Technically Superior But Unloved: A Multi-Faceted Perspective on Multi-core's Failure to
Meet Expectations in Embedded Systems
By
Daniel Thomas Ledger
Submitted to the System Design and Management Program on May 6th, 2011 in Partial
Fulfillment of the Requirements for the Degree of Master of Science in Engineering and
Management
Abstract
A growing number of embedded multi-core processors from several vendors now offer
several technical advantages over single-core architectures. However, despite these advantages,
adoption of multi-core processors in embedded systems has fallen short of expectations and has not increased significantly in the last 3-4 years.
There are several technical challenges associated with multi-core adoption that have been
well studied and are commonly used to explain slow adoption. This thesis first examines these
technical challenges of multi-core adoption from an architectural perspective through the
examination of several design scenarios. We first find that the degree of the technical challenge
is highly related to the degree of architectural change required at the system level. When the
nature of adoption requires higher degrees of architectural change, adoption is most difficult due
to the destruction of existing product design and knowledge assets. However, where adopting
firms can leverage existing architectural design patterns to minimize architectural change,
adoption can be significantly easier.
In addition to the architectural challenges, this thesis also explores several other factors that
influence adoption related to management strategy, organization, ecosystem, and human
cognitive and behavioral tendencies. Finally, this thesis presents a set of heuristics for potential
adopters of multi-core technology to assess the suitability and risk of multi-core technology for
their firm's products, and a second set of heuristics for supplier firms developing or selling
multi-core processors to determine their likely success.
Thesis Supervisor: Michael A.M. Davies
Title: Senior Lecturer, Engineering Systems Division and Sloan School of Management
Page 2 of 106
Acknowledgements
I would like to offer my gratitude to the colleagues who have contributed to this thesis and
my degree at MIT. Thank you for your precious time, your ideas, and the great discussions your insights have been invaluable in shaping this thesis.
To the community of students and professors at MIT, thank you for an incredible experience
over the last 30 months. It's been a pleasure and an honor getting to know so many wonderful
and talented people.
To my thesis advisor, Michael Davies, thank you for the time, support and encouragement
over the last year. The knowledge and guidance you've provided as both a professor and a thesis
advisor have been so valuable.
To Pat Hale and the SDM team, thank you for creating and running such an incredible
program.
To my better half, Lauren, and our two young boys, Andrew and David - thank you for the
love, patience, support, compassion, understanding and help over the last 30 months. It goes
without saying that none of this would have been possible without you.
To my friends and extended family, thank you all for the love, support and tolerance.
To my employer, Analog Devices, thank you for the flexibility and support in allowing me to pursue this degree on a part-time basis.
Page 3 of 106
Table of Contents
Abstract ............ 2
Acknowledgements ............ 3
Table of Contents ............ 4
List of Figures ............ 8
1. Introduction and Motivation ............ 10
2. Theory Creation Methodology ............ 12
3. Architecture, Innovation and Dominant Designs ............ 15
   Structure & Architecture ............ 15
   Complexity and Complicatedness ............ 15
   Decomposition & Modularity ............ 16
   Hierarchy ............ 18
   Design Patterns ............ 19
   Dynamics of Technology Evolution and Innovation ............ 20
   Product Knowledge and Assets ............ 20
   Dominant Designs ............ 21
   Technology Innovation ............ 22
4. Embedded Systems ............ 25
   Embedded Systems and Embedded Processors ............ 25
   Embedded Operating Systems ............ 27
   System and Processor Diversity ............ 28
   More Limited Resources ............ 29
   Platform "Stickiness" ............ 29
   Product Life Cycle & Legacy ............ 30
   Real Time Constraints ............ 31
Page 4 of 106
   Platform Certification ............ 31
   Summary ............ 31
5. Forms of Parallelism & Concurrency ............ 33
   Granularity ............ 33
   Types of parallelism ............ 33
   Bit Level Parallelism ............ 34
   Instruction Level Parallelism (ILP) ............ 34
   Task Parallelism & Operating Systems ............ 36
   Summary ............ 40
6. Attributes of Multi-core Processors ............ 41
   Number of cores ............ 41
   Types of Cores - Homogeneous and Heterogeneous ............ 41
   Resource Sharing ............ 42
   Memory Architecture ............ 43
   Shared Memory ............ 45
   Distributed Memory ............ 47
   Hybrid Variants ............ 47
   Multi-Threading ............ 47
   Asymmetric Multiprocessing (AMP) and Symmetric Multiprocessing (SMP) ............ 48
   The Future of SMP ............ 50
7. Embedded Multi-core Processors Adoption ............ 52
   Benefits of Multi-core in Embedded Systems ............ 52
   Performance ............ 52
   Power Dissipation ............ 53
   Size / Density / Cost ............ 54
   Architectural Factors & Challenges ............ 55
Page 5 of 106
   Structure / Partitioning ............ 56
   Dynamics / Interactions ............ 61
   System Optimization and Debug ............ 64
   Adoption Scenarios ............ 66
   Current Multi-core Adoption Patterns ............ 67
   Case 1: Existing Multiprocessor Hardware Design Pattern ............ 67
   Case 2: Existing Software Design Pattern ............ 71
   Case 3: No Existing Patterns with Legacy Code ............ 74
   Case 4: No Existing Patterns with New Code ............ 74
   Summary of Adoption Scenarios ............ 75
8. System Level Factors and Challenges ............ 77
   Factors and Dynamics within an Adopting Firm ............ 77
   Factors and Dynamics Surrounding a Firm ............ 86
   Human / Cognitive Factors and Dynamics ............ 92
9. Adoption Heuristics ............ 95
   Heuristics for Adopters of Multi-core Embedded Processors ............ 95
   Nature of the processing ............ 95
   Existence of Legacy Code ............ 96
   ............ 96
   Suppliers & Ecosystem ............ 97
   Heuristics for Suppliers of Multi-core Embedded Processors ............ 97
   Competencies ............ 98
   Product Attributes ............ 98
   Target Market ............ 98
   Target Customers ............ 99
   Architectural Compatibility ............ 99
   Competitive Landscape ............ 99
Page 6 of 106
   Ecosystem ............ 99
10. Conclusions ............ 101
11. Appendix A: Libraries and Extensions Facilitating the Programming of Multi-core Processors ............ 102
12. Bibliography ............ 103
Page 7 of 106
List of Figures
Figure 1: Thesis structure ............ 11
Figure 2: Observation, Categorization, Formulation Pyramid ............ 14
Figure 3: Containment hierarchy example - Linux file structure ............ 19
Figure 4: Modular System Level Innovation, Radical Sub-System Level Innovation ............ 24
Figure 5: 2008 Microcontroller Revenue by Market (Databeans, 2008) ............ 26
Figure 6: OS market share for desktop computers ............ 27
Figure 7: Most commonly used operating systems in embedded systems ............ 28
Figure 8: 2008 Microprocessor Revenue by Processor Type (Databeans, 2008) ............ 34
Figure 9: Threads and Processes - Possible Configurations ............ 37
Figure 10: POSIX APIs ............ 40
Figure 11: Intel Media Processor CE 3100 Block Diagram ............ 42
Figure 12: Apple A5 dual-core ARM A9 ............ 42
Figure 13: Intel "Yorkfield" Quad-core MCM ............ 43
Figure 14: Memory access times (core clock cycles) for different levels of memory ............ 44
Figure 15: Memory Latency on ARM Cortex A9 ............ 45
Figure 16: Shared Memory Architecture ............ 46
Figure 17: Shared Memory Architecture with Cache Coherency ............ 46
Figure 18: Distributed Memory Architecture ............ 47
Figure 19: Single Threading vs. Super-threading vs. Simultaneous Multithreading ............ 48
Figure 20: AMP Configuration ............ 49
Figure 21: SMP Configuration ............ 49
Figure 22: Intel Processor Clock Speed by Year ............ 51
Figure 23: Power Consumption Comparison of single and dual core implementations of the Freescale MPC8641 ............ 53
Figure 24: Dynamic Current over Frequency for the ADSP-BF533 Embedded Processor ............ 54
Figure 25: Multi-core architecture challenges ............ 56
Figure 26: Cisco Telepresence System ............ 57
Figure 27: Performance scaling by number of threads and percentage of parallel code ............ 58
Figure 28: Multi-core performance vs. number of cores ............ 62
Figure 29: Performance impact as a function of cores (2-3 threads per core) ............ 62
Figure 30: Thread scaling with exaggerated synchronization overhead ............ 63
Page 8 of 106
Figure 31: Case 1, heterogeneous architecture ............ 67
Figure 32: Case 1, homogeneous architecture ............ 68
Figure 33: Case 1, no new resource sharing ............ 69
Figure 34: ADSP-14060 Quad processor DSP ............ 69
Figure 35: Case 2, new resource sharing ............ 70
Figure 36: C6474 Block Diagram ............ 70
Figure 37: Case 2, Symmetric Multiprocessing ............ 71
Figure 38: Migration from single-core to dual-core with SMP & POSIX ............ 72
Figure 39: Case 3, Migration to homogeneous and heterogeneous scenarios ............ 74
Figure 40: Case 4, New development to homogeneous SMP, homogeneous AMP or heterogeneous MC ............ 75
Figure 41: Multi-core Adoption Scenarios ............ 76
Figure 42: Layers of Adoption Factors & Dynamics ............ 77
Figure 43: WW Respondents Working with/Programming Multi-core and/or Multiprocessor Designs ............ 81
Figure 44: Xilinx Zynq-7000 Extensible Processing Platform ............ 82
Figure 45: Nvidia Tegra 2 Processor ............ 83
Figure 46: Performance and Power Gains with Hardware Acceleration (Frazer, 2008) ............ 84
Figure 47: Cost per gate by process node (Kumar, 2008) ............ 84
Figure 48: SPI's High End SP16HP-G220 Processor Block Diagram ............ 88
Figure 49: Tilera's TILE64 Processor Block Diagram ............ 89
Figure 50: "Valley of Death" for a revolutionary technology (Golda, et al., 2007) ............ 91
Figure 51: "Valley of Death" for an evolutionary technology (Golda, et al., 2007) ............ 91
Page 9 of 106
1. Introduction and Motivation
A growing number of embedded multi-core processors from several vendors now offer
several technical advantages over single-core architectures. Multi-core processors offer
increased computational density - a quad-core processor has 4x the theoretical computational power of a single-core version of that device. Multi-core processors may use less power to accomplish a similar task - having two cores running at a slower clock speed can be more power efficient than a single core running at a higher clock speed¹. Multi-core processors are smaller
and typically less expensive than multiple single-core devices - it's possible to migrate an
existing design that used multiple discrete single-core devices into a single multi-core device, for
example. For some applications, multi-core provides increased reliability by reducing the
number of discrete parts.
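This power advantage follows from the classic CMOS dynamic power relation, P ≈ C·V²·f: halving the clock frequency often allows a lower supply voltage, and because power scales with the square of voltage, two slower cores can dissipate less total power than one fast core with equivalent aggregate throughput. The sketch below illustrates the arithmetic only; the capacitance, voltage and frequency values are made up for illustration and do not describe any particular device:

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """Classic CMOS dynamic power model: P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz

# Hypothetical single core at 1.0 GHz, requiring a 1.2 V supply.
single_core_w = dynamic_power(1e-9, 1.2, 1.0e9)

# Two hypothetical cores at 500 MHz each; the reduced clock rate
# permits a lower supply voltage (say 1.0 V), and V enters squared.
dual_core_w = 2 * dynamic_power(1e-9, 1.0, 0.5e9)

# Same theoretical aggregate throughput, lower total dynamic power.
assert dual_core_w < single_core_w
```

Real devices add static leakage, shared-resource contention and imperfect parallel scaling, so the realized benefit depends heavily on the workload.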
Despite the numerous advantages that multi-core architectures offer, developing a product
using a multi-core processor architecture is challenging. Over the years, there has been a great
deal of research aimed at studying the technical challenges of multi-core and the concomitant challenge of concurrent programming, and at proposing new ways to approach them. In the course
of researching this thesis, I came across several papers, articles, blog posts, and forum threads
describing the difficulties associated with concurrent programming and multi-core architectures.
Adoption of multi-core processors in embedded systems has not increased significantly in the last 3-4 years. In the embedded space, among systems that use multiple processors, multi-core processor usage increased only 6% between 2007 and 2010 (UBM/EE Times Group, 2010).
This thesis first explores these technical challenges of multi-core adoption from an
architectural perspective through the examination of several design scenarios. We first find that
the degree of the technical challenge is highly related to the degree of architectural change
required at the system level. When the nature of adoption requires higher degrees of
architectural change, adoption is most difficult due to the destruction of existing product design
and knowledge assets. However, where adopting firms can leverage existing architectural design
patterns to minimize architectural change, adoption can be significantly easier and hence faster.
This thesis also explores several other factors that affect adoption mechanisms at the
organizational, managerial, value chain and cognitive levels.
¹ See Figure 23 on page 49
Page 10 of 106
This thesis is arranged in five parts as shown in Figure 1. Chapter 2 provides a high level
framework that is used in developing theories about multi-core adoption. Chapter 3 then
introduces several concepts related to system architecture and technology innovation that will be
used throughout the thesis. Chapters 4, 5 and 6 provide contextual information on the nature of embedded systems, embedded processors, parallel architectures, and multi-core processors. Chapters 7 and 8 explore multi-core adoption patterns, categories and causal mechanisms at the architectural level as well as those related to management strategy, organization, ecosystem/value chain, and human behavior and cognition.
Finally, Chapter 9 proposes two sets of heuristics. The first set of heuristics predicts the likelihood of success of a firm adopting a multi-core processor in a product design on the
demand side. The second set of heuristics characterizes successful multi-core product offerings
on the supply side.
[Figure: layered thesis structure - Categorization; Causal Mechanisms: Architectural; Mgmt, Org, Value Chain, Cognitive]
Figure 1: Thesis structure
Page 11 of 106
2. Theory Creation Methodology
Multi-core processors offer several technical advantages over single core processors yet
despite these advantages, adoption in the embedded space has been slow and has fallen short of
expectations. The goal of this thesis is to establish a set of theories to explain this anomaly and
use these theories to predict successful adoption patterns for multi-core processors. We will do
this by first categorizing and studying the various causal mechanisms that lead to adoption and
then developing a set of heuristics which can be used to predict adoption patterns for adopters
and prescribe successful strategies for suppliers.
There is a great deal of literature and commentary around the challenges of multi-core
development and it is tempting to conclude that multi-core adoption is happening slowly just
because it is hard. However, multi-core has become established in several types of embedded
applications like wireless infrastructure and networking equipment. And despite slow adoption
in general, multi-core processors are being rapidly adopted in some specific segments such as
smart phones and tablets.
These phenomena are also anomalies that warrant a robust explanation: that is, a well-grounded theory which explains the underlying causal mechanisms that have led to these observations about the adoption of multi-core in embedded systems.
Clayton Christensen notes that several management books today present management
theories which prescribe a series of actions because those actions have led to certain results for some firms in the past. However, these texts often fail to present how and why the actions lead to the desirable results. They highlight a correlation between an action and a result without understanding and presenting the causal mechanism that connects the two. Attempting to repeat an action that correlates with desirable results, and expecting the same results, can be a very disappointing exercise (Christensen, et al., 2003).
This tendency to rely on correlation as a comfortable substitute for causality is deeply
ingrained in our thinking and behavior. Our minds establish mental models of our complex
surroundings based primarily on observable causal relationships in our environments.
Unbeknownst to us, our minds will seamlessly default to causal relationships as a means to
explain phenomena and predict future occurrences (Sterman, 2000).
Page 12 of 106
There has been a great deal of research on this subject. The field of system dynamics in
particular focuses heavily on the limitations of the mental models we create to explain the world
around us. John D. Sterman cites several important pieces of research in his book, Business
Dynamics. He cites work by Axelrod, Hall, Dörner, Plous and others that suggests the following limitations. The concept of feedback is often absent from our mental models/cognitive maps (Axelrod). We tend to think in "single strand causal series" and find it difficult to comprehend systems with feedback loops and multiple causal pathways (Dörner). Furthermore, we tend to think that each effect comes from a single cause (Plous). Finally, people have difficulty understanding the relationships between phenomena when random error, nonlinearity and negative correlation are involved (Brehmer). Sterman also notes that we have a very short term memory for cause and
effect and when events are separated by more than a few minutes, it's often very difficult for us
to associate them.
We struggle with complex systems and the dynamics of those systems and we, to the best of
our abilities, will attempt to use correlation to explain behaviors of systems simply because that's
how our minds are designed to work.
When it comes to forming theories, another important tendency we have as humans (that
we're also typically unaware of) is our tendency to filter information based on preexisting
beliefs. An existing established paradigm or belief may suppress our ability to perceive data that
is inconsistent with this existing paradigm which limits our abilities to see new paradigms
emerging (Kuhn, 1970). Once we believe the world is a certain way, we cannot easily see
evidence that suggests differently.
So it's only natural that we as humans are often satisfied with a correlation between events to
explain causation because our minds make it feel so convincing. Yet as we deal with
increasingly complex systems, these cognitive limitations can lead us into some very misguided
beliefs that we may later have a difficult time parting with.
With respect to multi-core, there is currently a correlation between the fact that developing products with multi-core processors is technically challenging and the fact that adoption rates are generally low. However, this correlation alone doesn't explain why multi-core isn't being adopted because in many cases it has been adopted, and in other cases it is being adopted rapidly.
A good theory is a statement which explains how and why certain actions can lead to certain results - such theories help explain what is happening in the present and also what will likely happen in
Page 13 of 106
the future (Christensen, et al., 2003). Using Christensen's framework as presented in Figure 2
below, we will first attempt to categorize the anomalies and identify causal mechanisms not only
as they pertain to the technology itself but also to the surrounding layers like the management
strategy, organization of the adopting firm, the structure and dynamics of the ecosystem
surrounding the technology and cognitive factors that contribute to the adoption process as well.
[Figure: pyramid - base: observation and description of the phenomenon (anomaly); middle: categorization; top: formation of a theory, a statement of what causes what and why; with prediction and confirmation linking the levels]
Figure 2: Observation, Categorization, Formulation Pyramid²
From these causal mechanisms, we will present a set of heuristics which can be used to
predict the likelihood of a successful adoption of multi-core processors by adopting firms and
also prescribe strategies that can predict success for suppliers of multi-core processors.
The process of categorization of the phenomena and identification of causal mechanisms that
will be used later in this thesis relies on several concepts related to the structure and architecture
of systems and the dynamics behind technology evolution and innovation that will be explored in
the next section. From here, we will explore several important topics related to the unique
characteristics of embedded systems, forms of parallelism in processors, and the key attributes of multi-core processors that will be used as we categorize the adoption patterns of multi-core
processors in embedded systems.
² (Christensen, et al., 2003)
Page 14 of 106
3. Architecture, Innovation and Dominant Designs
The adoption of multi-core processors can require changes to an end-product design at both the component and the architectural level. There are several key concepts related to the
product architecture and the dynamics of innovation and adoption that will be used throughout
this thesis which we will present in this section.
Structure & Architecture
Programming multi-core processors is centered on a paradigm of breaking problems into
smaller pieces so the work can be distributed across computing elements and reliably processed
concurrently. This involves determining how to take large pieces of complex software and partition them across multiple cores in a way that they run reliably and with higher performance.
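As a minimal sketch of this decompose-and-distribute paradigm (written here in Python with the standard multiprocessing module purely for illustration; an embedded design would more likely use an RTOS tasking API or POSIX threads), the problem is split into independent chunks, each chunk is processed by a separate worker, and the partial results are recombined:

```python
from multiprocessing import Pool

def partial_sum_of_squares(chunk):
    # Each worker operates only on its own slice: no shared state,
    # so no locks or synchronization are needed for this simple case.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    # Partition the problem into roughly one chunk per worker...
    size = max(1, (len(data) + n_workers - 1) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # ...fan the chunks out across processes, then reduce the results.
    with Pool(n_workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))

if __name__ == "__main__":
    data = list(range(1000))
    assert parallel_sum_of_squares(data) == sum(x * x for x in data)
```

Real workloads are rarely this tidy: most have shared state and ordering constraints, and deciding where to draw the partition boundaries is exactly the architectural question explored in later chapters.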
We are dealing with software systems which, unlike physical systems, are capable of virtually unbounded complexity; thus the management of complexity plays a key role in the process of using multi-core processors. Modularity and hierarchy also play an important role in managing complicatedness and complexity; they are valuable tools for decomposing larger systems into smaller pieces. As a result, we will first explore the relevant concepts of complexity, complicatedness, modularity and hierarchy.
The end-products tend to be incremental in nature and reuse from existing designs is very
common. As design patterns borrowed from existing designs or elsewhere in industry can
facilitate adoption of multi-core, we will also explore the concept of design patterns.
Complexity and Complicatedness
Complexity is a quantifiable attribute of an architecture that describes the number of
elements in the architecture, the degree of interconnectedness between those elements and, by
some definitions, the level of dynamics present in the architecture. Complexity is an important
attribute of any architecture that needs to be understood, developed or managed by a human,
because our capability as human beings to manage complexity is limited and, unfortunately,
not evolving fast enough to keep up with the complexity of the systems we are
designing.
As we create larger systems using larger teams and interconnect these systems with an
increasing number of linkages, our ability to comprehend, develop, predict and manage the
behaviors of these systems is becoming increasingly limited. Thus we need to rely on tools and
methodologies to assist in the design of systems whose complexity surpasses the capabilities of a
single human being (Baldwin, et al., 2000). In the immortal words of Professor Ed Crawley,
"Complexity is neither good nor bad, but must be managed." (2009)
Complexity is especially critical in the study of software and embedded systems because
unlike physical systems which are bound by the laws of nature, software systems are practically
unbounded and thus have the potential for almost unlimited complexity in comparison to the
physical systems we create (Leveson, 1996). And, not only do these systems themselves become
more complex but the human organizational system responsible for coordinating the
development of these systems must also become more complex (Baldwin, et al., 2000).
Complicatedness is a closely related system attribute that is sometimes used interchangeably
with complexity but is distinct in both definition and importance in the scope of architecture.
Complicatedness is a qualitative metric that refers to the difficulty humans have comprehending
systems, and this attribute is inherently more subjective. A key challenge for a system architect is
to manage the evolution of a system's complexity in a manner in which a complex system doesn't
appear to be complicated (Crawley, 2009).
Decomposition & Modularity
A module is a collection of elements in a system that have been grouped by a common intent
in a manner which minimizes interaction with other modules.
Modular design is typically a top-down process whereby a larger system is decomposed into
smaller modules. Modular design starts with a high level system design which is then
recursively decomposed into smaller modules until the complicatedness of a single module can
be comprehended by a single individual, and its complexity can be isolated and hidden through a
simple interface design through which it connects to adjacent modules (Baldwin, et al., 2000).
Modularity is an important principle for designing, developing and maintaining complex
systems because it both improves the comprehensibility and reduces the complicatedness of a
system. In the domain of software, decomposing systems into modules and providing clean
abstraction layers via interfaces has become a standard practice today because it helps us deal
with the relatively unbounded levels of complexity.
Modularity also provides a means for multiple people to work on different parts of the
system simultaneously. People don't need to understand the whole system; rather they need to
understand how their module operates and how it must interface with the components around it.
"If an artifact can be divided into separate parts, and the parts worked on by
different people, the 'one person' limitation on complexity disappears. But this
implies that techniques for dividing effort and knowledge are fundamental to the
creation of highly complex manmade things." (Baldwin, et al., 2000)
Fundamental to this capability is the concept of information hiding. Originally proposed by
David Parnas (1972), information hiding is the practice of hiding certain functions and
information within the module. As long as the module's interfaces are preserved, changes to the
hidden functions and information don't impact other modules in the system. This is an essential
attribute if multiple developers are to simultaneously work on multiple modules within the
system.
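A minimal C sketch of information hiding in Parnas's sense, using a hypothetical event-counter module invented for illustration; the counter variable and its saturation policy are hidden behind two interface functions, so the representation can change without touching any other module:

```c
#include <stdint.h>

#define COUNTER_MAX 255u      /* hidden implementation limit */

static uint8_t event_count;   /* hidden state: `static` gives it internal
                               * linkage, invisible to other modules */

/* Public interface: record one event, saturating at COUNTER_MAX. */
void counter_record_event(void)
{
    if (event_count < COUNTER_MAX)
        event_count++;
}

/* Public interface: read the current count. */
uint32_t counter_read(void)
{
    return event_count;
}
```

As long as these two function signatures are preserved, the hidden width and saturation behavior can be reworked freely, which is exactly the property that lets multiple developers change modules in parallel.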
A fantastic modern example of a large-scale, highly modular product design is the Linux
operating system. Alan MacCormack's research on Linux shows that more distributed design
teams must rely on modular design because the level of communication between the engineers is
significantly lower. By clearly defining modules and interfaces, portions of a distributed team
can work within a module without needing to understand how the whole system works. This
becomes particularly important as the size and complexity of the software increase. Without a
modular design, it would be extremely difficult for developers to learn how the whole system
works and coordinating the development would be close to impossible (MacCormack, et al.,
2006).
In order to decompose a system into modules, that system must first be decomposable. The
decomposability of a system is the extent to which it can be iteratively decomposed in a manner
in which high-frequency interactions occur within a sub-system and low-frequency interactions
occur between subsystems (Simon, 1962). The attribute of decomposability is central to our
discussion of multi-core systems because, in the process of migrating from a single-core to a
multi-core design, we must decompose the system into smaller pieces that can be distributed to
these cores.
If a system is decomposable, a key challenge is developing the right strategy to decompose
the system. For tangible objects that are more naturally bounded in their complexity, there may
be intuitive points of modularization. However, for software systems which may consist of
several elements in a highly interconnected configuration, the boundaries may not always be
clear (Baldwin, et al., 2000).
There are several decomposition strategies that are useful for different types of systems,
particularly in software which is significantly less bounded than the physical domain. Systems
can be decomposed by breaking large system functions down into smaller steps. This is called
functional decomposition, one of the oldest, most common and most widely taught software
architecture methodologies (Bergland, 1981). There are
also very different approaches, such as decomposing a system based on interactions, which has been
promoted as a potentially better fit for multi-core processor architectures (Stein,
1996).
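Functional decomposition can be shown in miniature: one large "process a sample" function broken into small named steps, each a candidate module boundary. The routine and its steps are hypothetical, invented for the sketch:

```c
/* Hypothetical signal-conditioning routine decomposed by function:
 * each step does one thing and could become its own module. */
static int remove_dc_offset(int sample) { return sample - 2; }
static int amplify(int sample)          { return sample * 3; }
static int clamp(int sample)            { return sample > 100 ? 100 : sample; }

/* The "large system function", expressed as a sequence of small steps. */
int process_sample(int raw)
{
    int s = remove_dc_offset(raw);
    s = amplify(s);
    return clamp(s);
}
```

Note that this style of decomposition yields sequential steps; it is one reason, as discussed later, that functionally decomposed software does not automatically map onto parallel cores.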
Decisions in the decomposition and modularization process need to be made very carefully
as the grouping of elements and definition of interfaces will likely exist for the entire lifespan of
the product. As Bob Niebeck of Boeing noted about architecting aircraft, "as soon as you make
something common, you're living with it for the life of the plane." (2009)
Hierarchy
The hierarchy of a structure typically describes how components in the structure are
associated. An organizational hierarchy of a company, for example, is used to describe a
ranking of individuals and whom they are subordinate to. A containment hierarchy describes
which components of a system contain other components. The Linux file system hierarchy
shown in Figure 3 is an example of a containment hierarchy - certain directories contain other
directories which in turn contain more directories.
[Figure 3 image omitted: tree of nested Linux directories]
Figure 3: Containment hierarchy example - Linux file structure
A compositional hierarchy recursively describes the compositional structure of a component
in terms of the sub-components it consists of (Parsons, 1994). This is very similar to the
hierarchic system concept that Herbert Simon proposed: one that is composed of interrelated
sub-systems which are in turn composed of smaller, interrelated sub-systems (Simon, 1962).
The terms component and sub-system are synonymous with each other and, in many ways,
with the concept of module described earlier, as a module can be composed of
several smaller modules.
Computing systems are often organized as compositional hierarchies in both their hardware
design and their software design. In a hardware design, the system may be a Printed Circuit Board
(PCB) containing several Integrated Circuits (ICs). Each of these integrated
circuits can itself be a complex sub-system, such as a processor or field programmable gate
array (FPGA), that is in turn composed of several further interrelated sub-systems. In the
software domain, large applications consist of smaller sub-systems. For example, a word
processing application may consist of a sub-system to manage the display and user interface, one
to manage spell checking, one to manage file storage and retrieval, etc.
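As a rough illustration, a compositional hierarchy maps naturally onto nested data structures. The board/chip/core names and counts below are invented for the sketch:

```c
#include <stddef.h>

/* A three-level compositional hierarchy: a board is composed of chips,
 * and each chip is composed of cores. */
struct core  { unsigned clock_mhz; };
struct chip  { struct core cores[2]; };
struct board { struct chip chips[3]; };

/* Walk the hierarchy: count every core the board contains. */
size_t board_core_count(const struct board *b)
{
    size_t nchips = sizeof b->chips / sizeof b->chips[0];
    size_t ncores = sizeof b->chips[0].cores / sizeof b->chips[0].cores[0];
    return nchips * ncores;   /* 3 chips x 2 cores each */
}
```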
Design Patterns
The concept of design patterns originated in Christopher Alexander's studies of cities and
architecture. Alexander defined a design pattern as the "core of the solution" to a recurring
3 http://www-uxsup.csx.cam.ac.uk/pub/doc/suse/suse9.3/suseIinux-useryuide en/cha.shell.html
problem, which allows the solution to be applied universally without "ever doing it the same way
twice" (Alexander, 1977). He prescribes that the solutions present in these design patterns be
presented in a general and abstract manner that allows them to be easily applied to problems.
The use of design patterns has become popular in the last few decades amongst programmers
for obvious reasons: many programmers are solving similar problems in very different contexts.
The availability of a proven pattern that can be adapted to the problem at hand has clear merit.
As we will see later in this thesis, architectural design patterns can play a key role in easing
the difficulties associated with multi-core adoption and development.
Dynamics of Technology Evolution and Innovation
Firms developing products based around processors accumulate a great deal of knowledge
and problem solving skills that become key assets for ensuring deterministic product
development schedules. Multi-core processors in particular require specific forms of knowledge
and problem solving skills that existing firms developing single-core and even multiprocessor
designs may not posses. The concept of knowledge and asset specificity to a product is the first
topic that will be addressed in this section.
Dominant designs exist in several forms within computing systems and can reduce the degree
to which new knowledge assets need to be established when moving between technologies, like
operating systems. This section will present the concept of dominant designs which will be built
upon later in the thesis.
Finally, the nature of innovation is essential to this discussion as different types of innovation
have different implications for the firms adopting the innovation. The technology innovation
framework developed by Henderson and Clark is particularly useful in the study of multi-core
processor adoption and will be used throughout this thesis. It is the third concept presented in
this section.
Product Knowledge and Assets
As a firm develops a product, it first builds and then leverages assets (knowledge and
physical) around the product (Williamson, 1985).
Product knowledge can be categorized as an understanding of the individual components of a
system, referred to as component knowledge, and an understanding of how those components
interact to create a desired function, referred to as architectural knowledge (Henderson, et al.,
1990). In the context of a system containing hardware and software, an example of component
knowledge could be an understanding of a certain software module like a TCP/IP stack.
Architectural knowledge could be the way to balance memory transactions on an external
memory bus to maximize the performance of the TCP/IP code. In this sense, we have an
interaction between the set of instructions executing the TCP/IP stack and the external memory
interface of the processor.
Another key type of knowledge associated with a product design is the set of strategies used
for problem solving (Williamson, 1985). Engineers develop knowledge from solving specific
problems on previous projects that they can apply to new problems they encounter. This
knowledge is beneficial when they encounter similar problems in the future. However, this
knowledge can also be detrimental - when engineers encounter new problems, they may fall
back on problem solving strategies from older problems rather than considering all of the
alternative (and potentially more suitable) problem solving strategies for a new problem
(Henderson, et al., 1990). This concept is particularly important to this thesis because concurrent
programming represents a very different kind of problem yet we see a pattern (to be discussed
later in this document) in which existing solutions are being applied.
Dominant Designs
Several authors have described a common cyclical product innovation pattern that can be
seen across many types of technologies whereby a new technology is introduced that offers the
promise of one or more benefits across certain technology parameters. Murmanna and Frenken's
meta-study on dominant design provides a broad overview of the various areas of research
around dominant designs (Murmanna, et al., 2006). In the early phases of a new technology, the
industry is heavily experimenting with different product design concepts and developing
knowledge. Eventually, a product architecture emerges that the industry gels around, which is
then widely adopted and changes the nature of the competitive landscape. At this point,
companies shift from rapid experimentation to cost reduction around the dominant design, a
transition that fundamentally changes the nature of competition (Utterback, 1994).
Dominant designs can be seen in several forms within computing systems. What is
interesting is that we see dominant designs at various levels of the compositional hierarchy. For
example, we have arguably seen the emergence of a dominant design in the single core
embedded processor space. Despite the fact that processor architectures vary, the design concept
between processors is fundamentally the same - a series of instructions is processed sequentially
and a memory system holds program state information. The process through which instructions
are developed has become highly standardized and the interfaces through which these processors
connect to other components in the larger system have also become highly standardized.
Dominant designs have emerged for other system components as well. For example, DRAMs
have become a price driven commodity business built around a standard architecture 4 . We also
see dominant designs emerging for software components. For example, the multi-threaded
operating system dates back several years, and there are several types of multi-threaded
operating systems in wide use today (Linux, Microsoft Windows, OS-X and iOS, VxWorks,
etc.), but the design concept of how threads work has stabilized and processor architectures
have evolved to accommodate this software design. The emergence of dominant designs in
processor and operating system architecture has enabled programmers to migrate software
designs between processors and operating systems without having to completely rewrite them
from scratch to accommodate a radically different processor and/or operating system design.
However, as we will see, multi-core processors represent a departure from these dominant
designs.
Technology Innovation
The characterization of different forms of innovation as they apply to a product's architecture
is particularly important with respect to embedded processor technology and the larger systems
they're a part of. Henderson and Clark (1990) provide a very useful framework for
characterizing innovation that can occur within a product or system.
Incremental Innovation refers to innovation that improves components within a design but
leaves the architecture intact. This type of innovation fortifies a firm's component and
architectural knowledge. We have seen this type of innovation for years in the processor
industry in the form of increasing processor clock speeds. The interactions between
components and the processor architecture remain the same, and thus existing knowledge is
preserved, as are other assets like software developed for that processor.
4 DRAMeXchange provides contract pricing for DRAMs from several suppliers. http://www.dramexchan-e.com/
Modular Innovation refers to innovation that changes a core design concept of a module but
preserves the architecture. Modular innovation destroys component knowledge related to the
component where the innovation has occurred but it preserves architectural knowledge about
how the components link together. An example would be a company migrating from a MIPS
based processor to an ARM based processor. These processors have similar interfaces and likely
support the same native language (e.g., C or C++) and operating system (Linux, VxWorks, etc.).
The processor will execute software and interact with peripherals in a similar manner. However,
an organization will now need to learn the ARM core. Most of the knowledge specific to the
MIPS core they had become familiar with is no longer useful. Component knowledge goes
beyond just the device in this case and also covers the development tools used to program the
device and debugging/problem solving techniques that may be specific to that product.
Architectural Innovation occurs when the design concept of the components is preserved but the
interaction between the components changes. Architectural innovation preserves component
knowledge and destroys architectural knowledge. It is often caused by a change in a
component that results in new interactions.
Radical Innovation occurs when both the design concepts of the components and the linkage
between them are overturned. This form of innovation destroys both component and
architectural knowledge. Radical innovation typically occurs before the formation of a new
dominant design.
A key limitation of Henderson and Clark's four forms of innovation is that they fail to
incorporate the concept of compositional hierarchy (Murmanna, et al., 2006). For example,
radical innovation within the scope of a sub-system may manifest itself as a modular innovation
at the system level as shown below in Figure 4 if the subsystem interfaces are preserved and the
nature of interactions at the system level doesn't change. This is a particularly important point
within the scope of this thesis because a processor typically represents a sub-system within an
embedded system. A move to multi-core can be considered a radical innovation within the scope
of that subsystem. However, if the subsystem's external system interfaces are preserved, this
radical innovation at the subsystem level is presented as a modular innovation at the system
level.
[Figure 4 image omitted: modular innovation at the system level; radical innovation at the subsystem level (within A); external interfaces are preserved]
Figure 4: Modular System Level Innovation, Radical Sub-System Level Innovation
In this section, we have examined several important concepts and frameworks that will be
applied throughout this thesis.
4. Embedded Systems
Multi-core processors have been broadly adopted in desktops, laptops and servers over the
last 3-5 years. The current portfolio of processors from Intel and AMD that target these devices
are almost all multi-core today5 . All major operating systems for these devices (Windows 7,
Apple OS-X and Linux) support multi-core processors and applications seem to run as reliably as
they did on single core machines, in many cases with increased system performance.
So why hasn't this transition happened in a similar fashion within embedded systems?
This section provides important contextual information, outlining the key elements of
embedded systems, such as processors, operating systems and programming languages, and their
key attributes, such as diversity, relevant knowledge and design constraints.
Embedded Systems and Embedded Processors
PCs, laptops and servers are considered general-purpose computers, meaning they are
designed to run a broad class of applications. The function of the computer is largely defined by
the software that the computer user is running on it.
Embedded systems, on the other hand, are generally classified as systems that contain a
processor and are designed to deliver a specific set of dedicated functions. Embedded systems can
be extremely simple like the controller for a microwave, which is typically powered by a simple
8-bit microcontroller. They can also be highly sophisticated, such as multichannel wireless
processing systems within a cellular base station. An embedded system may also be part of a
compositional hierarchy and thus a component within a larger system. For example, modern
automobiles contain several embedded systems to control various elements of the vehicle: from
cruise control to lane departure warning systems to the timing of the engine itself.
Embedded systems are used in all of the major electronics segments including consumer,
communications, automotive, industrial, instrumentation, healthcare, military, and aerospace.
Even general purpose computers contain smaller embedded systems to control things like the
power supply or the DVD drive, for example.
5 Intel Processors: http://www.intel.com/products/desktop/processors/index.htm
AMD processors: http://www.amd.com/us/products/Pages/processors.aspx
[Figure 5 chart omitted: 2008 microcontroller market revenue = $13.7 billion USD, split across general purpose, consumer and other segments]
Figure 5: 2008 Microcontroller Revenue by Market (Databeans, 2008)
The first key attribute that distinguishes embedded systems from general purpose computers
is that they are extremely diverse.
Embedded systems are typically powered by a class of processors known as embedded
processors. Embedded processors vary widely in their capabilities based on the applications
they serve and thus are equally diverse. There are several classes of embedded processors:
microcontrollers (MCUs), digital signal processors (DSP), microprocessors, System-on-Chip
(SoCs) and more.
Microcontroller (MCU): A processor that has a rich variety of on-chip peripherals that are
optimized for specific system functions.
Microprocessor (MPU): In contrast to a microcontroller, a microprocessor is a more
powerful processor typically designed for general purpose functions.
Digital Signal Processor (DSP): A DSP is a type of processor that is optimized for
real-time and computationally intensive applications. DSPs are used in audio & video
processing, wireless communications and system control applications.
System-on-Chip (SoC): An SoC is typically a collection of processor cores and dedicated
hardware optimized for different tasks.
A large majority of the processors shipped world-wide each year are embedded processors.
In 2010, 9.01 billion embedded processor units were shipped according to VDC
Research (2011). In 2009, 308.3 million PC units were shipped according to Gartner
research (2010) and 6.6 million server units shipped according to IDC (2010). While some
classes of servers and PCs utilize multiple processors, the embedded market is still at least an
order of magnitude larger in size in terms of unit shipments.
In contrast to the heterogeneity of embedded systems, the general purpose processors that
power our PCs, laptops and servers are comparatively homogeneous. Migrating an application
from a laptop with an Intel processor running Windows to a laptop with an AMD processor
running Windows is a trivial process. While there may be subtle performance differences, the
application will run without needing to be re-architected or recompiled.
In the embedded systems domain, migrating from one processor to another is not, however, a
trivial process because the processor architectures, instruction sets and capabilities vary to a
much greater degree. Some amount of work is almost always required on the software to move
between processors. If migrating between two processors with the same instruction set from the
same vendor, the changes may be smaller but if switching vendors and instruction sets, the
changes can be significant.
Embedded Operating Systems
The majority of embedded systems run an operating system, just like general purpose
computers. An operating system is a piece of software that streamlines application
development by providing a set of common system functions and hardware management upon
which applications can be built. Operating systems can provide a wide variety of functionality
via an abstraction layer: rather than implementing them from scratch, programmers can quickly
leverage functions such as task management, memory management, system resources, user
interface components, file systems, networking, security, device management, and more.
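As a rough sketch of the task-management service an operating system provides, here is a toy cooperative round-robin scheduler in C. The function names are invented for illustration and do not correspond to any real RTOS API; a real operating system adds preemption, priorities, per-task stacks and inter-task communication on top of something like this:

```c
#include <stddef.h>

typedef void (*task_fn)(void);

#define MAX_TASKS 4

static task_fn task_table[MAX_TASKS];   /* registered tasks */
static size_t  task_count;

/* Register a task with the scheduler.  Returns 0 on success,
 * -1 if the task table is full. */
int task_create(task_fn fn)
{
    if (task_count == MAX_TASKS)
        return -1;
    task_table[task_count++] = fn;
    return 0;
}

/* One round-robin pass: run each registered task to completion. */
void scheduler_run_once(void)
{
    for (size_t i = 0; i < task_count; i++)
        task_table[i]();
}
```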
In the general purpose space, Microsoft Windows variants, Apple's OS-X and Linux variants
power the majority of desktop and server applications.
[Figure 6 chart omitted: desktop OS market share is dominated by Windows, with Mac, iOS, Java ME, Linux and Android each holding small shares]
Figure 6: OS market share for desktop computers
6 From netmarketshare.com: http://netmarketshare.com/operating-system-market-share.aspx?qprid=8
In the embedded space, about 70% of embedded systems use an operating system (UBM/EE
Times Group, 2010). Just like embedded systems and embedded processors, there is much
greater diversity in the breadth and nature of these embedded operating systems, again due to the
diversity of problems that embedded systems are designed to solve. For processors that are
managing a diverse set of tasks like networking, user interface applications or system control, a
larger operating system that has more building blocks for these types of functions -- like Linux or
Windows CE -- may be more appropriate. For a device performing a fixed function like audio or
video processing, it can potentially use a very small operating system like FreeRTOS or uCOS
II. Figure 7 below shows the most commonly used embedded operating systems.
[Figure 7 chart omitted]
Figure 7: Most commonly used operating systems in embedded systems
With such a diverse set of operating systems available, and considering that 30% of embedded
systems don't use an operating system, migrating embedded system designs between embedded
operating systems can be time consuming.
System and Processor Diversity
The first important attribute is diversity. Embedded systems themselves are highly diverse
and so are the processors and the operating systems that power them. Unlike the general purpose
computing space which has been dominated by a few processor architectures (Intel and AMD)
and operating systems (Windows, Linux, OS X), there are a number of companies developing a
number of variants of embedded processors today which service both broad and narrow market
segments demanding a great diversity in functionality, connectivity and performance. This
diversity means that migrating between embedded processors and embedded operating systems
often involves software rework.
7(UBM/EE Times Group, 2010)
More Limited Resources
General purpose programmers developing x86 applications to run on Windows, Linux and
Mac platforms enjoy a number of conveniences that programmers of embedded systems do not
share. They rarely need to take into account memory limitations because modern PCs and
servers have so much of it. Furthermore, virtual memory provides an abstraction layer that
allows all applications to allocate and access massive amounts of memory without any
knowledge of the actual hardware configuration. Embedded programmers on the other hand,
typically have much smaller memories to work with and in many cases, need to manage this
memory more carefully and manually, taking into account the actual hardware configuration.
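One common manual-management idiom is to replace malloc() with a fixed, statically sized pool, so worst-case memory use is known exactly at build time. A minimal bump-allocator sketch (no free; the pool size and alignment policy are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 1024u

static uint8_t pool[POOL_SIZE];   /* the entire heap, sized at build time */
static size_t  pool_used;

/* Carve `bytes` out of the static pool.  Returns NULL when the pool
 * is exhausted, so the caller can handle the shortfall explicitly. */
void *pool_alloc(size_t bytes)
{
    bytes = (bytes + 3u) & ~(size_t)3u;   /* round up to 4-byte alignment */
    if (pool_used + bytes > POOL_SIZE)
        return NULL;
    void *p = &pool[pool_used];
    pool_used += bytes;
    return p;
}
```

Because the pool is a fixed array, the linker map shows the exact RAM footprint, which is precisely the visibility an embedded engineer needs when sizing a part.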
On-chip memory can be a dominant percentage of the die area for a product and has a direct
impact on the cost of the chip. More on-chip memory means a larger die size for the chip, which
increases the manufacturing costs for the suppliers. By constraining their applications to a small
size, embedded systems programmers can also fit them into less expensive processors thus
bringing down their own system costs. The same tends to be true for clock speeds. A processor
supplier may yield a small percentage of processors that run at a higher frequency which they
sell at a premium. Embedded systems engineers can save cost not only by using lower-memory
variants of processors but also by using slower variants as well.
Platform "Stickiness"
By optimizing for specific processors, firms' product designs for embedded systems can
become more coupled to their processors. Essentially, embedded systems programmers are
faced with a key tradeoff in their designs between unit cost and design cost. By keeping their
software in a high level language like C and in a portable, modular structure, they can more
easily move between processor platforms, reducing design costs. However, by optimizing their
application for a platform, they may be able to reduce system cost by fitting into a less expensive
processor, thereby lowering unit cost. A common example of this is hand-optimizing pieces of
code in the native assembly language of the processor. The result is the software becomes more
tightly coupled to the processor platform and thus the platform becomes sticky within the firm.
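The tradeoff can be shown in miniature: a saturating add written once in portable C, with a platform-specific fast path a team might enable for one processor family. Both the #ifdef branch and the intrinsic it calls are hypothetical, and the branch is exactly what makes the code "sticky" to that platform:

```c
#include <stdint.h>

/* Saturating 16-bit add: clamp the result to the int16_t range. */
int16_t sat_add16(int16_t a, int16_t b)
{
#if defined(USE_DSP_INTRINSICS)
    /* Hypothetical single-cycle intrinsic for one processor family;
     * enabling this path couples the code to that platform. */
    return __builtin_dsp_sadd16(a, b);
#else
    /* Portable fallback: correct everywhere, but slower on a DSP. */
    int32_t s = (int32_t)a + b;
    if (s >  32767) return  32767;
    if (s < -32768) return -32768;
    return (int16_t)s;
#endif
}
```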
Code reuse is an extremely common practice in embedded system design. In 2010, 78% of
embedded projects reused code that was developed in-house (excluding commercial off-the-shelf
and open source software) and only 14% of projects reported no reuse at all (UBM/EE Times
Group, 2010). If a new project is to reuse existing software, and that software was optimized for
a specific processor, there will be additional incentive to keep the processor platform constant in
the new design.
Tacit knowledge also contributes to the stickiness of a processor. Because processors vary
so widely in their architecture and more importantly, their design and debug tools, firms must
learn how the processor architecture functions, how the development and debug tools work and
how to solve problems. This tacit knowledge is built up over time. At the beginning of the
learning curve, the firm is wrestling with new types of problems that may result in slipped
schedules and sub-optimal system performance in the first product built around a new processor.
However, once the tacit knowledge is established, it becomes an important asset - the firm has
confidence that they can reliably and predictably develop products around a processor platform;
this has a great deal of value when the time-to-market of the end product is important.
There are several emerging modern programming languages that have been specifically
designed for the programming challenges of multi-core and multiprocessing systems (Patterson,
2010). In the embedded world, however, most projects are still written in C. The C
programming language was developed at Bell Labs in 1971 (Ritchie, 2003); in 2011, 62% of
embedded system projects still use C as the primary language, and this proportion has been more
or less constant for the last 5 years (UBM/EE-Times, 2011). While adoption of these newer
languages could help make embedded systems engineers more productive when developing
multi-core designs, they're not yet on the radar in the embedded space (UBM/EE-Times, 2011).
Product Life Cycle & Legacy
In several embedded market segments, the products in which embedded systems are used
may have very long life cycles. It is not uncommon in industrial and military applications for a
product to remain on the market for a decade, with the original model still enjoying sales ten
years after its launch. An organization must typically preserve several knowledge assets related to
these products while they're in production (and for several years following). Firms in these
market segments with longer product life cycles must incorporate the risk of longer term
obsolescence in their product selection criteria. They need to ask not only whether the processor
will still be supported and sold in 5 to 10 years but also whether the company itself is likely to
survive for 5 to 10 years. Since the firm will need to maintain its tacit knowledge for its existing
products, there is an additional incentive to develop new products that leverage the tacit
knowledge it must maintain anyway.
Real Time Constraints
A real time constraint or capability means that a system needs to be capable of responding to
certain types of events within a predefined amount of time. Hard real time is used to describe
real time requirements that, if violated, may result in a system level failure. For example, an
embedded system controlling airbag deployment in a car will need to deploy the airbag within a
certain amount of time after the collision is detected for the airbag to provide protection. Soft
real time describes real time constraints that, if violated, may result in decreased system
performance but not failure. For example, a system that decodes a video stream will need to
decode each frame in a certain amount of time. If a single frame isn't decoded in time, the video
may glitch but the system will continue to run.
Around 75% of embedded systems have some form of real time constraints (UBM/EE-Times
Group, 2010) and many systems have hard real time constraints. Designing systems with real
time constraints and particularly hard real time constraints can be very challenging. If a
processor is running an operating system and handling several different tasks, it may need to
rapidly switch from one task to another to respond to an event associated with a real time
constraint. Operating systems are classified as real time operating systems (RTOS) if they
can support fast task switching to respond to system events in a deterministic amount of time.
If a system has been tuned to meet certain soft and hard real-time constraints, this may be yet
another impetus to stick with the existing software / hardware design if possible.
Platform Certification
In cases where the end product will communicate over a network or be used in a safety-critical
application like automotive or healthcare, there may be a certification process that the
product must go through before it can be commercially sold. Mobile phones, for example, need
to go through certification to prove that they comply with wireless standards and won't
negatively impact the wireless networks they'll be part of. Changing software and processor
platforms often requires recertification, which costs time and money.
Summary
To summarize, embedded systems are extremely diverse and so are the processors and
operating systems that power them. There is also a great deal of inertia around processor
platforms, development tools, operating systems, real time performance and more. This inertia
emerges from the diversity of the technology, the switching costs associated with changing
processors and tools, certification status, the level of reuse and the product life spans. This
inertia can even inhibit the adoption of processors that may only require incremental changes to
existing software.
For more radical technologies like multi-core processors, this inertia is even more powerful,
particularly because multi-core processor design requires genuinely new types of knowledge.
Programming parallel applications requires very different skills, especially when the
parallelism is realized at high levels in the structure of the application.
5. Forms of Parallelism & Concurrency
To understand why multi-core programming requires a very different type of knowledge, it is
important to understand how parallelism, inherent in multi-core processors, is typically
implemented in processors and to what extent programmers need to manage and architect their
applications around this parallelism.
Granularity
The level at which parallelism is implemented within a system is often described using the
qualitative measure of granularity. Granularity in the sense of parallelism can be thought of as
the size of the task utilizing a processing element (Bhujade, 1995). It can also be thought of as
the amount of work done between synchronizing events between parallel processing resources
(Tabirca, 2003). Coarse granularity refers to the allocation of large amounts of work to a
processing element while fine granularity refers to the allocation of small units of work to a
processing element.
An example of coarse parallelism could be the allocation of entire applications to different
processing elements. For example, on a dual core processor, we could run a web server on the
first core and a data collection/analysis application on the second core. An example of fine grain
parallelism could be splitting the left and right channels of an audio processing algorithm across
two computation units within a processor core.
Fine grain parallelism typically occurs within low-level modules of an application and can be
managed at the component or module level. However, coarse grain parallelism occurs at higher
levels and must be addressed at the architectural level. Multi-core programming is
fundamentally coarse grain parallelism which is why, as we'll see, it has more to do with
architectural optimization than modular optimization.
Types of parallelism
There are two fundamental forms of parallelism. Data parallelism is the capability to
operate on multiple pieces or streams of data in parallel. Task or instruction parallelism refers
to the capability to run multiple independent instructions simultaneously. Parallelism can also be
implemented at various levels in the processor design hierarchy. Bit level parallelism, for
example, is very low level, often transparent to the programmer, and affects memory reads and
writes. Thread or task parallelism, on the other hand, exist at higher levels within the software
architecture and tends to be more heavily managed by the programmer or the operating system.
Bit Level Parallelism
For a processor to perform an operation on a word that is larger than the native word length
of the machine and its memory bus, the processor needs to perform multiple accesses to memory
to retrieve the individual components of that word. In the case of a 16-bit processor performing
a 32-bit operation, for example, the processor would need to first fetch the lower 16-bits of the
word and then the upper 16-bits of the word. Bit-level parallelism means that the width of the
data buses is increased to reduce the number of cycles required to fetch words that are larger than
the native word length. This had been the dominant form of parallelism found in general
purpose processors until about 1986 when 32-bit bus widths became mainstream (Culler, et
al., 1999). However, bit level parallelism is still commonly used in the embedded space where 8
and 16-bit processors are widely sold.
Figure 8: 2008 Microprocessor Revenue by Processor Type (Databeans, 2008)
Bit-level parallelism is automatically handled by most processors. On many modern
processor architectures, a programmer typically doesn't need to instruct the processor to fetch
two 16-bit words. They can perform a 32-bit read and the processor will automatically perform
two sequential reads of 16-bit values over a 16-bit memory bus.
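What the hardware does automatically here can be made explicit in C. The function below is purely illustrative: it expresses a 32-bit addition as the pair of 16-bit operations a narrow core would carry out internally, including the carry propagation that wider buses and ALUs make unnecessary.

```c
#include <stdint.h>

/* Illustrative sketch: a 32-bit addition expressed as the two 16-bit
 * operations a 16-bit core would perform internally.  The operands are
 * split into halves; the low halves are added first, and the carry out
 * of that add is propagated into the high halves. */
uint32_t add32_via_16(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)(a & 0xFFFF), a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)(b & 0xFFFF), b_hi = (uint16_t)(b >> 16);

    uint32_t lo_sum = (uint32_t)a_lo + b_lo;           /* first 16-bit add */
    uint16_t carry  = (uint16_t)(lo_sum >> 16);        /* carry out of low half */
    uint16_t hi_sum = (uint16_t)(a_hi + b_hi + carry); /* second 16-bit add */

    return ((uint32_t)hi_sum << 16) | (lo_sum & 0xFFFF);
}
```

A wider bus and ALU collapse this entire sequence into a single operation, which is the performance benefit bit-level parallelism delivers.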
Instruction Level Parallelism (ILP)
Instruction Level Parallelism (ILP) means that a processor is able to execute multiple
instructions in a single cycle.
ILP can be implemented in a serial fashion whereby instructions are broken into smaller
pieces of work that can be executed at a faster rate. This technique is known as "pipelining" and
is commonplace in many modern processors.
A common analogy to describe a pipelined architecture is a factory assembly line. Imagine a
single worker performing ten tasks and each task requires an equal amount of time to
accomplish. Now imagine hiring nine more workers and lining them up so that each worker
handles one of those tasks. The result is that it takes 10 steps to complete the work but the
throughput is now ten times greater.
ILP can also be implemented in a parallel fashion whereby a CPU may be able to execute
more than one instruction at the same time. Flynn's taxonomy provides a useful means of
categorizing the four types of parallel operations at the instruction level (Flynn, 1972).
SISD (Single Instruction, Single Data) - a SISD processor is capable of executing a single
sequence of instructions and operating on a single data stream. Most embedded processors
which target simpler control applications utilize SISD architectures.
SIMD (Single Instruction, Multiple Data) - a SIMD processor is capable of executing a
single sequence of instructions on more than one data stream. SIMD is very useful in
applications which contain multiple independent datasets that need to be processed in an
identical fashion. Audio and image processing lend themselves well to SIMD, for example.
When processing stereo audio, there are two data streams - left and right channels. A SIMD
capable processor can execute one instruction and operate on independent data from both the left
and right channels. For a video application, a SIMD processor could operate on several pixels
simultaneously. In fact, the Intel MMX instruction set extensions are basically a SIMD engine
with 8 sets of computational units (Intel, 2000).
MIMD (Multiple Instruction, Multiple Data) - A MIMD processor is capable of executing
more than one instruction per cycle and operating on more than one data stream. A common
example of a MIMD implementation is a superscalar processor, which is still considered a single
CPU but is able to dispatch instructions to multiple computation blocks.
In the author's experience, flexible superscalar architectures can often be programmed fairly
efficiently from high-level languages if the programmer understands the architecture and uses
compiler directives and intrinsic functions to maximize performance.
Both pipelining and superscalar execution became popular in the 1990s (Culler, et al., 1999).
The Pentium processor was Intel's first processor that supported both pipelining and superscalar
execution.
MISD (Multiple Instruction, Single Data) - a MISD processor is capable of executing
multiple instructions on a single stream of data. There aren't any mainstream MISD processor
architectures on the market; however, systems can be configured in a manner which reflects this
architecture. For example, a safety critical application may have two processors operating on a
single stream of data. The output from each processor is compared and if the results differ, it
indicates a system error (software or hardware) has occurred in one of the processors and the
output data is invalid. In some cases, firms will use two different processor architectures to
operate on the same data to protect against a situation in which a silicon anomaly on the
processor results in the calculation of matching but erroneous results.
For the programmer, instruction level parallelism within the processor is useful for fine-grained
parallelism in the application and can be managed by directing the compiler via compiler
directives. Programmers who understand the architecture may be able to achieve higher
performance. The manner in which data is organized, for example, may allow for more
aggressive use of instruction level parallelism. A SIMD processor like Analog Devices' SHARC
processor can operate on two data streams if the data is interleaved in a single array. However, if
there are two distinct arrays holding the two data streams, an architectural decision that may be
more intuitive and clean to the programmer, the processor is unable to utilize both computational
units because the memory system cannot support loading operands from non-interleaved
memory into the core in a single cycle. Also, most compilers provide extensions via #pragma
statements that allow the programmer to provide additional information to the compiler about the
nature of the data so the compiler can safely parallelize certain regions of code.
Instruction and bit level parallelism are typically confined within low-level software modules
and may improve the performance of that module but don't typically impact the system
architecture.
Task Parallelism & Operating Systems
Task parallelism refers to the distribution of coarse grain tasks at the operating system level
across processor cores in the system. In contrast to instruction and bit level parallelism, task
parallelism can exist at both the modular level for finer-grain parallelism and architectural level
for coarser-grain parallelism. A single programmer may be able to manage instruction and bit
level parallelism but on larger systems, coarse-grain task level parallelism may affect several
programmers due to its coarse grained nature.
There are three very important elements related to task-level parallelism.
Dominant Design For Operating Systems
The first element is that a dominant design has emerged around the way task-level
parallelism is implemented within operating systems. The design paradigm consists of
processes and threads. A process is a set of instructions that make up a single instance of a
program; a multitasking operating system can support several processes. A process is composed
of one or more threads. A thread is similar to a process in that it is a set of instructions but
threads running within a process share the same memory space. A process may consist of a
thread that manages user interface interactions and a second thread that manages data processing.
The user interface thread can store state information related to the user interface in this shared
memory space which the data processing thread can then read from.
Different thread and process configurations are shown in Figure 9 below.
Figure 9: Threads and Processes - Possible Configurations8
Applications built upon a collection of threads are referred to as multi-threaded programs;
monolithic applications that aren't broken down into threads that the OS can manage are referred
to as single-threaded programs.
Reciprocal Support in Processors
This design has become prevalent enough in both embedded and general purpose computing
that features to facilitate it, such as hyper-threading and symmetric multiprocessing (SMP),
have been implemented in processors. This is the second key element related to task-level
parallelism.
8 Image from: http://www.cs.cf.ac.uk/Dave/C/node29.html
SMP, which is covered in more detail in the next section, allows the threads that make up a
process to be distributed across multiple cores either manually by the programmer or
automatically by the operating system. This is very powerful because it means that programmers
can follow the design practices they used for programming multi-threaded applications for
single-core processors, and the operating system can decide in real time which core to run a
given thread on. Hence, applications don't necessarily need to be completely
repartitioned to run on a multi-core device.
Writing multithreaded applications for multi-core processors does introduce several new
challenges for programmers.
Code becomes more difficult to predict than a traditional single threaded application because
a preemptive operating system may interrupt threads at arbitrary points and switch to other
threads.
"A folk definition of insanity is to do the same thing over and over again and to
expect the results to be different. By this definition, we in fact require that
programmers of multi-threaded systems be insane. Were they sane, they could not
understand their programs." - Edward Lee (2006)
Software modules that utilize synchronization locks are not composable, meaning that two
correct pieces of code cannot be combined to form a single, larger piece of correct code (Lee,
2006). The implication is that as we attempt to combine software using locks into larger pieces
of software, new failure modes will emerge. This can be particularly troublesome if some of
these pieces of software originated outside of the organization and source code isn't available.
Common failure modes with locks are conditions called deadlocks and livelocks. A deadlock
can occur, for example, when a thread on one core has locked one resource and is attempting to
lock a second. On another core, a thread has locked the second resource and is attempting to
lock the first resource. At this point, neither thread can proceed because they are both waiting
for the other thread to release a resource.
Thread 1                           Thread 2

void update1()                     void update2()
    acquire(A);                        acquire(B);
    acquire(B);  <<< Thread 1          acquire(A);  <<< Thread 2
                 waits here                         waits here
    variable1++;                       variable1++;
    release(B);                        release(B);
    release(A);                        release(A);
Listing 1: Example of a scenario that will lead to deadlock9
A livelock, on the other hand, occurs when two threads enter an unending loop of acquiring
and releasing. Livelocks can occur when attempting to write code to prevent deadlocks (Gove,
2011).
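One common remedy for the deadlock in Listing 1 is to impose a global lock order: if every thread acquires A before B, the circular wait cannot form. The sketch below illustrates that rule using pthread mutexes; the lock names mirror Listing 1, and the ordering discipline, not any particular API, is the point.

```c
#include <pthread.h>

/* Deadlock avoidance by lock ordering: both updaters acquire A before
 * B, so no interleaving can leave each thread holding the lock the
 * other one needs. */
static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;
static int variable1 = 0;

void update1(void)
{
    pthread_mutex_lock(&A);      /* always A first... */
    pthread_mutex_lock(&B);      /* ...then B */
    variable1++;
    pthread_mutex_unlock(&B);
    pthread_mutex_unlock(&A);
}

void update2(void)               /* same order, so no circular wait */
{
    pthread_mutex_lock(&A);
    pthread_mutex_lock(&B);
    variable1++;
    pthread_mutex_unlock(&B);
    pthread_mutex_unlock(&A);
}
```

With both updaters obeying the same order, any interleaving of update1() and update2() completes; the deadlock in Listing 1 required the two threads to take the locks in opposite orders.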
Application Portability Across Operating Systems and Architectures
The third element is that this design paradigm makes it easier to port applications between
operating systems which are built around this design. Furthermore, the POSIX (Portable
Operating System Interface) standard has emerged, which provides a common threading API. This further
improves the portability of applications across not only single core architectures but in some
cases, multi-core architectures.
The POSIX standard was defined in 1988 and aimed to improve the portability of
applications across platforms by defining several common Application Programming Interfaces
(APIs) (The Open Group, 2006). Its intent is to make porting a POSIX-compliant application
from one POSIX-compliant operating system to another extremely easy.
Several of the top embedded operating systems shown in Figure 7 on page 28 are POSIX
compliant, including QNX Neutrino10, VxWorks11, and Nucleus12. While Linux is not fully
POSIX-compliant, it is highly compliant, and POSIX-compliant applications can typically be
ported to Linux with little or no modification (Locke, 2006).
The POSIX API covers several different types of functions as shown in Figure 10.
9 (Gove, 2011)
10 http://www.qnx.com/news/pr 959 L.html
11 http://get.posixcertified.ieee.org/cert prodlist.tpl
12 http://www.mentor.com/company/news/nucleusposix07
[Figure content: the 1003.1 family of POSIX standards and extensions, including the base system
interfaces, test methods, real-time extensions (1003.1b), threads (1003.1c), Ada and FORTRAN 77
bindings, security, system administration, network file access, sockets (1003.1g),
batch/supercomputing extensions, and network directory/name services.]
Figure 10: POSIX APIs13
A subset of the POSIX standard addresses the standardization of threads (1003.1c) called
POSIX threads or pthreads. The pthreads library contains around 60 functions for thread
management. The API includes functions for creating and destroying threads as well as
functions for passing data between threads and synchronizing the threads (Gove, 2011).
As noted earlier, pthreads can be used on single core and multi-core processors and have
become a "standard commodity" in multi-core applications development (Freescale
Semiconductor, 2009).
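A minimal sketch of the pthreads calls named above - thread creation, argument passing, mutex synchronization and joining - might look like the following. The work-splitting example itself is illustrative, not drawn from the thesis.

```c
#include <pthread.h>

/* Minimal pthreads sketch: two threads each sum half of an array,
 * and a mutex guards the shared running total. */
static int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
static long total;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct range { int start, end; };      /* argument handed to each thread */

static void *sum_range(void *arg)
{
    struct range *r = arg;
    long partial = 0;
    for (int i = r->start; i < r->end; i++)
        partial += data[i];
    pthread_mutex_lock(&lock);         /* synchronize access to total */
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

long run_parallel_sum(void)
{
    struct range lo = {0, 4}, hi = {4, 8};
    pthread_t t1, t2;
    total = 0;
    pthread_create(&t1, NULL, sum_range, &lo);  /* spawn both threads */
    pthread_create(&t2, NULL, sum_range, &hi);
    pthread_join(t1, NULL);                     /* wait for completion */
    pthread_join(t2, NULL);
    return total;
}
```

The same code runs unchanged on a single-core or an SMP multi-core system; on the latter, the operating system is free to schedule the two threads on different cores.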
Summary
To summarize, instruction and bit level parallelism are common on single-core processors
and usually implemented at low levels in the system, so the improvements are modular in nature.
Task level parallelism is implemented at both the modular level (fine grain parallelism) and at
the architectural level (coarse grain parallelism).
Task level parallelism is implemented in a common fashion across most operating systems today,
which allows for greater portability of applications between operating systems built around this
design. Furthermore, when properly supported by the operating system and the hardware,
applications can be ported between single core and multi-core devices without fundamentally
rearchitecting the software systems. While this solution doesn't apply universally as we'll see in
subsequent sections, it is a critical component in the adoption patterns.
13 http://www.comptechdoc.org/os/linux/programming/c/linux pgcintro.html
6. Attributes of Multi-core Processors
In the last section, we examined forms of parallelism in processor architectures and how task
parallelism has been standardized across operating systems. The architecture of a multi-core
processor has significant impacts on the challenges associated with adoption. As noted earlier,
reciprocal functions in the processor like SMP can ease the challenges of adoption. However,
it's not universally applicable and has fundamental limits.
This section will examine several important architectural attributes of multi-core processors
that differentiate them from single core devices and from each other, such as the type and
number of cores, how resources are shared and the architecture of shared resources, and how
multi-processing takes place. Each of these attributes directly affects the challenges associated
with adoption.
Number of cores
By definition, a processor which contains at least 2 cores is a multi-core processor. Multi-core
processors that have on the order of tens, hundreds or even thousands of cores in a single
package are known as many-core processors.
Types of Cores - Homogeneous and Heterogeneous
A system that contains multiple cores is said to be heterogeneous if the processor core
architectures and instruction sets of those cores are different. Conversely, a system is called
homogeneous if it contains multiple identical processor cores.
Homogeneous processors may contain 2, 4, 8 or even 100 processor cores with the same
instruction set architecture and can be applied to general purpose computing applications or
embedded applications which need higher levels of performance than a single core device can
provide.
Heterogeneous processors can often provide a more optimized platform in that different types
of processor cores are optimized for different functions within the system. Many modern
System-on-Chip (SoC) devices utilize a heterogeneous collection of cores and/or hardware
accelerators which are each optimized for different system tasks.
The Intel media processor SoC shown in Figure 11 is designed to power DVD players, TVs
and cable set top boxes and it contains several different types of processors that are optimized for
the different functions that this device needs to perform. For example, a dedicated DSP is used
for audio processing while dedicated display and graphics processors are used for various video
functions. The Pentium M processor likely controls the entire system and elements like the user
interface, networking stacks, etc.
Figure 11: Intel Media Processor CE 3100 Block Diagram14
Several existing embedded systems already use a heterogeneous collection of processors
typically because certain tasks within an embedded system can run significantly more efficiently
on one type of processor over another. In a heterogeneous processor architecture, each processor
will typically run its own, unique application and the degree of modularity at the system level is
typically high - there is generally a significantly lower level of interaction between the
processors compared to the level of interaction between software modules running on a
processor (Levy, 2008).
Resource Sharing
Processor cores can be collocated on a single, monolithic silicon die - these are referred to as
chip multiprocessors (CMP). In a CMP configuration, the cores may share several of the same
resources including I/O peripherals, memory, and more.
Figure 12: Apple A5 dual-core ARM A915
14 http://download.intel.com/design/celect/downloads/ce3100-product-brief.pdf
15 http://www.abdulrehman.net/wp-content/uploads/2011/03/Apple-A4-and-A5.jpg
Multiple processor cores can also be on different silicon die but integrated into the same
package as a multi-chip module (MCM) as Intel did with their Yorkfield processor shown in
Figure 13 below.
Figure 13: Intel "Yorkfield" Quad-core MCM16
Resource sharing can have a significant impact on application performance on a multi-core
processor and it can also affect the scalability to a greater number of cores. For example, if all
cores share a common external memory bus, the amount of bandwidth that each core has
decreases as the number of cores increases.
Memory Architecture
The memory architecture of a multi-core processor can vary widely and can have a
significant impact on the application performance if not understood or exploited properly.
A key characteristic of memory is latency - the time required for a processor to perform a
read from the memory system. A memory system with a low latency can be read and written to
faster than a memory system with high latency.
As memory is located further from a processor core, the latency increases because it
physically takes more time for the control and data signals to travel between the core and the
memory system and for a value to be returned from memory. When memory is located off chip,
it typically has a much higher latency than memory located on chip. On larger general purpose
processors, external memory accesses can require hundreds of processor clock cycles. For
example, an AMD Athlon 64 at 2.0GHz with DDR-400 memory has a memory latency of 50ns17.
At 2GHz, the cycle time of the processor core is 0.5ns which means the core waits around 100
cycles for a read to complete. An example we will examine below utilizes an ARM Cortex-A9,
which has an external latency closer to 20 cycles.
External memory latency is particularly important for multi-core processors because these
devices typically have one external memory bus that the cores share. As noted above, shared
resources can degrade performance. For example, if 2-3 cores try to access external memory
16 http://hothardware.com/articleimages/Item1289/small Intel-08200S-0955S-yorkfield.jpg
17 http://www.anandtech.com/show/1610/6
simultaneously, some cores will need to wait for several hundred clock cycles for the access to
complete.
The memory system is typically organized in a hierarchical structure - there is a small
amount of on-chip memory close to the processor that can be accessed at the clock speed of the
processor, which is referred to as L1 memory. Some processors also have a larger amount of
on-chip memory located further from the core with a longer latency, called L2 memory. Most
processors also support external memory, which is sometimes referred to as L3 memory.
Some types of embedded processors that are designed for a broad variety of application tasks,
like the ARM, MIPS and PowerPC families of devices, utilize L1 and L2 as cache memories. Some
embedded processors that are more optimized for real-time and signal processing applications
may offer the flexibility of using L1 memory as either cache or random access memory. The
Blackfin processor from Analog Devices, for example, allows L1 to be configured either as
cache or RAM18. This option exists because it isn't always necessary to use external memory in
an embedded system - applications can be small enough to fit within the internal RAM of the
device.
Faster layers of memory closer to the processor core serve as cache for slower layers of
memory further from the core. If the core fetches a value from external memory, it can also be
stored in L1 cache. When the processor needs to operate on that value again, it can access the
value in L1 cache, which can be significantly faster than L3. The diagram in Figure 14 shows the
memory hierarchy for a system with an L1 and L2 cache and external L3 memory along with
example latencies.
[Figure content: On-chip: Processor - L1 Cache (1 cycle) - L2 Cache (7 cycles); Off-chip: L3 Memory (20 cycles)]
Figure 14: Memory access times (core clock cycles) for different levels of memory19
Consider an ARM Cortex A9 with 32KB of L1 cache, 512KB of L2 cache and external
SDRAM operating at 100MHz (10ns cycle time). L1 cache can operate at the clock speed of the
18 The Blackfin datasheet contains information about configuring L1 memory as SRAM or cache:
http://www.analog.com/static/imported-files/datasheets/ADSP-BF531_BF532_BF533.pdf
19 Adapted from a diagram in (Gove, 2011)
processor. As can be seen in Figure 15 below, data sets smaller than 32KB can fit in L1 cache
and can be accessed with a latency of 10ns once they've been cached. As the data set increases
beyond the size of L1 cache, the L2 cache is used, which results in larger latencies. At a certain
point, the dataset is larger than both caches, the benefits provided by cache are no longer
perceptible, and most accesses to external memory incur the full 200ns latency.
Figure 15: Memory Latency on ARM Cortex A9 (additional latency in cycles vs. data set size in KB)20
Shared Memory
A common memory architecture for multi-core processors is the shared memory architecture,
which follows the hierarchical paradigm presented above. In this configuration, several
processors can uniformly access the same memory via the same memory addresses. This
approach provides several advantages. It simplifies the programming model because each core
can operate on the same dataset by placing the data set in a shared region of memory rather than
creating redundant copies of the data set and placing these in each core's local memory. Shared
memory is also a means for tasks running on different processor cores to communicate with each
other. For programmers migrating from a single core design, tasks now spread across different
processor cores can share memory just as they did in the single core design.
20 http://www.ruhr-uni-bochum.de/integriertesysteme/emuco/files/System Level Benchmarking Analysis of the Cortex A9 MPCore.pdf
Figure 16: Shared Memory Architecture
In many multi-core systems, the processors also have their own local cache memories as seen
in Figure 17. If Core 0 has retrieved a value from memory, it will be cached in its local cache
memory. If Core 0 then modifies this value, the value stored in cache no longer matches the
value stored in memory. If Core 0 were the only processor in the system, the cache could
perform a write-back operation at some point in the future to synchronize the memories.
However, in a multi-core system, Core 1 may access the 'old' value in memory on the cycle after
Core 0 modifies the value in its local cache. A solution to this problem is to develop a
mechanism that allows the cache memories to synchronize. Cache coherency is an architectural
feature that allows processor cores with local cache memories to maintain coherency across
memory.
Figure 17: Shared Memory Architecture with Cache Coherency
Cache coherency improves the performance of a shared memory system by adding the benefits of caching. However, the challenge of keeping the cache memories synchronized grows rapidly as the number of cores increases. Intel refers to this as the "coherency wall" - an increase in resource sharing can have adverse performance effects as the number of cores is increased (Matson, 2010). Today, most embedded processors with a shared memory architecture don't support coherency across more than 4 cores. The ARM Cortex A9, as well as the forthcoming ARM Cortex A15, for example, can support memory coherency across 4 cores21.
21 ARM Cortex A9 Specs: http://www.arm.com/products/processors/cortex-a/cortex-a9.php
ARM Cortex A15 Specs: http://www.arm.com/products/processors/cortex-a/cortex-a15.php
Distributed Memory
The distributed memory architecture is one in which each processor has a private local memory and the cores are linked via an on-chip network. A key benefit of these architectures is scalability, in part because the coherency wall doesn't apply: cache memory doesn't need to be synchronized between all of the cores. However, these architectures can be more difficult to program than shared memory architectures for a few reasons. Memory isn't implicitly shared, which means that data needs to be explicitly managed, and communication between cores becomes more explicit as well. With a shared memory model, one core could write to a shared location and several other cores could read that value. In a distributed model, the first core needs to explicitly communicate with each receiving core.
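The send-per-peer pattern described above can be sketched in C, with POSIX pipes standing in for on-chip network links. This is only an illustration of the programming model, not any vendor's API; the function name, peer count and pipe transport are all invented for the example.

```c
#include <unistd.h>

#define N_PEERS 3  /* "cores" 1..3, each with its own private local memory */

/* In a distributed memory model, core 0 must perform one explicit send
 * per receiving core; the data is copied N times, unlike the shared
 * memory case where a single write to a shared location suffices. */
int broadcast_value(int value, int received[N_PEERS])
{
    int link[N_PEERS][2]; /* pipes stand in for on-chip network links */
    for (int i = 0; i < N_PEERS; i++)
        if (pipe(link[i]) != 0)
            return -1;

    /* one explicit message per peer */
    for (int i = 0; i < N_PEERS; i++)
        if (write(link[i][1], &value, sizeof value) != (ssize_t)sizeof value)
            return -1;

    /* each "peer" reads its copy into its own local memory */
    int ok = 0;
    for (int i = 0; i < N_PEERS; i++) {
        if (read(link[i][0], &received[i], sizeof received[i])
                == (ssize_t)sizeof received[i])
            ok++;
        close(link[i][0]);
        close(link[i][1]);
    }
    return ok;
}
```

The explicit copy loop is exactly the overhead the text describes: the cost of a "broadcast" grows with the number of receiving cores.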
Figure 18: Distributed Memory Architecture
Hybrid Variants
The shared memory and distributed memory architectures are two extremes, and there are hybrid implementations that borrow from each. In some forms of distributed memory architectures, the cores can access shared memory over the on-chip network. Tilera, for example, implements a distributed cache on top of a mesh-based distributed memory architecture that supports cache coherency between groups of cores22.
Multi-Threading
As the number of processor cycles required to complete external memory accesses has
increased, processor cores can spend more time simply waiting for data reads to complete. On a
typical single core processor running a preemptive operating system, processes and threads share
time slots on a single CPU. When a process / thread performs an external memory read, the
entire system must wait for that operation to complete.
22 http://www.tilera.com/sites/default/files/productbriefs/TILEPro64 Processor PB019 v4.pdf
Vertical multi-threading is an architectural technique to make more productive use of cycles spent waiting for a long-latency memory operation to complete. A processor with multi-threading capabilities has two or more sets of registers to store thread state. While one thread is stalled due to a long-latency memory read, another thread can execute.
An architecture which supports Simultaneous Multi-threading (SMT) can execute multiple threads concurrently. The figure below shows the difference between single-threaded, vertical multi-threaded and simultaneous multi-threaded execution.
Figure 19: Single Threading vs. Super-threading vs. Simultaneous Multithreading (timelines shown for the SPARC64 VI with VMT and the SPARC64 VII+ with SMT)23
A processor architecture which supports SMT presents itself as a multi-core architecture to the operating system even when there is only one physical core. A dual core processor that can support 4 threads per core will appear to the operating system as an 8 core processor.
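This visibility can be observed directly from software. On a POSIX system, the count the operating system reports includes hardware threads, not just physical cores; the sketch below (the function name is invented for illustration) simply wraps the standard query.

```c
#include <unistd.h>

/* Number of logical processors the OS can schedule onto.
 * On an SMT part this counts hardware threads, so a dual-core,
 * 4-thread-per-core device would report 8 here. */
long logical_processors(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n > 0) ? n : 1;  /* fall back to 1 if the query fails */
}
```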
Asymmetric Multiprocessing (AMP) and Symmetric Multiprocessing (SMP)
An asymmetric multiprocessing (AMP or ASMP) configuration is one in which cores run their own operating systems and applications and are controlled independently, much like single processor systems, as shown in Figure 20 below. The programmer must explicitly manage the sharing of resources and the communication between the cores (Leroux, et al., 2006). AMP is well suited to existing, monolithic legacy code, where a legacy application can be placed on one core while an operating system runs on a second core.
23 http://www.fujitsu.com/global/services/computing/server/sparcenterprise/technology/performance/processor.html
Figure 20: AMP Configuration24
In contrast, a symmetric multiprocessing (SMP) configuration is one in which a single instance of an operating system is run across several cores and the operating system controls all system resources like memory and I/O. In this configuration, processes and threads can be assigned to different cores dynamically, based on the loading levels of the different cores, or statically by the programmer.
Figure 21: SMP Configuration25
24 Adapted from a diagram in (Leroux, et al., 2006)
25 Adapted from a diagram in (Leroux, et al., 2006)
This means that a POSIX-compliant multithreaded application can be ported to a POSIX-compliant SMP operating system and take advantage of multiple cores, often without requiring a fundamental re-architecture of the system, a topic which will be explored in the next chapter of this thesis. There are still several challenges that exist for programmers, but in many cases this allows them to preserve most of their existing knowledge and architecture and ease into multi-core development.
Several mainstream embedded operating systems support SMP, including Linux26, VxWorks27, QNX28, Enea29 and more.
There are several types of embedded multi-core processors today that support SMP. First, there are several processors built on the ARM11, ARM Cortex A9 and forthcoming ARM Cortex A15 cores. Included in the list of multi-core Cortex A9 processors are the Apple A5 which powers the iPad 230, the NVidia Tegra 2 which powers the Motorola Xoom tablet running Android v3/Honeycomb31, the Samsung Exynos32, the Texas Instruments OMAP433, and several more. In the second group are standard MCU designs that are not based on ARM cores but do support SMP, such as the PowerPC-based Freescale MPC745x family built on the e600 core. Finally, there is an emerging class of many-core processors, such as Tilera's TilePro family of 32-100 core processors, which support SMP as well34.
Multi-core processors represent a significant departure from single-core processor architectures, and there is a significant amount of knowledge, tooling and product built around single core architectures. SMP is significant because it essentially allows the entire software industry to maintain its current knowledge and software assets while realizing some level of increased computational performance and improved battery life from multi-core architectures.
The Future of SMP
The processor industry has been speaking about "a day of reckoning" in which programmers
would need to abandon the sequential programming practices that have become so ingrained
26 http://www.ibm.com/developerworks/library/l-linux-smp/
27 http://www.windriver.com/products/platforms/smp/
28 http://www.qnx.com/news/pr 1962 1.html
29 http://www.eetimes.com/electronics-news/4136532/Enea-debuts-multicore-OS-combining-AMP-SMP-kernel-support?cid=NL Embedded
30 http://www.apple.com/ipad/specs/
31 http://www.anandtech.com/show/4117/motorola-xoom-nvidia-tegra-2-the-honeycomb-platform
32 http://armdevices.net/2011/02/11/samsung-orion-exynos-4210-arm-cortex-a9-in-production-next-month/
33 http://focus.ti.com/pr/docs/preldetail.tsp?sectionId=594&prelId=sc09021
34 http://www.tilera.com/products/processors/TILEPRO64
over the last several decades and deal directly with parallelism. This was the topic of a famous 2005 article by Herb Sutter entitled "The Free Lunch is Over" (Sutter, 2005) and was a heavily discussed topic on the opening day of the 2011 Multi-core Expo (EE-Times, 2011).
For the last four decades, the processor industry has delivered increasingly faster processors
year on year. Software developers have been able to rely on newer and faster processors to
increase the performance of their products by virtue of the fact that their existing code would
typically run faster on faster processors.
Figure 22: Intel Processor Clock Speed by Year35
Between the operating system and the architectural features of modern processors,
programmers today operate within a beautifully abstracted environment in which their processes
have the illusion of having massive amounts of memory, uninterrupted access to a processor and
protected, shared memory across all threads.
SMP has essentially allowed segments of the embedded systems industry to begin adopting multi-core processors while continuing to work in the threaded software paradigm. In essence, the portions of the industry that have been developing applications atop operating systems that support SMP have been able to treat multi-core almost as an incremental innovation. SMP is allowing the widespread programming practices used on single core processors to persist on multi-core machines (Lee, 2006).
While SMP and the operating systems which support it offer an effective abstraction layer to allow multithreaded programs to migrate to multi-core devices, it is known to have limitations.
Many suggest that SMP doesn't scale beyond 4-8 cores (Gwennap, 2010). What happens next?
35 http://smoothspan.wordpress.com/2007/09/06/a-picture-of-the-multicore-crisis/
7. Embedded Multi-core Processor Adoption
The preceding sections have provided a foundation by characterizing the dynamics of
adoption, the key characteristics of embedded systems, the general forms of parallelism and
concurrency and the key attributes of multi-core processors. This chapter explores the patterns of
adoption that are observed for multi-core processors in embedded systems that are primarily
driven by technical and architectural considerations; the following chapter builds on and expands
on this by considering broader system level factors and challenges affecting the adoption.
The first part of this chapter explores the important factors that influence the adoption of multi-core processors in embedded systems. Adoption depends on the balance between their benefits, such as higher performance and/or lower power, and the challenges, such as architectural issues, generality, code re-use, interactions amongst sub-systems, and optimization and debugging. The second part of this chapter builds on this by identifying four key adoption scenarios and the likely outcome in each case.
Benefits of Multi-core in Embedded Systems
Performance
Embedded multi-core adoption is widely observed in performance driven applications where
the problem could not be solved as economically with other technologies (Gentile, 2011).
Examples of high performance embedded applications that may require the use of multi-core
processors include wireless base stations, test & automation equipment, high performance video,
medical and high-end imaging devices and high performance computing (Texas Instruments,
2011).
There are also several existing applications that have historically used multiple processors in
conjunction to solve problems. Phased array radar, cellular base stations and networking
equipment are few such examples. In application areas like these that have historically required
the partitioning of such problems across multiple processors, there are typically a few 'wizards'
in the organization who are well versed in architecting and programming multiprocessor and
multi-core systems, as well as the problem solving techniques needed to effectively develop and
deploy high performance applications.
The problem is that such "wizards" represent a tiny minority of all embedded programmers
out there. Only 17% of engineers surveyed in VDC Research's 2011 Embedded Market study
had more than 4 years of experience with multiprocessor or multi-core design (2011) and several
of these likely fall into the design pattern case above.
Power Dissipation
Utilizing multiple cores at a lower clock speed but with higher aggregate computational
capabilities can actually dissipate less power than a single core system running at a higher clock
frequency. For example, the power dissipation values for the single and dual core variants of
the Freescale MPC8641 processor are shown below in Figure 23. The dual core configuration
running at 1.0GHz with a core voltage of 0.9V dissipates less power than a single core
configuration running at 1.5GHz with a 1.1V core voltage. In this example, the dual core is
providing a 33% increase in theoretical performance yet it requires around 16W instead of 20W.
Figure 23: Power Consumption Comparison of single and dual core implementations of the Freescale MPC864136
There are two types of power dissipation in processors: static and dynamic. Static power
dissipation originates from current 'leaking' through gates when they're not switching. Dynamic
power originates from gates switching in the device. Dynamic power dissipation in a processor
is a function of switching frequency, the amount of capacitance on the gates, and the voltage.
The equation below is used to estimate dynamic power dissipation where A is the 'activity
factor' or the percentage of gates that are switching; C is capacitance that the gates are driving; V
is the voltage and F is the frequency.
Pdynamic = A · C · V² · F

36 (Freescale Semiconductor, 2009)
As frequency increases, dynamic power increases linearly. However, high clock speeds may require a higher core voltage. Figure 24 below shows an example of the dynamic current used by the Blackfin processor at different clock frequencies and core voltages. As clock speed increases, the core voltage needs to increase and, as a result, power increases faster than linearly with clock speed.
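The dynamic power relationship above can be evaluated numerically to show why a multi-core part at a lower voltage and frequency can undercut a faster single core. The sketch below uses illustrative activity-factor and capacitance values rather than datasheet numbers; only the form of the equation comes from the text.

```c
/* Dynamic power: P = A * C * V^2 * F
 * A = activity factor (fraction of gates switching),
 * C = switched capacitance in farads, V = core voltage, F = clock in Hz. */
double dynamic_power(double a, double c, double v, double f)
{
    return a * c * v * v * f;
}
```

With A and C held equal, one core at 1.5 GHz and 1.1 V scales as 1.1² × 1.5 = 1.815, while two cores at 1.0 GHz and 0.9 V scale as 2 × 0.9² × 1.0 = 1.62, roughly 11% less power despite 33% more aggregate clock cycles, directionally matching the MPC8641 comparison in Figure 23.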
[Table omitted: dynamic current (mA) for the ADSP-BF533 at core clock frequencies from 50 MHz to 600 MHz and core voltages from 0.80 V to about 1.4 V; combinations beyond a voltage's maximum supported frequency are marked N/A.]
Figure 24: Dynamic Current over Frequency for the ADSP-BF533 Embedded Processor37
If frequency is reduced, the core voltage can typically also be reduced which leads to the
power savings shown in Figure 23.
Size / Density / Cost
When compared to the equivalent multiprocessor system, multi-core processors often cost less and take up less PCB real estate.
They cost less because they require less material - a single package and die in the case of a multi-core device, versus multiple packages and multiple dies in a multiprocessor system. For example, the single-core ADSP-BF533 embedded processor has a budgetary price of $14.46 in the BGA package while the dual-core ADSP-BF561 embedded processor has a budgetary price of $20.40 in the BGA package. Two ADSP-BF533 devices will cost $28.92, versus $20.40 for a dual core device38,39.
Multi-core devices can also be smaller than the equivalent multiprocessor system. For example, the TI C6474 tri-core DSP has a 23mm x 23mm package size while the older single-core version, the C6455, has a 24mm x 24mm package size40. Granted, the C6474 is fabricated at a 65nm process node while the C6455 is fabricated at 90nm, which means that the same circuitry will be almost half the size41; however, the PCB area consumed by three C6455 devices is 1728mm² while the PCB area consumed by a single C6474 device is 529mm².42 Three cores in a single package occupy 30.6% of the area of three single-core devices, and this isn't factoring in the supporting components for individual devices like decoupling capacitors.

37 http://www.analog.com/static/imported-files/data sheets/ADSP-BF531 BF532 BF533.pdf
38 http://www.analog.com/en/processors-dsp/blackfin/ADSP-BF533/processors/product.html
39 http://www.analog.com/en/processors-dsp/blackfin/ADSP-BF561/processors/product.html
40 http://www.eetimes.com/electronics-products/embedded-tools/4108275/Multi-core-DSP-offers-3-GHz-performance-at-1-GHz-price
Architectural Factors & Challenges
Despite the benefits, migrating to a multi-core processor is typically not a trivial task. There
are several technical issues related to both the structure of the application, or the manner in
which the application is partitioned across the cores, and the dynamics that emerge from new
kinds of interactions that are possible on multi-core devices that share hardware resources and
run software concurrently.
It's worth noting that when looking at hardware or software in isolation, there are several innovations that appear to be incremental or modular within the scope of that system. However, when taking into account the full hardware/software system, innovation is almost always architectural or radical to some degree. It's almost impossible to change software without changing the nature of its interaction with hardware, as minuscule as that change might be, and vice versa. For example, an incremental innovation in hardware, like an increased processor clock speed, can change the timing of the entire software application, and the software may need to be re-tuned to run optimally.
The following section will explore architectural challenges related to multi-core adoption as
shown in Figure 25. It will first examine the challenges related to the structure of the system
which, from a software perspective, is really how the application is partitioned across the cores.
From here, it will explore challenges related to the dynamics of the system which emerge from
the new types of interaction between software as well as between software and hardware that
become possible in multi-core systems.
41 65nm × 65nm = 4225nm²; 90nm × 90nm = 8100nm²
42 23mm x 23mm x 1 = 529mm²; 24mm x 24mm x 3 = 1728mm²
Figure 25: Multi-core architecture challenges (structure/partitioning, dynamics/interactions, and system optimization & debug)
Structure / Partitioning
One of the biggest questions that programmers confront is how to decompose and rearchitect an application so that it can be effectively spread across multiple processor cores and
deliver increased benefit over existing or alternative implementations.
For applications that don't have a great deal of obvious data or task parallelism (or both),
determining how to segment an algorithm becomes increasingly difficult because the partitioning
boundaries aren't always readily recognizable. Furthermore, if the wrong decomposition
strategy is used, it's possible that the application could actually deliver lower levels of
performance than a single-core implementation. The problem gets increasingly difficult as the
amount of reuse and legacy code increases because, as we'll see, legacy code tends to be more
integral in nature.
"More cores don't necessarilyyield more processingpower, as a matter offact,
adding cores may impairperformance if the resultingdevice is not properly
balanced," - Pekka Varis, CTO of Multi-core and Media Infrastructureat Texas
Instruments43
Decisions made about decomposition and architecture will typically become increasingly expensive to change later in the design, as described in the section entitled Decomposition & Modularity on page 16. What makes this step particularly difficult for multi-core processors is the increased difficulty of predicting behavior and performance, a topic that will be discussed later in this section.
43 http://www.sys-con.com/node/1802536
The difficulty in selecting the right decomposition strategy can vary significantly by application. There exist classes of applications for which the decomposition of the problem into pieces that can run concurrently is more obvious.
The first class of applications can be characterized as having a high degree of data parallelism within the scope of the application task; these are commonly referred to as "embarrassingly parallel" applications. Applications with high degrees of data parallelism can be cleanly partitioned across multiple cores, particularly when the amount of functional interaction required between the cores is low compared to the functional interaction within each core. In an image processing application, for example, a 4-core system could be configured so that each core processes one quarter of the image. The same technique can be applied to video processing. The first generation of Cisco's Telepresence system (shown below in Figure 26) utilized a large number of Blackfin processor cores in a parallel configuration to encode and decode 3 streams of 1080p H.264 video with extremely low latency.44
Figure 26: Cisco Telepresence System45
For these types of applications, Amdahl's Law, named after computer architect Gene Amdahl, provides a theoretical limit to the performance that can be achieved. Amdahl's Law states that the potential speed-up of an application as the number of cores is increased will ultimately be limited by the sequential code in the application which cannot be parallelized. As more cores are added, the execution time for the parallel sections of the code will become negligible but the execution time for the sequential code will remain constant. The equation below provides an estimate of the amount an application can be sped up as a function of P, the percentage of time that the processor is executing parallel code, and N, the number of cores (Amdahl, 1967).
44 More information on the usage of the Blackfin processor in this design can be found in the ADI press release: http://www.analog.com/en/processors-dsp/blackfin/content/cisco telepresence vision becomes reality/fca.html?language dropdown=en
45 Image retrieved from: http://newsroom.cisco.com/dlls/2009/prod 022009.html
Speedup = 1 / ((1 - P) + P/N)

Equation 1: Application speedup as a function of cores and parallelizable code (Amdahl's Law)

For example, if an image processing application consisted of 70% parallel code and it was run on a 4-core machine, the expected speed up would be 2.11x:

Speedup = 1 / ((1 - 0.7) + 0.7/4) = 2.11
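Amdahl's Law is simple enough to capture in a few lines of C; the sketch below (function name invented for illustration) reproduces the worked example.

```c
/* Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
 * p = fraction of execution time that is parallelizable,
 * n = number of cores. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

`amdahl_speedup(0.70, 4)` gives roughly 2.11, matching the worked example above; as n grows the result approaches the 1/(1 - P) ceiling, which for P = 0.7 is about 3.33x no matter how many cores are added.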
Figure 27 below shows the relative increase in performance over a completely serial
algorithm.
Figure 27: Performance scaling of parallel threads46

There is a second class of applications characterized by having several independent tasks which can be assigned to cores and processed simultaneously. Performance for these applications can scale very well with the number of cores because the tasks aren't related to each other: each task runs its own segment of sequential code and no task needs to 'wait' until another completes. Examples of these types of applications would be a web server running on a multi-core processor, processing each of several web requests as they arrive (Sutter, et al., 2005), or a network processor chip processing streams of network packets. In applications similar to these, the percentage of parallel code within the application scales with the available data set or bandwidth. This scaling is known as Gustafson's Law, named after computer scientist John Gustafson (Gustafson, 1988).

46 (Gove, 2011)
Generality vs. Performance Tradeoffs
Generality is a characteristic of software which describes how easily it can be used across
platforms without changes. Clearly, in the domain of multi-core where platform changes can
require significant software changes, the concept of generality is an important one.
Unfortunately, increasing the generality of the software also negatively impacts both the
performance of the software and the productivity of those developing the software (McKenney,
2010). For applications which are pushing the performance curve in the embedded space, a
highly modular and abstracted architecture may not be an option.
Legacy Code / Re-Use
Determining the right partitioning strategy for a multi-core application is difficult to begin
with. The problem gets increasingly difficult as the amount of reuse and legacy code increases.
Alex Bachmutsky, a chief architect at Nokia Siemens Networks, recently commented:
"one of our problems is how to parallelize existing programs with eight or 10 million lines of code -- you can rewrite everything but it will cost a lot of money." (2011)
Modular software can require less work to migrate to a multi-core architecture when its
modular decomposition lends itself well to concurrent execution (which isn't always the case)
(Rajan, et al., 2010). Existing embedded applications however, tend to be more integral by
nature for several reasons.
First, code developed by small software teams tends to be integral in nature due to the fact
that there is high level of communication between team members. Alan MacCormack's research
on Linux and Mozilla suggests that software development occurring at a single location where
engineers can easily communicate, results in more integral designs. A single team can solve
problems face-to-face and can easily coordinate changes to the architecture of the software in
order to improve performance. These types of interactions naturally lead to tighter coupling
between components in the system (MacCormack, et al., 2006).
This research has implications for firms with legacy code that they wish to migrate to a
multi-core processor. Given that most embedded projects have just a handful of software
engineers (UBM/EE Times Group, 2010), this research also suggests that more likely than not,
these existing software designs are more integral in nature because they're developed by a small
number of engineers, a majority of whom are likely co-located. If this is the case, these organizations will face more difficult challenges in performing the migration.
Second, software applications have a tendency to "decay" over time due to changes made to
the system within which the software is operating and also due to incremental feature changes.
This decay manifests itself as a breakdown of modularity, increased effort to make changes,
increased fault potential and increase in the span of required changes (Eick, et al., 2001).
Finally, many embedded applications are written with performance and real-time processing
constraints (UBM/EE-Times, 2011) which may require a less modular design (McKenney, 2010).
Integral applications can be the most difficult to repartition because they are not clearly partitioned to begin with. Embedded software developers may need to perform a substantial amount of work decomposing and re-architecting the software. In some cases, it may be easier to start over rather than trying to rework the existing legacy code.
A majority of embedded software still targets single-core processors (UBM/EE Times Group,
2010). This implies that a lot of existing legacy code was originally written for a single-core
processor. As we will discuss in the next section, software architected for a single core may
encounter new failure modes in a multi-core system and this code may need to be reworked to
run reliably on a multi-core architecture.
If the code was sourced externally, the firm may have access only to the compiled object
code and not the source code which makes porting and debugging that code more difficult,
particularly when the firm who developed the code is unable or unwilling to support or update
the code.
If there is existing code that has not been reworked for a multi-core processor, large-scale locks may need to be placed in the system around these sections of existing code to protect against these new failure modes.
against these new failure modes. A prime example of such a situation can be seen in the history
of the Linux operating system. When SMP support was first added to the Linux kernel, there
were several pieces of existing code that were subject to new and unfamiliar types of failure
modes not previously possible on a single core processor. To protect against these, a Big Kernel
Lock (BKL) was implemented as a crude spin lock to ensure only one core was running in kernel
mode and thus halt any concurrent operations in other cores that could lead to these new
concurrent failure modes. Subsequently, several modules within the Linux kernel have been
modified to support finer grained locking and the BKL is used to support old code (Bovet, et al.,
2006).
Dynamics / Interactions
This section will explore some of the new challenges related to the types of interactions that become possible in multi-core systems.
Data Races
A data race is one of the most common bugs found in applications running on multi-core processors (Gove, 2011). Data races occur when tasks running on two different cores attempt to modify the same piece of data in memory at the same time. In a threaded application running on a single core, truly simultaneous access isn't possible because only one thread is accessing memory at any instant. However, on a multi-core system, it is now possible for two threads or tasks to execute simultaneously, and thus it's possible for them to simultaneously access memory and other system resources.
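The classic form of this bug is an unprotected read-modify-write of a shared counter. The hedged sketch below (names and iteration count invented for illustration) shows both the racy version, where concurrent increments can be lost, and the conventional mutex fix that serializes the update.

```c
#include <pthread.h>

#define INCREMENTS 100000

static long counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *bump_unsafe(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        counter++;              /* unprotected read-modify-write: updates
                                   from the other thread can be lost */
    return NULL;
}

static void *bump_safe(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);
        counter++;              /* the mutex serializes the update */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Run two threads and return the final count. With use_mutex != 0 the
 * result is always 2 * INCREMENTS; without it, updates may be lost. */
long run_counter(int use_mutex)
{
    pthread_t t0, t1;
    void *(*fn)(void *) = use_mutex ? bump_safe : bump_unsafe;
    counter = 0;
    pthread_create(&t0, NULL, fn, NULL);
    pthread_create(&t1, NULL, fn, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return counter;
}
```

On a single core the racy version may appear to work for long stretches, which is exactly why these bugs surface only after migration to multi-core hardware.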
Shared Resources
Consider a situation in which tasks running on the various cores of a multi-core processor all
require access to external memory, but the memory interface and/or bus structure of the chip has
limited bandwidth. As more cores are added, the tasks running on those cores will need to wait
longer for external memory accesses to complete, and thus the overall system performance
wouldn't scale linearly as more cores were added.
Studies have shown that memory bandwidth and the memory management schemes are
perhaps the strongest limiting factors in the performance of multi-core systems. The data shown
in Figure 28 is from a study that shows how multi-core performance begins to degrade after 8
cores and by 16 cores, the performance is on par again with a dual core device (Sandia Labs,
2009).
Figure 28: Multi-core performance vs. number of cores47
The example below demonstrates a similar phenomenon. Three processor cores are running independent threads that share a common memory bus and cache. As the number of cores increases, the performance of each core decreases due to contention for the shared resources. In this case, cache thrashing and memory pipelining slow down the execution rates of the threads on each core (Uhrig, 2009).
Figure 29: Performance impact as a function of cores (2-3 threads per core)48
System developers not only need to understand how to partition their application across
several cores, but they also need to think about memory performance. Bus and memory
bottlenecks like the ones described above can have adverse impacts on application performance
if not understood and accommodated in the overall system design.
47 (Sandia Labs, 2009)
48 (Uhrig, 2009)
Synchronization and Inter-Core Communication
The amount of interaction between code modules running on different cores increases the
challenges of partitioning the application for several reasons. First, the cost of interactions
between cores in multi-core systems is typically higher because of the performance of shared
memory and the requirements for synchronization (McKenney, 2010). As we increase the
number of threads in a system that are working together on the same problem, the number of
synchronization events between these threads will likely increase as well. The cost of
synchronization can begin eliminating performance gains from increasing the number of threads
as shown in Figure 30 (Gove, 2011).
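The curve in Figure 30 can be approximated by extending Amdahl's law with a per-thread synchronization term (the 95% parallel fraction and the sync_cost value below are illustrative assumptions, not figures from the source):

```python
def speedup(n, parallel=0.95, sync_cost=0.005):
    """Amdahl's law with a synchronization penalty that grows with
    the thread count. parallel is the parallelizable fraction of the
    work; sync_cost is an assumed per-thread overhead."""
    serial = 1.0 - parallel
    return 1.0 / (serial + parallel / n + sync_cost * n)
```

In this model, speedup improves up to roughly sqrt(parallel / sync_cost), about 14 threads here, and then declines as the synchronization term dominates — the behavior the figure exaggerates.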
[Figure: speedup relative to the serial case vs. number of threads]
Figure 30: Thread scaling with exaggerated synchronization overhead49
Cache Thrashing
Another problem is cache thrashing. Cache thrashing can occur when the data that the core
is operating on is larger than the cache, so data must be regularly loaded from slower L2 or
external memory, as shown in Figure 14 on page 44. If the operating system migrates a thread
from one core to another, the thread loses its data locality: the instructions and data
related to that thread remain cached in the other core (QNX, 2010). Markus Levy of
the Multicore Association recently noted, "Some of the telecoms OEMs are really struggling
because they are finding in the shift from using two single-core chips to one multi-core chip
performance is going down. That's because they now have to share resources like caches."50
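A toy LRU-cache simulation can make the thrashing threshold visible (the cache size and access patterns below are invented for illustration): once a thread's working set, or the combined working sets of threads sharing the cache, exceeds capacity, the miss rate jumps.

```python
from collections import OrderedDict

def lru_miss_rate(capacity, accesses):
    """Fraction of accesses that miss in an LRU cache of `capacity` lines."""
    cache = OrderedDict()
    misses = 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)        # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least-recently-used line
            cache[line] = None
    return misses / len(accesses)

def sweeps(working_set, repeats):
    """Repeated sequential passes over `working_set` distinct lines."""
    return [line for _ in range(repeats) for line in range(working_set)]

small = lru_miss_rate(128, sweeps(64, 10))   # fits: misses only on first pass
large = lru_miss_rate(128, sweeps(256, 10))  # exceeds cache: every access misses
```

Sequential sweeps are a worst case for LRU: with a working set twice the cache size, not a single access hits — the effective halving of cache capacity is the situation Levy describes when two cores share one cache.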
49 (Gove, 2011)
50 http://www.eetimes.com/electronics-news/4076675/Groups-debut-multi-core-API-benchmarks
There are several other technical challenges that are emergent in nature, but this section has
provided a summary of the key ones.
System Optimization and Debug
Because multi-core systems can be less predictable than single-core designs, system
optimization is often an essential step of a multi-core design. Furthermore, system debug is a
notoriously painful process in embedded system design, and for multi-core processors it can be
even more difficult: the tools aren't as mature, interoperability between the tools is weak,
and the types of bugs that emerge from interactions can be especially tenacious and difficult to
track down.
Predictability
The static analysis techniques that are used in single-core designs, particularly for safety-critical systems with hard real-time constraints, are significantly more difficult on multi-core
devices and in some cases, are infeasible. The difficulty originates from the new forms of
interactions across shared resources. As several cores access a shared resource, it's extremely
difficult to capture all potential sequences of accesses to that resource. Each access could change
the state of the resource and different sequences of access may take different amounts of time to
complete (Cullmann, et al., 2010).
The unpredictable nature of multi-core design is challenging at the system level because
unanticipated interactions can degrade system performance to a point where a single-core device
offers more performance than several cores in a multi-core device, as we saw in the previous two
sections. However, for embedded systems with real time requirements, the inability to guarantee
that certain operations will always complete within a set amount of time may be unacceptable.
System Tuning and Debug
System tuning and debug are already time consuming tasks in an embedded design and only
get more difficult with multi-core. As noted above, there is a certain level of unpredictability
inherent in multi-core design due to the complexity of possible interactions over shared resources.
Tuning strategies tend to be coupled to the architecture. For example, consider a situation
in which two cores are working together on a processing task. If it takes 100 cycles to synchronize
with the other core (a function of the processor architecture and the synchronization
mechanisms), it may be possible to pipeline the algorithm such that it performs computations
for those 100 cycles and thus wastes no processing time. If this application were ported to
another dual-core architecture that had different memory performance and clock speed, and the
number of clock cycles to perform synchronization changed, the application would no longer be
optimally pipelined and would spend some time waiting for either the synchronization or the
compute operation to complete.
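The coupling can be sketched as a two-stage pipeline timing model (the cycle counts are the hypothetical ones from the example above; the model ignores everything beyond one fill and one drain stage):

```python
def pipelined_cycles(blocks, compute, sync):
    """Total cycles when each block's synchronization is overlapped with
    the next block's computation (a simple two-stage pipeline)."""
    stage = max(compute, sync)                    # steady-state cost per block
    return compute + (blocks - 1) * stage + sync  # fill + steady state + drain

def serial_cycles(blocks, compute, sync):
    """Total cycles with no overlap: compute, then wait, for every block."""
    return blocks * (compute + sync)

balanced = pipelined_cycles(10, 100, 100)  # tuned: sync fully hidden
ported = pipelined_cycles(10, 100, 150)    # new architecture: sync now stalls
```

On the original architecture the overlap nearly halves the runtime (1100 vs. 2000 cycles for ten blocks); after a port that raises synchronization to 150 cycles, the same partitioning spends 500 extra cycles stalled.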
Tracking down the sources of race conditions and deadlocks on multi-core processors can be
very difficult because these events may be sporadic and, when they do occur, are dependent on
the state of several processors, not just one. Different problem solving techniques are often
needed to effectively debug these new types of problems.
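A minimal sketch of this bug class and its conventional fix, using Python's standard threading module (the thread and iteration counts are arbitrary): the unlocked read-modify-write can lose increments depending on scheduling — exactly the sporadic, state-dependent behavior described above — while the mutex-guarded version is deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_add(n):
    """Read-modify-write with no lock: another thread can run between
    the read and the write, so increments may be lost (sporadically)."""
    global counter
    for _ in range(n):
        value = counter
        counter = value + 1

def safe_add(n):
    """The same increment guarded by a mutex: always deterministic."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=safe_add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is now exactly 400_000; with unsafe_add it may be less, and
# whether updates are lost depends on scheduling -- which is what makes
# these bugs so hard to reproduce and debug.
```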
Diversity and Quality of Tools and Methodologies
The most time-consuming part of embedded design is testing and debugging the system.
If embedded engineers could improve one thing about an embedded design, they would likely
improve the debugging tools (UBM/EE-Times, 2011). As noted earlier, debugging even single-core
designs is challenging, and because there is a great deal of variety in the debug features of
processors and development tools, a design team typically develops a set of problem solving
techniques around the functionality of those tools.
There are several new challenges in developing tools for multi-core system development and
this is reflected in the state of tools presently available for embedded multi-core processors.
In the general-purpose space, which has a small number of processor architectures and
operating systems, tools have become fairly sophisticated. Windows developers have access to
a broad range of tools, libraries and resources provided by Microsoft, Intel and several third
parties for authoring Windows applications to run on multi-core processors. Microsoft's Visual
Studio 2010 ships with a parallel performance tuning tool51, and Intel has its Parallel Composer
that works with Visual Studio to provide compilers, libraries and debugger support for multi-core
processors52.
In the embedded space, different suppliers employ different strategies to provide tools and
the landscape is far more fragmented. Some suppliers leverage open-source tool chains like
51 http://www.microsoft.com/downloads/en/details.aspx?FamilyID=8FFC2984-AO5C-4377-8C699A8BOD2B5D15
52 http://www.informationweek.com/news/hardware/processors/217700703
GCC and debug environments like GDB/Eclipse while other suppliers develop and maintain
proprietary tool chains. This has a few important implications.
Just like processor architecture, tool chains need to be learned and understood. Different
tools have different features that allow users to solve problems in different ways. Therefore,
there is some degree of tacit knowledge that is built around not only processor architecture but
also the corresponding tool chain. Migrating to a new processor that is similar in architecture but
different in tools support may require a similar amount of effort to learn both how to code for
performance and how to debug system issues with different types of debug features and capabilities.
In a recent panel discussion as part of the EE Times "Approaching Multi-core" virtual
conference, several industry pundits commented on the state of multi-core tools. Because of the
unpredictable nature of applications running on multi-core processors, simulation and modeling
tools have become increasingly important to help guide partitioning decisions and predict
performance. Tasneem Brutch53 recently noted that there are several new types of debugging,
analysis, development, profiling and auto-tuning tools that are used for multi-core design and
that almost every multi-core application requires a unique configuration of tools. There is also
very little interoperability between these tools which makes analysis difficult (2010). Tom
Starnes54 noted that accurately simulating and modeling the performance of multi-core systems is
very difficult, particularly because the interactions between the cores become very complicated.
In addition to simulating the cores themselves, simulator tools need to be able to accurately
model caches, cross-bars, memory, I/O, and interrupts as well (2010).
Adoption Scenarios
This section will examine four different adoption scenarios. We will explore the nature of
the architectural change to determine architectural factors that can impede or expedite adoption
of multi-core devices. The following section will then explore additional factors and
mechanisms which affect adoption in the adopting organization, the surrounding ecosystem and
beyond.
53 Tasneem Brutch is a senior staff engineer at Samsung Research and also set up and chaired the Tool
Infrastructure Work Group in the Multi-core Association
54 Tom Starnes is a Processor Analyst at Objective Analysis
Current Multi-core Adoption Patterns
Multiprocessor designs are common in embedded systems. According to the EE Times 2010
Embedded Market Study, only 50% of projects in 2010 used a single processor. The average
number of processors in a project in 2010 was 2.6, and this number hasn't changed dramatically
in the last 4 years (UBM/EE Times Group, 2010). Because the functions of an embedded system
are well understood at design time, various computing tasks can be placed on different types of
embedded processors suited to each task. Despite the benefits of multi-core devices
discussed earlier, the transition to multi-core has been happening slowly across the industry. In
2010, 15% of designs using multiple cores used a multi-core device, up from 9% in 2007
(UBM/EE Times Group, 2010). In some areas, however, adoption is rapidly outpacing this
industry-wide increase of 6 percentage points.
Case 1: Existing Multiprocessor Hardware Design Pattern
The first design scenario is porting a multiprocessor design to a multi-core processor that
mirrors the existing multiprocessor system architecture. In this scenario, the class of processors
within the multi-core device, and the means by which they communicate with each other and
the rest of the system, remain mostly intact. We essentially have an existing architectural pattern
that is being migrated to the multi-core design.
As we will see, this scenario represents one of the easier cases because most of the
knowledge assets associated with the architecture can be preserved. However, the
degree of difficulty is heavily impacted by the extent to which new types of interactions are
possible and occurring.
This scenario can be commonly seen in two forms.
The first form is an existing system utilizing a heterogeneous collection of discrete
processors that is migrated to a heterogeneous multi-core device, as shown in Figure 31 below.
Figure 31: Case 1, heterogeneous architecture
Page 67 of 106
A common application example of this scenario can be seen in the DSP + microcontroller
architectural pattern that, from the author's professional experience, has become a common
system architecture over the last several decades. This pattern can be seen in several applications
such as handsets, surveillance cameras, and consumer multimedia products. In this system
architecture, a microprocessor (typically an ARM) is running an operating system, managing
communication and running applications. The DSP is typically running computationally
intensive algorithms specific to the application. In a handset, the DSP is running the digital
modulation and demodulation algorithms. In an IP camera, the DSP is running video encoding
and analytics algorithms. In the last decade, a class of heterogeneous processors has emerged
which contains both the DSP and the microprocessor like the Texas Instruments OMAP family
of devices55.
In the second form, an existing system utilizing a homogeneous collection of discrete
processors is migrated to a multi-core device, as shown in Figure 32 below.
Figure 32: Case 1, homogeneous architecture
Homogeneous arrays of discrete processors are commonly used in performance-driven
applications where the processing is relatively homogeneous in nature and can be divided
between a homogeneous collection of processors. A common application example of this
configuration can be seen in wireless communications applications where several DSPs may be
required for baseband processing. The Texas Instruments TMS320C6474 tri-core DSP is
optimized for communications processing and utilizes three C64x+ DSP cores. It is designed to
be a multi-core migration path for multiprocessor designs built around the single-core
TMS320C6455, which also utilizes the C64x+ DSP core56.
55 Texas Instruments OMAP Processors:
http://focus.ti.com/general/docs/wtbu/wtbugencontent.tsp?templateId=6123&navigationId=11988&contentId=4638
56 TMS320C6455 to TMS320C6474 Migration Guide - http://focus.ti.com/lit/an/spraav8/spraav8.pdf
New Interactions
In some cases, multiple processor cores are integrated into a multi-core device but the
manner in which they connect to each other and the surrounding devices is almost identical to the
multiprocessor architecture.
Figure 33: Case 1, no new resource sharing
An example of this type of architecture can be seen in the ADSP-14060 DSP, which is a
multichip module containing four ADSP-21060 processors. The processors are connected within
the ADSP-14060 exactly as they would be in a multiprocessing configuration utilizing four
ADSP-21060 devices. In the author's experience, migrating an existing software architecture
built for the equivalent ADSP-21060 multiprocessor system to this multi-core device is typically
a trivial process.
Figure 34: ADSP-14060 Quad-processor DSP57
If the multi-core device does not introduce any new possible forms of interaction between the
existing system software and the hardware, the architectural change required is minimal and the
migration from a multiprocessor to a multi-core design can happen very quickly.
" http://www.analog.com/static/imported-files/data sheets/ADD14060 1 4060L.pdif
Most multi-core processors, however, do utilize shared resources, and this can introduce
new types of interaction between the cores that weren't possible in the previous
multiprocessor system architecture.
Figure 35: Case 1, new resource sharing
The tri-core TI C6474 is designed to replace three single-core C6455 DSPs. The three cores,
however, now need to share several resources which wouldn't have been shared in a
multiprocessor architecture. As shown in Figure 36 below, all three cores now share a common
memory bus. The external memory bus on the tri-core device is slightly faster, running at
667MHz versus 533MHz on the single-core device, but it is now shared by three cores. The
L2 cache is also smaller per core: the single-core C6455 has 2MB of L2 cache, while the C6474 has 3MB of L2 cache shared by three cores58.
Figure 36: C6474 Block Diagram59
An existing system design that was based on three C6455 DSPs may behave very differently
once ported to a single C6474 device because of the increased sharing and contention. All three
cores are sharing a common memory bus and each core, on average, has half the amount of L2
cache. If the existing applications running on the C6455s were not memory intensive, the rate at
which the software interacts with these shared memory resources will be lower and may not
significantly change the behavior of the system. However, if the C6455s were running memory-intensive applications, the level of interaction with the shared resources will be high. Contention
on the external memory bus and the smaller L2 cache may have more profound effects on the
system performance.
58 TMS320C6455 to TMS320C6474 Migration Guide - http://focus.ti.com/lit/an/spraav8/spraav8.pdf
59 http://www.eetimes.com/electronics-products/embedded-tools/4108275/Multi-core-DSP-offers-3-GHz-performance-at-1-GHz-price
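A back-of-the-envelope comparison of the per-core memory budget follows from the numbers above (this sketch assumes equal bus widths and an even three-way split, and ignores arbitration and burst effects):

```python
single_core_bus_mhz = 533.0   # C6455 external memory bus
shared_bus_mhz = 667.0        # C6474 external memory bus, shared by 3 cores
cores = 3

per_core_mhz = shared_bus_mhz / cores        # ~222 "MHz-equivalents" per core
ratio = per_core_mhz / single_core_bus_mhz   # each core gets ~42% of the
                                             # bandwidth a C6455 had alone
```

So even with the faster bus, a memory-bound task that kept a C6455's bus saturated would see well under half its former bandwidth on the C6474.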
The difficulty of adopting multi-core devices in this scenario is heavily impacted by the
nature of the new interactions that are possible in the new multi-core architecture and the frequency
with which these interactions occur. System performance may degrade as a result of these
interactions, and the amount of architectural change required to mitigate these
degradations will increase as the rate of interactions increases.
Examining this transition through Kim & Henderson's framework on innovation, a great deal
of existing component and architectural knowledge assets are preserved through the migration.
At the processor level, this innovation could be characterized as 'incremental' if no new
interactions emerged or 'architectural' if new interactions emerged that required new knowledge
to address. Shared resources allow new interactions between components of the system that
previously did not directly interact, meaning that existing architectural knowledge may be
destroyed and new architectural knowledge of these interactions may be required to successfully
complete a design.
In summary, migrating to a multi-core processor whose architecture mirrors the existing
system architecture, as discussed in the first scenario, is one of the easier cases because
oftentimes the firm's existing knowledge assets can be preserved. However, as new resource-sharing
requirements are placed on the software running on each core, the shift becomes more
architectural in nature, which can require new architectural knowledge and problem solving
techniques.
Case 2: Existing Software Design Pattern
The second migration scenario is a transition from a single core to a multi-core SMP
architecture where existing applications are built on top of a POSIX-compliant multi-threaded
operating system with SMP support.
Figure 37: Case 2, Symmetric Multiprocessing
The combination of POSIX compliance and SMP in an operating system provides an
abstraction layer that, in theory, allows a series of applications developed against this software
design pattern to be migrated to a multi-core processor without fundamentally changing the
architecture as shown in Figure 38. The operating system or the user can allocate threads and
processes to different cores and the system runs in a similar manner to a multi-threaded
application on a single-core processor.
[Figure: POSIX-compliant applications over a POSIX-compliant operating system on a single CPU, migrated to the same applications over a POSIX-compliant operating system with SMP support on two CPUs sharing a system interconnect, memory controller and I/O]
Figure 38: Migration from single-core to dual-core with SMP & POSIX60
In practice, there are several subtle issues, noted earlier in the section on Dynamics /
Interactions, that can dramatically impact the ease of migration and the number of
architectural changes required for the system to run reliably and at a higher level of performance
than the single-core implementation. However, these issues can be mitigated by keeping
processes and threads pinned to individual cores. This is known as 'process affinity' and
'thread affinity' (Gove, 2011) (QNX, 2010). While this prevents the operating system from
allocating threads to different cores for load-balancing purposes, it does preserve the single-core
execution model for each process. This can be used as a stepping-stone to getting an existing
application running in full SMP mode on a multi-core processor.
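On Linux, for example, affinity can be set from user space through the sched_setaffinity(2) system call, which Python exposes directly (a Linux-specific sketch; the core number is arbitrary and the call is unavailable on other platforms):

```python
import os

def pin_to_core(core):
    """Pin the calling process to one core and return the resulting mask.
    Linux-specific: wraps sched_setaffinity(2) / sched_getaffinity(2)."""
    os.sched_setaffinity(0, {core})   # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

original_mask = os.sched_getaffinity(0)  # remember the starting mask
pinned_mask = pin_to_core(0)             # core 0 exists on any machine
os.sched_setaffinity(0, original_mask)   # undo the pinning afterwards
```

While pinned, the scheduler will not migrate the process's threads to another core, preserving the single-core execution model described above.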
A segment where this type of transition has been happening rapidly is the smart phone and
tablet market. In 2011, 15% of smart phones are expected to ship with a multi-core processor,
and this number is expected to rise to 45% by 2015 (Strategy Analytics, 2011). Many of the
major tablet manufacturers are also moving to multi-core in 2011. The Apple iPad 2 utilizes the
Apple A5 dual-core processor; LG, Samsung and Motorola are developing tablets around
60 Adapted from a diagram in http://www.iqmagazineonline.com/IQ/IQ17/LowRes.pdfs/Pgs34-37
NVidia's dual-core Tegra 2; the Research in Motion (RIM) PlayBook utilizes the Texas
Instruments dual-core, Cortex-A9 based OMAP 4430 (EE-Times, 2011).
The operating system on Apple's iPad 2 is iOS, which is an SMP, POSIX-compliant
operating system61. RIM's PlayBook runs QNX, which is an SMP, POSIX-compliant operating
system62. The Xoom from Motorola runs Android, which is built upon Linux, a mostly POSIX-compliant
SMP OS63. ARM and ST-Ericsson have been working to improve SMP support in
the Android kernel (ARM, 2010).
Most of these tablet manufacturers haven't fully opened up multi-core programming to the
application developer community and are taking an incremental approach. For example, Android
applications are written against Dalvik, which is the process virtual machine in Android. Dalvik
processes currently run on only one core of a multi-core system which, like process affinity,
helps protect against certain software-software and software-hardware interactions that can
hinder performance and reliability (Moynihan, 2011).
The lack of determinism in multithreaded applications can be compounded by resource
sharing and cache thrashing. Developing applications with real-time requirements against the
POSIX/SMP design patterns can be significantly more difficult in a multi-core situation
because of the unpredictable nature of the new interactions (Cullmann, et al., 2010). Similarly,
if the application is performance-driven, pieces of the system may need to be rearchitected to
fine-tune the synchronization mechanisms, cache utilization and shared-resource utilization.
In summary, following the multi-threaded, POSIX-compliant application design patterns may
allow for the preservation of architecture and architectural knowledge. There are still several
challenges, noted in the previous sections, that can be encountered when moving from a
multi-threaded software architecture and that require architectural changes and new problem solving
techniques. However, through thread- and process-affinity techniques, a developer can
take incremental steps toward a full SMP system by manually pinning processes and threads to
individual cores, reducing the number of possible interactions.
61 http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Multithreading/CreatingThreads/CreatingThreads.html
62 http://www.engadget.com/2010/09/27/rim-introduces-playbook-the-blackberry-tablet/
63 http://www.motorola.com/staticfiles/Consumers/xoom-android-tablet/us-en/overview.html
Case 3: No Existing Patterns with Legacy Code
The third design scenario is one in which no existing design pattern exists. In this situation,
an existing single-core design is being migrated to either a homogeneous or heterogeneous
multi-core processor.
Figure 39: Case 3, Migration to homogeneous and heterogeneous scenarios
Unlike case 1 and case 2, there are no existing design patterns (hardware or software) that
can be used to facilitate migration to this platform. Furthermore, this scenario involves legacy
code which, as noted earlier, may be more integral in nature and more difficult to partition.
An example of this scenario can be seen in performance-driven applications where single-core
processors no longer offer enough processing power and the only viable solution is to add
more cores. The existing application needs to be re-partitioned and re-architected so it can be
distributed across the cores. From there, there are several challenges with respect to the dynamics
of the application, as noted above.
If the existing single-core application is modular and the decomposition aligns with a viable
partitioning strategy, the level of architectural changes may be reduced. However, in most
circumstances, this type of migration affects both component and architectural knowledge and
thus is radical in nature.
Adoption in these scenarios will be costly, adoption rates will be slow and penetration will
lag, as discussed in the Legacy Code / Re-Use section above, and the cost and technical difficulty
of adoption may offset any positive benefit that a multi-core architecture could offer.
Case 4: No Existing Patterns with New Code
Case 4 is very similar to Case 3 but we are dealing with a new software architecture that can
be architected to reflect the processor architecture it will run on.
Figure 40: Case 4, New development to homogeneous SMP, homogeneous AMP or heterogeneous MC
The existence of legacy code can impact the difficulty of adoption, particularly if that legacy
code hasn't been developed to run in a multi-core or multithreaded environment. Furthermore, if
the legacy code was sourced externally, it may be difficult to understand the types of interactions
possible without access to the source code, which adds risk to the project.
Summary of Adoption Scenarios
If an existing architectural design pattern exists, either in the form of hardware and software
as we explored in case 1, or in the form of software over an SMP abstraction layer as we
explored in case 2, several elements of the architecture can be preserved in the migration from
single core to multi-core, which also means that existing knowledge assets and problem solving
techniques can be applied as well. There are factors in each scenario that we explored which can
impact the degree of new architectural knowledge required for a successful transition
(system functionality and increased performance on the desired parameter); however, the scale of
the new architectural knowledge is small compared to cases 3 and 4. As such, we have observed
that the transition to multi-core in these first two cases has been easier for industry, and adoption
can happen broadly across one generation of products, as we saw in tablets.
If no architectural pattern exists, it is possible that a new architecture will be required and
with it, new architectural knowledge and problem solving skills. Modularity in legacy code that
aligns with partitioning strategies can reduce the extent to which the architecture needs to change.
Figure 41 on the next page provides a summary of the four adoption scenarios covered in this
section.
[Figure: 2x2 matrix of adoption scenarios organized by type of design (new vs. existing) against hardware and software architecture (new architecture vs. existing architecture/pattern)]
Figure 41: Multi-core Adoption Scenarios
8. System Level Factors and Challenges
In the previous section, we examined technical and architectural factors that can significantly
impact the difficulty of migrating to a multi-core processor. We explored several scenarios
which varied in degree of difficulty, ultimately impacted by the availability of
existing design patterns, the existence of new types of interactions that were not present in the
existing design, the partitionability of the application, legacy code, and code sourcing strategies.
This next section explores several non-technical factors that impact multi-core adoption. It
examines the organizational and managerial factors and dynamics within a firm adopting a
multi-core architecture. From there, it examines the factors and dynamics surrounding the firm, such
as the suppliers and value chain. Finally, it examines factors and dynamics related to human
cognition and behavior that affect adoption.
Figure 42: Layers of Adoption Factors & Dynamics
Factors and Dynamics within an Adopting Firm
Beyond the technical and architectural elements, there are several other factors that may
influence a firm looking to adopt an embedded multi-core processor.
Diversity of Platforms & Design Space Mobility
Processors have certain characteristics that can be used to differentiate them from other
processors within a given product class. For example, the operating frequency and instruction-set
architecture of an embedded processor are two key characteristics that can be used to
delineate several subclasses of processors. Other defining characteristics of embedded
processors include the memory size, the memory architecture (cache, shared memory,
performance), and the I/O peripherals.
Baldwin and Clark define an artifact's 'design space' as the total number of designs that can
be constructed using all possible combinations along these dimensions (2000). An important
characteristic of single core embedded processors is that while they have a relatively large design
space, the process of moving an application written in C from one processor to another, generally
allows the architecture of the software to be preserved, particularly when moving to a processor
with more memory and performance. A certain amount of tuning is typically necessary but
applications don't necessarily need to be rewritten from scratch in this scenario. While the
design space for single-core embedded processors is quite large, migrating between locations
within the design space doesn't necessarily require rearchitecting the software.
With multi-core processors, a few additional key dimensions are added to the design space:
the number of cores, the types of cores and the means by which the cores communicate.
A software architecture that has been optimized and partitioned for two cores, for example,
will likely have a different architecture if the application were split across four cores.
When all of the cores are identical, migrating to a multi-core device with a different type of
processor but similar fundamental capabilities (like ARM to MIPS) may not require a great
deal of rearchitecting. However, migrating to another multi-core device which uses a heterogeneous
collection of cores with differing capabilities may require a fundamental rearchitecting of the
system, because the performance of the software may change more dramatically on a different
ISA.
Finally, different memory structures and latencies may mean that a certain partitioning and
tuning strategy for an N-core architecture may run very differently on another N-core
architecture with similar core types.
Much of the system may need to be rearchitected when migrating to a multi-core device.
More importantly, it may need to be rearchitected again when moving to a device with a different
number of cores. In the past, embedded developers could move around the existing design
space by keeping their applications in C and typically making modular enhancements to their
applications. With multi-core, migrating to new nodes in the design space will typically be more
architectural and radical in nature.
Platforming Limitations
Consider an end-product platforming strategy based on single-core devices. A low-end
product could be based on a slower processor running a subset of features; a more expensive
product could be based on a faster processor with more software features. Because of the
decreased mobility in the design space, adding more cores is significantly less trivial. A
low-end product based on a dual-core processor and a high-end product based on a four-core
processor without using SMP will likely require two different software architectures. If SMP is
used, it's possible that the performance increase gained from moving from two cores to four
cores is not significant, due to the scaling limitations of SMP noted earlier.
Platform Flexibility, Coupling & Second Sourcing
Having a viable second source for components within a system design is a commonly used
risk mitigation strategy in the electronics industry. A product's success is dependent on the
ability of the suppliers to deliver working components on time and at the agreed upon price. If a
supplier fails in any of these regards, it can jeopardize the entire product.
Second sourcing is typically easier for simpler, more standard components like resistors and
capacitors because there are several suppliers who make interchangeable products. However,
second sourcing gets increasingly difficult for custom-designed and complex components that
may be unique to a single supplier.
Processors are particularly challenging in this regard because several other forms of coupling exist. The software developed for a processor may also require a custom architecture if the processor's design is a departure from the more dominant design patterns, as with a many-core processor. In addition, several knowledge assets within the firm are coupled to a processor: engineers have learned the architecture and, more importantly, they've learned the design tools, the ecosystem partners and how to solve problems. Switching to another processor vendor who offers a similar processor architecture but different tools and a different third-party network can still incur a significant switching cost.
Migrating between the myriad new ARM multi-core SMP processors can be almost trivial if you're running an SMP OS like Linux, where most of the ARM tools and ecosystem are available regardless of vendor. However, migrating between other multi-core processors that don't have a similar pervasiveness is far more difficult. The cost of migrating an architecture and a firm's knowledge assets to such a platform will be high and, more importantly, the costs of then switching to a different architecture may be equally high.
Platform Evaluatability and Learnability
An adopting firm will likely want to assess the performance of a new processing platform
before they commit to adoption. This process typically involves taking existing algorithms,
porting them to the new processor platform and evaluating the execution speed and efficiency.
The difficulty of this process for the adopting firm can significantly impact the likelihood of
adoption.
In 2007, the author was visiting a large video surveillance company that was evaluating
digital signal processors for a next generation video analytics platform that could intelligently
identify objects and people. The firm had a large amount of existing C code that was
implemented as a single-threaded application and was highly integral. The firm was evaluating a single-core DSP from one firm and a dual-core DSP from our firm. The processors had similar computational capabilities, but the key difference was that the competing firm offered these capabilities on a single core, whereas we offered two cores, each with half the computational capability, that, when summed, were equivalent to the single-core DSP. The firm wanted to evaluate the performance of their existing application on both processors. The evaluation report that came back weeks later showed that the dual-core processor had achieved half the performance of the single-core processor, because the firm evaluating the platforms was not willing to reengineer their code to run on two cores just to evaluate a device. It was far easier for them to use the single-core DSP from the competing firm because they were able to quickly assess the type of performance that could be achieved with this platform.
Radical architectures that can still provide a gentle evaluation slope, allowing adopting firms to make short-term progress and incrementally approach full performance, have a key advantage, since adopting firms may not be willing to invest a great deal of time climbing a cliff before they can begin to assess the likely performance that the platform will offer.
Hiring Software Engineers who know Multi-core
When it comes to multi-core development, most embedded engineers are new to the domain.
VDC Research's 2010 Embedded Engineering Survey found that slightly more than half of
embedded engineers have some level of experience with multi-core or multiprocessor designs as
shown in Figure 43. And half of those with experience have less than 2 years of experience. A majority of those are developing systems that have historically used multi-core or multiprocessor devices, such as networking, mobile phones, consumer electronics, and telecom and data communications systems (VDC Research, 2011).
Figure 43: WW Respondents Working with/Programming Multi-core and/or Multiprocessor Designs (more than 4 years: 17%; 1-2 years: 8%; 6-12 months: 8%; 6 months or less: 12%)
A firm embarking on a new multi-core design needs to decide whether to bring in outside help in the form of contractors or new hires who have the right experience level, or to essentially learn as it goes.
Non-Deterministic Development Cycles & Costs
Because the performance of multi-core systems is unpredictable (as discussed earlier in the Task Parallelism section) and the new types of bugs that emerge can be particularly tenacious, software development cycles around multi-core processors can be unpredictable themselves, particularly for firms which haven't developed the knowledge and problem-solving skills around the technology. A VxWorks whitepaper cites a VDC research report finding that 25% of multi-core projects miss deadlines and functional targets by at least 50%.6 Multi-core technology may not be compatible with tight product deadlines, particularly when the development team is new to multi-core.
Technology Tradeoffs & Alternatives
If migrating to a multi-core processor is going to both destroy and require the build-up of several knowledge assets within a firm, the firm may take a longer and deeper look at alternative technologies like field-programmable gate arrays (FPGAs) and hardware acceleration.
6 http://www.mistralsolutions.com/newsletter/ianl1/servefile.pdf
As Jeff Bier of BDTI noted:
"If system companies are faced with significant switching costs due to their chip vendors switching processor architectures, it's likely they're going to take the opportunity to see what competing chip companies have to offer."65
There are new classes of FPGAs emerging, for example, with cost structures that will be extremely competitive. The Xilinx Zynq-7000 'Extensible Processing Platform' features a dual-core ARM Cortex-A9 surrounded by 235K logic gates, as shown in Figure 44. This logic can be configured to run the computationally intensive tasks while control code runs within an SMP OS on the ARM Cortex-A9s, with the device selling for less than $15 at high volume.66
Figure 44: Xilinx Zynq-7000 Extensible Processing Platform
Processors and FPGAs are designed to solve different types of problems. While there is a great deal of overlap in the kinds of problems that can be efficiently solved with both platforms, there will always be cases where a processor is a better choice, such as running an operating system or executing control code, and cases where an FPGA is a better choice, such as running high-performance, fixed-point computational tasks. So while FPGAs may look attractive, a multi-core processor may still be a far more optimal solution; and if it is, a firm's competitors may be taking that route, in which case the firm may have no choice but to explore multi-core as well.
65 http://www.bdti.com/InsideDSP/2010/09/02/JeffBierImpulseRespfonse
66 http://low-powerdesign.com/sleibson/2011/03/01/xilinx-zynq-epps-create-a-new-category-that-fits-in-among-socs-fpgas-and-microcontrollers/
Another technology alternative is hardware accelerators. A hardware accelerator is a common system-level function that has been implemented as digital logic within a processor design rather than as software that runs on the processor's cores. Hardware accelerators save embedded engineers a great deal of time because they don't need to source this function from a third party or author it from scratch.
Most hardware accelerators target specific functions that have become standardized in the
industry. For example, the H.264 digital video compression standard has stabilized over the last
decade and many embedded processors targeting certain applications will typically have a
hardware accelerator that performs the H.264 video compression (encoding) or decompression
(decoding) so the processor doesn't have to do this in software. For example, some variants of Texas Instruments' DaVinci processor, which target video surveillance, include a hardware accelerator that performs H.264 video encoding.67 The NVIDIA Tegra 2 processor, shown in Figure 45 below, which powers the Motorola Xoom, includes several accelerators for video encoding and decoding, audio processing, 2D and 3D graphics, and image processing.
Figure 45: NVIDIA Tegra 2 Processor
Another benefit of hardware accelerators is that they typically can deliver a much higher
level of performance and use significantly less power than the functions written in software
might. An EE-Times study in 2008 found that accelerators offered a 200x performance increase
over a software implementation at the same power dissipation level or a 90% power reduction at
the same performance level (Frazer, 2008).
67 http://www.security-technologynews.com/article/h264-encoder-digital-media-processors.html
Figure 46: Performance and Power Gains with Hardware Acceleration (Frazer, 2008)
Hardware accelerators are becoming much more common on embedded processors for two key reasons. First, they have become less expensive to implement. Years ago, a processor vendor would need some very large customers to justify the resources to add an accelerator; today, a smaller subset of customers can justify adding one, because gate costs at more advanced process nodes are significantly lower, as seen in Figure 47.
Figure 47: Cost per gate by process node (Kumar, 2008)
Second, an entire industry of more than 200 companies has emerged providing hardware IP blocks that suppliers can purchase and integrate into their processors (McCanny, 2011).
Hardware accelerators have their share of challenges. Suppliers who develop accelerators
themselves will need to acquire a great deal of knowledge about the end application to ensure the
accelerator is a viable substitute for software that their customers typically develop themselves.
Furthermore, accelerators can introduce new forms of complexity into both the architecture and the tool chain. For example, system developers may need to debug accelerator behavior alongside software, but in many cases accelerators are not connected to the hardware debug resources on the chip, and tool chains typically can't provide a unified view (Brutch, 2010). Furthermore, accelerators present themselves as black boxes to the developers using them, making them more challenging from a system debug perspective since developers don't have visibility into the design and inner workings.
As a result, the challenges that multi-core approaches present may lead companies to consider alternative approaches to achieving the required performance, even where they might not otherwise have done so, thereby limiting the adoption of multi-core processors in embedded systems.
Organizational Structure and Product Architecture
Conway's Law, named after computer programmer Melvin Conway based on a paper he wrote in 1968, states that an organization will produce designs that mirror its communication structures (Conway, 1968). Henderson and Clark suggest that the structure of a dominant design will typically be mirrored in organizational structures (1990). This dynamic can be very powerful and is often underestimated by suppliers providing technologies (such as new multi-core architectures) that offer great benefits but also require an architectural change of the firm adopting them.
There have been several examples of new technologies which offered major technological
benefits but for which companies weren't organizationally in a position to leverage the
technology. From the author's experience, a highly relevant example is the Blackfin Embedded
Processor from Analog Devices. This processor was co-developed by Analog Devices and Intel
and launched in 2000. The intent of the design was to displace both a DSP and a microcontroller
in an existing system by combining the salient attributes of these processors into a single
processor architecture. The architecture was named the 'MicroSignal Architecture'.68 Despite the cost and power benefits of removing a DSP and MCU from a design and combining these functions into a single Blackfin, we found that firms' existing organizational structures were the
main impediment to adoption. First, these firms typically had two different teams developing the
software for the DSP and MCU; and second, these teams had very different skill sets. The team
programming the DSP typically consisted of electrical engineers and mathematicians while the
team writing the control code on the MCU consisted of computer scientists. And while these
two teams interacted, they were two separate teams. Requiring that these two teams architect a
single piece of software that maintained the throughput and real-time requirements that the DSP
engineers needed to maintain and also supported the large OS that the MCU engineers needed to
68 http://electronicdesign.com/article/digital/company-alliance-develops-micro-signal-dsp-archite.aspx
run, was difficult at several levels. As a result, several firms have adopted Blackfin as a DSP or an MCU, but firms wishing to migrate from an existing DSP+MCU architecture to a single Blackfin were often not in a position to capitalize on the benefits. The cost and risk associated with reorganizing around a single processor architecture, and then trying to meet all of the conflicting DSP and control-system requirements, weren't justified by the cost and power benefits of the Blackfin.
Factors and Dynamics Surrounding a Firm
Ecosystem
The ecosystem surrounding a processor is extremely important to embedded developers.
Within the ecosystem are the development tools, 3rd parties providing libraries and operating systems for that platform, contractors who are able to assist with product development, training firms who can help organizations build skills around new architectures, and many more. There may also be an informal network of other engineers developing products using a processor platform who provide support to each other on public forums.69
A strong ecosystem is critical to success for processor vendors because without a
development tool chain, for example, there would be no way to program the device. Without a
network of 3rd parties providing libraries, firms adopting the processor would be forced to implement common functions from scratch. If operating systems hadn't been ported to that processor, firms would be required either to port an open-source operating system themselves, hire a commercial OS company to port their OS to that platform, or develop some form of proprietary kernel themselves. Not surprisingly, 46% of embedded system engineers said that the quality of the ecosystem surrounding a processor was the most important attribute when selecting a processor, scoring higher than the processor itself (43%) and the chip supplier (11%) (UBM/EE-Times, 2011).
Because an ecosystem consists of several different components that all serve different roles when developing a product with a processor, ecosystems can also suffer from a weakest link. For example, a processor may have a rich set of options for compilers and debuggers, but if that processor targets a market for which certain algorithms are highly standard and there is no library supplier in the ecosystem, it will likely be at a large disadvantage.
69 Most popular processor architectures have one or more public discussion forums that are not affiliated with the silicon vendors or IP providers. For example, here is an ARM discussion group centered around programming ARM processors with GCC/GNU tools: http://embdev.net/forum/arm-gcc
A company like ARM has a massive ecosystem surrounding its processors. As of April 2011, ARM's website listed 655 partners within its ecosystem providing libraries, operating systems, development tools, design services, training and more.70 As a processor architecture becomes more popular, it becomes easier to attract participants in the ecosystem, because these potential ecosystem participants see larger and more secure opportunities to justify the investment in developing resources for that processor. According to ARM's annual report, 6.1 billion chips containing an ARM processor shipped in 2010 (2010). However, in the author's experience, a company launching a new processor architecture may have a harder time attracting ecosystem players because the future success of that processor isn't yet clear. The supplier, therefore, may need to invest directly in 3rd parties to seed the ecosystem, or do the heavy lifting of creating the components an ecosystem would normally deliver in-house.
Some suppliers bringing a new processor to market may be able to leverage an existing
ecosystem. For example, a supplier developing an ARM-based processor has an extremely large
ecosystem that they can leverage directly. In other cases, firms may also develop new
architectures that are designed in a manner that will allow them to at least capitalize on existing
design patterns.
A processor supplier developing a radical processor architecture that can't leverage any existing ecosystems will likely need to expend far more effort developing an ecosystem from scratch, either within the company or by directing funding to outside 3rd parties. This is consistent with research around revolutionary technology introductions (Afuah, 2001). In general, delivering revolutionary products requires much greater funding, as development costs are higher since less existing technology is being leveraged, and the return on investment will typically be much further out in the future (Golda, et al., 2007).
Consider two entrants into the many-core embedded space: Stream Processors, Inc. (SPI) and Tilera. Both companies were founded in 2004 by MIT professors (former and current, respectively) and offered radical new approaches with theoretical computation throughputs well beyond what existing embedded processors and DSPs could offer. In 2007, SPI touted a 10x increase in cost/performance over existing DSPs with their Storm-1 family of processors.71
70 ARM's Partner Network: http://www.arm.com/community/partners/all_partners.php
71 http://www.eetimes.com/electronics-products/processors/4091602/Startup-touts-stream-processing-architecture-for-DSPs
In 2007, Tilera claimed a 40x advantage over a mainstream high-end TI DSP on certain benchmarks with the TILE64 processor.72
SPI's initial Storm-1 family of processors, shown in Figure 48, utilized a MIPS processor as a system processor and a second MIPS processor to control a series of 16 parallel computation units, each with 5 computation sub-systems. SPI also provided a C compiler, tool chain and some libraries aligned to their target markets of video surveillance, video conferencing and multifunction printers.73
Figure 48: SPI's High-End SP16HP-G220 Processor Block Diagram
Tilera's initial offering was the TILE64 processor, which consisted of 64 proprietary cores arranged in a mesh configuration as shown in Figure 49. Tilera was focused on similar markets as SPI initially (digital multimedia and networking74) but took a far more general-purpose approach to the architecture. Tilera supported Linux via a 64-way SMP implementation back in 2007, around the time of the product's launch. Tilera also claimed that 'unmodified' Linux applications could be built for the TILE64 using Tilera's tool chain.75 While SPI supported Linux on the control processor, it required customers to work at a lower level and program the processing elements directly and explicitly.
72 http://arstechnica.com/hardware/news/2007/8/MIT-startup-raises-multicore-bar-with-new-64-core-CPU.ars/2
73 SPI's website as of April 2008: http://replay.web.archive.org/20080420122343/http://www.streamprocessors.com/
74 Tilera's website as of April 2008: http://replay.web.archive.org/20080412214042/http://www.tilera.com/
75 http://www.mm4m.net/library/64 Core Processors.pdf
Figure 49: Tilera's TILE64 Processor Block Diagram
Since then, Tilera has three products in mass production76 and has established a formidable ecosystem around their architectures, including 23 3rd parties listed on their site as of April 2011.77 They promote several design wins on their site.78 As of October 2010, the latest Linux kernel (2.6.36) supports Tilera's TILE architecture.79 SPI went out of business in 2009.
In an interview with Anant Agarwal, MIT professor and co-founder/CTO of Tilera Corporation, Mr. Agarwal commented that providing a processor platform that not only delivers high performance but is also general purpose has been an important component of Tilera's continued success. By providing a more general-purpose architecture that supports SMP Linux, standard tools and standard multi-core APIs and libraries like pthreads, OpenMP and TBB, potential adopters are able to leverage existing knowledge, software and problem-solving techniques, easing the process of platform evaluation and adoption. The general-purpose nature of the architecture also means that it can be used to solve a more diverse set of problems across more markets, and potentially larger markets.
He noted that the processors developed by other, now defunct, many-core suppliers were not general purpose, which required them to focus on specific markets, and this created certain challenges for these firms. For one, they were more tightly coupled to these market segments, and it was difficult to diversify into other segments. Furthermore, because of the lack of standard tools, libraries and operating systems, their customers typically required them to
76 http://www.tilera.com/about_tilera/testimonials
77 http://www.tilera.com/partners
78 http://www.tilera.com/about_tilera/testimonials
79 http://www.tilera.com/about_tilera/press-releases/linux-kernel-now-supports-tile-architecture
develop large portions of the applications, which placed additional support burdens on these suppliers (Agarwal, 2011).
As noted above, the strength of an ecosystem is generally more important to engineers than the chip itself, which ties back to several of the challenges highlighted in the previous sections of this thesis. Companies which introduce new architectures that present a radical departure from dominant design patterns may be required to develop a dedicated ecosystem around that architecture, which can be quite expensive. Alternatively, if processor vendors with a radical product concept are able to leverage an existing ecosystem and design pattern, they may be able to reduce the investment in their own ecosystem and appear as a lower risk to adopters, as they're leveraging existing design patterns that may be well understood within firms adopting the technology.
Established vs Start-up Behavior
Well-established firms may wait until a revolutionary technology gets a foothold before making substantial investments (Golda, et al., 2007). We now see evidence of several major companies, such as Marvell, TI and Samsung, developing multi-core processors based on the ARM Cortex-A9 and the follow-on A15, as that platform has become fairly dominant. Companies like Freescale are developing SMP processors based on their own core technology but which will be able to support the existing SMP ecosystem. This technology presents itself as more evolutionary than revolutionary, since existing applications can be more easily ported and then further tuned for performance later. However, we haven't seen major movements into more radical approaches from major suppliers yet.
Supplier Mortality
Another key adoption dynamic is start-up mortality. Start-ups pursuing radical technologies
may require a larger amount of funding because the technology development will be more
expensive and it will take longer to get a return on investment. Golda and Philippi describe the
'valley of death' as shown in Figure 50 as a period within the first several years of a firm
developing a radical technology in which the rate of spending is high but the rate of earnings is
low. However, if the radical technology is widely adopted, the return on investment over time is
significant.
Figure 50: "Valley of Death" for a revolutionary technology (Golda, et al., 2007)
Alternatively, suppliers developing more incremental or evolutionary products require a smaller investment but also may not see as large a potential return over the long run.
Figure 51: "Valley of Death" for an evolutionary technology (Golda, et al., 2007)
The development costs for a fabless startup developing a new processor architecture are quite
high. Jim Turley of Microprocessor Report offered several insightful observations on the
business climate for fabless startups. He estimates that a fabless startup developing a new
processor will need $100M to survive. Half of this is for product development which will
typically take about 2 years from the first design efforts to working silicon. The other half is
spent on securing those first critical design wins - marketing, sales, support, etc. Once the
processor is ready, it typically takes a year to secure that first design win and those willing to
take a risk on a brand new technology from a brand new startup are typically start-ups
themselves. Revenue won't start arriving until these customers develop their products and bring them to market, which could take months to years depending on the markets those customers serve
(Turley, 2010). To add insult to injury, once the first core is ready, development needs to begin on the next-generation product, which will likely need to deliver higher performance and lower power at a lower cost. There is a long list of startups in this space that have failed or have been acquired for assets. In recent history, Stream Processors, Inc. (SPI), SiCortex, Mathstar, Ambric, Rapport and Quicksilver have all vanished or been acquired for assets.80,81,82,83 Older companies
80 http://www.eetimes.com/design/signal-processing-dsp/4017733/Analysis-Why-massively-parallel-chip-vendors-failed
81 http://www.bizjournals.com/sanjose/stories/2009/11/02/daily124.html
82 http://www.hpcwire.com/hpcwire/2009-05-27/powered_down_sicortex_to_sell_off_assets_of_company.html
doing concurrent processing that are no longer in business include Ardent, Convex, Encore, Floating Point Systems, Inmos, Kendall Square Research, MasPar, nCUBE, Sequent, Tandem, and Thinking Machines (Patterson, 2010).
As a manager selecting suppliers, the perceived health and growth prospects of a new startup are likely an essential selection criterion. A new processor startup may have a technology that offers unparalleled benefits which could provide significant end-product differentiation to the firm adopting it. However, if the supplier goes bankrupt in 2 years, the adopting firm would be in a far worse position, as it would likely need to redesign its entire product. This is particularly important for more radical multi-core processor designs because they typically require a higher degree of platform coupling; it's possible that migrating from their processor to another firm's processor would be almost as much work as the original adoption.
Human / Cognitive Factors and Dynamics
There are several important and relevant cognitive and behavioral factors which impact the
way we think and make decisions. This section will explore some of these factors and how they
may be impacting the perception and adoption of multi-core processors.
Thinking "Concurrently" Is Fundamentally Different
There are two important fundamental challenges associated with developing applications for
multi-core processors that we will explore. The first challenge is that thinking concurrently
doesn't come naturally to people. The second challenge is that solving problems with sequential
software is a process that is deeply ingrained in most programmers.
Since the concept of the "Turing Machine",84 modern mainstream computer architectures have embodied the same basic paradigm: processors operate on a sequence of instructions that are arranged to solve a given task and use a local memory system to store instructions and state information.
83 http://www.eetimes.com/electronics-news/4155235/QuickSilver-closes-operations-shops-IP
84 In 1937, Alan Turing, a British cryptologist, published a paper about a conceptual mechanical device that would be capable of computing any mathematical expression. This conceptual machine consisted of three parts. First, there is an endless piece of tape onto which information could be stored and retrieved; because the tape is endless, it can hold an infinite amount of information. Second, there is a read/write head that can transfer information to and from the tape. Finally, there is an application which performs a sequential set of functions on the information (Turing, 1936).
Lynn Stein refers to this as a 'computational metaphor' through which we approach problems: "The computational metaphor is an image of how computing works - or what computing is made of - that serves as the foundation for our understanding of all things computational." (Stein, 1996)
This pattern or 'metaphor' of sequential problem solving is not unique to the computer
science profession and may actually be more ingrained in our education system and history.
Some research shows that various forms of sequential thinking are taught in primary school
education, and this approach to problem solving is used in areas like biology and informational
science (Stein, 1996).
As people develop knowledge about a specific problem solving technique for a certain class
of problems, they may not be able to consider all of the alternatives when faced with a new kind
of problem (Henderson, et al., 1990). As Henri Bergson once said, "The eye sees only what the mind is prepared to comprehend" (quoted on p. 24 of Business Dynamics).
Furthermore, the manner in which a problem is represented will impact the breadth,
creativity and overall quality of the solutions provided by the problem solvers. As Nancy
Leveson states in her paper on Intent Specification, "Not only does the language in which we
specify problems have an effect on our problem-solving ability, it also affects the errors we make
while solving those problems." (2000)
This implies, first, that we have a method of problem solving that is highly ingrained not only in the software industry but almost at a societal level. Second, because that knowledge is so deeply ingrained, we may have a difficult time imagining alternative solutions when faced with a different class of problems. It is no surprise that in the areas in which multi-core adoption is happening fastest (general-purpose computing), a model (SMP) has been created that allows the existing thread paradigm to persist in a multi-core environment.
"humans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations." (Sutter, et al., 2005)
Psychology of Gains, Losses & Endowment
John Gourville cites the work of Nobel Prize winning psychologist Daniel Kahneman and
psychologist Amos Tversky showing that humans, when considering product adoption, weigh
what they are going to lose by adopting that product far more heavily than what they stand to
gain - an idea known as loss aversion. These studies showed that the gains must outweigh the
losses by two to three times for people to find an offering attractive. This leads to what
Gourville calls the "Endowment Effect," where people place higher value on the products they
already have than on those they don't yet have (2006). He also notes that most people are
completely unaware of their tendency to do this.
This has some potentially powerful implications for the adoption of multi-core processors. A
firm's knowledge assets are part of its endowment, and whether the individuals making platform
decisions are aware of it or not, multi-core products that require the destruction of a large
amount of those knowledge assets may weigh on the selection process more heavily than
previously considered. This may imply that a firm offering a less powerful multi-core
technology that preserves the adopter's endowment could be more successful than firms whose
more powerful technology would be more destructive to the endowment of knowledge assets.
Variety & Choice Overload
Research by Sheena Iyengar shows that people find a greater number of choices debilitating
and that, once they've made a choice, they are generally less satisfied (Iyengar, 2010). Meta-research
by Benjamin Scheibehenne on the various psychological studies around choice
suggests that it may not be the number of choices but rather the amount of information that is
involved in the decision and the amount of familiarity the user has with the objects they are
selecting between. People who are less familiar with the options may experience a greater
amount of frustration after selecting (Mogilner, et al., 2008).
As noted earlier, there is less mobility within the design space of multi-core processors than
single core processors. Additionally, there may be several potential configurations that could
solve the problem, and given the difficulty in predicting the performance of multi-core systems,
it is often difficult to know up front which will succeed. Choice overload may play a role in the
adoption of multi-core because there are many different choices, adopters have little familiarity
with each, they are unable to accurately predict their success up front, and they may be coupled
to their decisions for some time.
9. Adoption Heuristics
Now that we have established categories of phenomena and examined the causal mechanisms
that drive the observed pattern of adoption for multi-core processors in embedded systems, the
final section of this thesis will bring these together and use this well-grounded theory to define a set
of heuristics for adopting companies in the embedded systems domain and their suppliers, or
would-be suppliers, of multi-core processors.
Heuristics for Adopters of Multi-core Embedded Processors
First, we will present a set of heuristics which firms looking to adopt a multi-core embedded
processor in a design may use to predict short and long term success with a multi-core processor
platform. The key considerations include the nature of the processing, the existence of legacy
code, the competencies of the firm and the supporting ecosystem.
Nature of the processing
If the nature of the processing is more general purpose, meaning that the applications that
will run on the system aren't specifically known at the time of development, an SMP approach
will provide flexibility along with some power and performance benefits.
If the nature of the processing is more fixed-function, the design may require a greater degree of
optimization against that function to be competitive. If the processing has hard real time
constraints, delivering hard real time with an SMP approach may require additional skills and
time.
If the nature of the function to be implemented is best done in software (heavy on control and
conditional code, floating point math, etc), a processor platform is likely ideal. However, if the
function could be more efficiently implemented in hardware, FPGAs or acceleration should be
considered.
If significant portions of the processing tasks are based on industry standard functions (e.g.,
H.264 decode) and the end product differentiation strategy is not based around delivering
premium performance of this function, hardware acceleration may be more economical than
implementing the functions in software. If the processing tasks are not standard functions, it is
more likely that these will need to be developed in-house or be contracted out.
Existence of Legacy Code
If the nature of the processing is more general purpose and POSIX-compliant multi-threaded
legacy code exists, an SMP operating system running on an SMP processor may be optimal.
However, if legacy code came from a third party, it may not run properly in an SMP configuration
and it may not be partitionable, particularly if source code isn't available.
If the nature of the processing is more fixed function and legacy code exists, that code will
need to be decomposed / partitioned to run on multiple cores.
If an existing hardware architectural design pattern, like the ones covered earlier, will
facilitate porting, this can significantly ease the migration process. However, if such a design
pattern doesn't exist, a more significant amount of work may be required.
If the legacy code was written in a modular fashion, it is possible that the existing
decomposition will allow for clean segmentation across processing resources. This of course
isn't guaranteed - it's equally possible that the decomposition strategy does not lend itself at all
to concurrency and that application performance could decrease. However, if the architecture of the legacy
software can be characterized as integral, a certain amount of work will be required to
decompose the legacy code into a modular structure that can be partitioned across cores. Of
course, there is similar risk that the wrong decomposition strategy is chosen and multi-core
implementation doesn't deliver any incremental benefit over an existing implementation.
If the application lends itself to being split across processing elements (data or instruction
parallelism), an optimal software architecture may be more clear. However, if the opportunities
for parallelism aren't immediately obvious, more work may be required to get incremental
benefits over an existing implementation.
To summarize, if existing POSIX-compliant code is available, rearchitecture may not be
required, but changes are likely needed to protect against the issues found with threads and
multi-core.
Competencies
If the firm has experience segmenting applications across multiple processors, there are
likely architects within the firm who understand how to predictably develop, debug and deploy
concurrent applications. However, if no developers have any experience developing concurrent
applications, the likelihood of making incorrect architectural decisions or suffering from long
debug periods as they learn how to track down new types of emergent bugs will increase.
If the firm doesn't have expertise, it can hire experts or consultants to help. They may help
the firm avoid common pitfalls that companies going through the process for the first time make.
However, if the firm isn't in a position to hire or contract, there may be a steep learning curve
associated with the project and development could take longer than expected.
If the multi-core processor relies on standard tools, the learning curve may be shallower for
developers new to the architecture. If new tools need to be learned, existing development and
debug skills that are coupled to the old tools may not be transferable.
Suppliers & Ecosystem
If there are multiple multi-core processor vendors whose products could potentially fit, the
firm will have the option of selecting suppliers who appear to be in a better position financially
and who have a greater likelihood of being around in the future. If there are fewer firms whose
products can meet the performance requirements, and those firms have been around for fewer
than five years and are still pushing their first generation of silicon, they may not be
around for much longer. However, if there are multi-core processor solutions from established
firms or even recent startups who have delivered multiple generations of products and have
multiple high-profile design wins, these vendors may be safer long term bets.
If the processors being considered have an established ecosystem (tools, operating systems,
libraries, contractors, etc), these pieces may be leveraged to expedite the design process. If the
ecosystem is nascent or nonexistent, the firm will need to develop more foundational pieces of
the design from scratch and may not have access to the system partitioning, tuning, debug and
optimization tools, which may increase development time.
The ecosystem players may also be new to multi-core. For example, a supplier may take an
existing processor core which has a rich ecosystem and offer a dual core version. In this case,
the existing ecosystem players may not have the level of maturity in concurrent design to deliver
reliable components.
Heuristics for Suppliers of Multi-core Embedded Processors
For suppliers determining whether or not to launch a multi-core processor, the following
heuristics may predict the viability of a multi-core processor and can be used as a prescription to
help ensure success for new processors.
Product Attributes
If the processor offers revolutionary levels of performance along key trajectories, companies
may be willing to invest more in adopting the product. However, if the processor provides only
incremental performance levels over competing products or existing solutions, the investment
level adopting firms may be willing to make will be lower.
If the processor is easy to program, supports C programmability, existing software design
patterns like SMP, and standard multi-core libraries/APIs like pthreads, OpenMP and MCAPI, it
is more likely to be compatible with existing designs and will also be perceived as easier to
develop with. However, if the processor requires the use of proprietary tools and libraries to
program, it will represent a much larger investment for the adopting firm, as they will need to
establish a greater amount of tacit knowledge around the product. They may lose platform
mobility in the process.
If the processor is easy to evaluate and a potential adopter can see performance increases
without spending a great deal of time learning the architecture and porting code, they may be
confident that they can incrementally improve from there to achieve the desired performance.
This will reduce the cost and risk associated with adoption. However, if a
potential adopter needs to rearchitect their application to run on this device, it may not be worth
their time to even evaluate it.
Target Market
If the processor is designed to target one or a small number of specific market segments,
it will likely be a more specific implementation that can't be applied to problems in other
markets. The vendor may need to be prepared to develop most of the applications themselves
for the target customers and the support load may be more significant. Furthermore, if those
markets don't pan out, it may be more difficult to apply the same technology to other markets.
If the processor is designed to be a more general purpose processor, it will compete with other
solutions (hardware acceleration, FPGAs, etc) optimized for specific segments. It may be
differentiated through flexibility and programmability in which case SMP and an ecosystem of
support for industry accepted operating systems like Linux and tools like GNU/GDB/Eclipse
may help.
Target Customers
If the developers in the target market have a history with algorithm partitioning across cores,
they may have the skills necessary to develop a successful product. However, if the developers
have largely been developing applications for single core processors, they may need to do a great
deal of work (and learning) to get their applications ported to this processor architecture.
If this product will be compatible with the existing organizational structures and skill sets
within the target customers, it will be an easier sell. However, if this processor fundamentally
changes the way adopters develop, deploy and support products in the field, it may represent an
incredibly costly investment. For example, if firms traditionally used in-house ASIC teams and
this represents a programmable solution, in addition to developers, there may also be an army of
field support people who are trained on the ASIC-based solution who would need to be entirely
retrained to support a product built around a fundamentally different technology.
If the potential customers are risk averse, it may be difficult to find a firm willing to take a
chance on a fabless startup, particularly among adopters who have long product life cycles. If
the performance benefit isn't high and the switching costs aren't low, it will be difficult to get
the first design win.
Architectural Compatibility
If this processor will require customers in the focused segments to rearchitect their software
to run on multiple cores, adoption will represent a significant investment for them.
Competitive Landscape
If the processor requires architectural and organizational changes along with several new
knowledge assets, the benefit will need to be incredibly high to offset the costs of adoption. If
the processor architecture is compatible with the existing product architecture, organization and
skills, a lower incremental benefit may still be enough impetus for potential customers to switch.
Ecosystem
If the supplier has established an ecosystem around the product to deliver tools, libraries and
design support, this may take risk out of the platform for potential adopters. However, if the
adopters will need to do more heavy lifting, it will increase the cost and risk associated with
adoption. Furthermore, if the ecosystem is based on proprietary tools, adopters may need to
spend more time learning the new tools.
10. Conclusions
In this thesis, we have reviewed several concepts related to the technology architecture and
dynamics of technology adoption. We have explored several concepts related to embedded
systems, parallel computing and multi-core processors. We have examined architectural design
scenarios related to multi-core adoption and have shown that the existence of design patterns can
facilitate the adoption of multi-core embedded processors. We examined several other adoption
mechanisms related to the management strategy and organization of the adopting firm, the
ecosystem of the product and cognitive / behavioral factors. Finally, we provided a set of
heuristics that attempt to predict the level of difficulty an adopter of an embedded multi-core
processor will likely face, and a similar set of heuristics on the supply side for a semiconductor
supplier bringing a new multi-core processor to market.
11. Appendix A: Libraries and Extensions Facilitating the Programming of Multi-core Processors
Intel Threading Building Blocks (TBB)
Intel Threading Building Blocks (TBB) is a library developed by Intel specifically designed
for expressing parallelism in multi-core applications.85 It is supported on Linux, QNX and
FreeBSD as well as Windows, OS X and Solaris.86 TBB allows software developers to work at
an abstraction level above the platform architecture and threads, and offers more scalability and
performance.
OpenMP
OpenMP is another multiprocessing standard/API that consists of compiler directives
(#pragmas) which are used to aid the compiler in parallelizing regions of code via threads (Gove,
2011). While pthreads can be used for coarse- and fine-grain parallelism, OpenMP is more
suited for finer-grain parallelism (Freescale Semiconductor, 2009).
MPI and MCAPI
MPI (message passing interface) is a standard library used for synchronously and
asynchronously passing messages between processors that was originally developed for older
homogeneous distributed multiprocessor systems (Marwedel, 2011). MCAPI is a similar and
more modern standard that targets multi-core and more tightly coupled multiprocessor systems.
It is designed to provide a low latency interface which utilizes the interconnect technologies on
modern homogeneous and heterogeneous multi-core systems (The Multi-core Association,
2011).
85 http://threadingbuildingblocks.org/
86 http://threadingbuildingblocks.org/file.php?fid=86
12. Bibliography
Afuah Allan Dynamic Boundaries of the Firm: Are Firms Better off Being Vertically
Integrated in the Face of Technology Change [Journal] // The Academy of Management
Journal. - 2001. - pp. 1211-1228.
Alexander Christopher A Pattern Language [Book]. - Oxford: Oxford University Press,
1977.
AnandTech Nehalem - Everything You Need to Know about Intel's New Architecture
[Online] // AnandTech. - 11 03, 2008. - 04 20, 2011. - http://www.anandtech.com/show/2594/10.
ARM Holdings ARM [Online] // ARM Annual Report - Non-Financial KPIs. - 2010. - 04
23, 2011. - http://www.arm.com/annualreportl0/downloadcentre/PDF/ARM%20AR%2Ooverview.pdf.
Baldwin Carliss Young and Clark Kim B. Design Rules: The power of modularity
[Book]. - Cambridge : The MIT Press, 2000.
BDTI Analysis: Why massively parallel chip vendors failed [Online] // EE Times. - UBM, 01
21, 2009. - 04 29, 2011. - http://www.eetimes.com/design/signal-processing-dsp/4017733/Analysis-Why-massively-parallel-chip-vendors-failed.
Bergland G.D. A Guided Tour of Program Design Methodologies [Journal] // IEEE. - 1981. - pp. 13-35.
Bhujade Moreshwar Parallel Computing [Book]. - New Delhi : New Age International
Limited, Publishers, 1995.
Brutch Tasneem Software Development Tools for Multi-core Systems [Webcast]. - [s.l.] :
EE Times, September 24, 2010.
Christensen Clayton M and Raynor Michael E Why Hard-Nosed Executives Should Care
About Management Theory [Article] // Harvard Business Review. - September 1, 2003.
Conway Melvin E How Do Committees Invent? [Article] // Datamation Magazine. - April
1968.
Crawley E ESD.34 - Lecture 6, IAP 2009 [Conference]. - 2009. - p. 39.
Culler David E, Singh Jaswinder Pal and Gupta Anoop Parallel computer architecture: a
hardware/software approach [Book]. - San Francisco : Morgan Kaufmann Publishers, 1999.
Databeans 2008 Microcontrollers - Semiconductor Product Markets - Worldwide [Report]. -
[s.l.] : Databeans, 2008.
Dietrich Sven-Thorsten A Brief History of Real-Time Linux, Linux World 2006. - Raleigh:
[s.n.], 2006.
Eick Stephen G. [et al.] Does Code Decay? Assessing the Evidence from Change
Management Data [Article] // IEEE Transactions on Software Engineering. - Jan/Feb 2001. - Vol. 27, No. 1. - pp. 1-12.
Emcore Magazine Test, test and test again [Article] // Emcore Magazine. - 09 2010. - pp.
16-17.
Feng Wu-chun Making a Case for Efficient Supercomputing [Journal] // Queue. - 2003. - pp. 54-64.
Frazer Rodney Reducing Power in Embedded Systems by Adding Hardware Accelerators
[Online] // EE Times. - 04 09, 2008. - 04 20, 2011. - http://www.eetimes.com/design/embedded/4007550/Reducing-Power-in-Embedded-Systems-by-Adding-Hardware-Accelerators/.
Freescale Semiconductor Embedded Multi-core: An Introduction [Online] //
freescale.com. - 2009. - 01 01, 2011. - www.freescale.com/files/32bit/doc/ref_manual/EMBMCRM.pdf.
Gartner Research Gartner Says Worldwide PC Shipments to Increase 19 Percent in 2010
with Growth Slowing in Second Half of the Year [Online] // Gartner Newsroom. - August 31,
2010. - 04 11, 2011. - http://www.gartner.com/it/page.jsp?id=1429313.
Gentile Rick Processor Applications Engineering Manager [Interview]. - 04 19, 2011.
Golda Janice and Philippi Chris Managing New Technology Risk in the Supply Chain
[Article] // Intel Technology Journal, Volume 11, Issue 2. - 2007. - pp. 95-104.
Gourville John Understanding the Psychology of New-Product Adoption [Journal] //
Harvard Business Review. - 2006. - pp. 99-106.
Gove Darryl Multi-core Application Programming for Windows, Linux and Oracle Solaris
[Book]. - Boston: Addison-Wesley Professional, 2011.
Henderson Rebecca and Clark Kim Architectural Innovation-The Reconfiguration of
Existing Product Technologies [Article] // Administrative Science Quarterly. - 1990. - pp. 9-30.
Hovsmith Skip Getting started with multi-core programming: Part 1 [Online] // EE Times. - UBM, 07 07, 2008. - 04 21, 2011. - http://www.eetimes.com/design/embedded/4007623/Getting-started-with-multi-core-programming-Part-1.
IDC Worldwide Server Market Rebounds Sharply in Fourth Quarter as Demand for Blades
and x86 Systems Leads the Way, According to IDC [Online] // IDC. - 02 24, 2010. - 04 11,
2011. - http://www.idc.com/getdoc.jsp?containerId=prUS22224510.
Intel Intel IA-64 Architecture Software Developer's Manual [Article]. - January 2000. - pp. 6-21. - http://www.cs.umbc.edu/portal/help/architecture/24531701.pdf.
Iyengar Sheena The Art of Choosing [Book]. - New York : Hachette Book Group, 2010.
Kuhn Thomas The Structure of Scientific Revolutions (2nd ed) [Book]. - Chicago:
University of Chicago Press, 1970.
Kumar Rakesh Fabless Semiconductor Implementation [Book]. - [s.l.] : McGraw-Hill,
2008.
Kundojjala Sravan Baseband Vendors Will Take One-Third of the Smartphone Multi-Core
Apps Processor Market in 2011 [Online] // strategyanalytics.com. - 01 19, 2011. - 03 19, 2011. http://blogs.strategyanalytics.com/HCT/category/Handset-Component-Technologies.aspx.
Lee Edward A The Problem with Threads [Report]. - Berkeley : University of California at
Berkeley, 2006.
Leveson Nancy Intent Specifications: An Approach to Building Human-Centered
Specifications [Article] // IEEE Transactions on Software Engineering. - 1 2000. - pp. 15-35.
Leveson Nancy Software Engineering: A Look Back and A Path to the Future [Online]. - December 14, 1996. - 04 04, 2011. - http://sunnyday.mit.edu/16.355/leveson.pdf.
Levy Marcus The Adoption of Multi-core [Online] // National Instruments. - 06 02, 2008. - 01 01, 2011. - http://ni.adobeconnect.com/p77017465/?launcher=false&fcsContent=true&pbMode=normal.
MacCormack Alan, Rusnak John and Baldwin Carliss Exploring the Structure of
Complex Software Designs: An Empirical Study of Open Source and Proprietary Code
[Article] // Management Science. - 2006.
McCanny Jim Trust but Verify: increasing IP incorporation in 2011 SoC Design [Online] //
Chip Design Magazine. - Spring 2011. - 04 20, 2011. - http://chipdesignmag.com/display.php?articleId=4800.
McKenney Paul E Is Parallel Programming Hard, And, If So, What Can You Do About It?
[Book]. - Beaverton : IBM Linux Technology Center, 2010.
Mogilner Cassie, Rudnick Tamar and Iyengar Sheena The Mere Categorization Effect:
How the Presence of Categories Increases Choosers' Perceptions of Assortment Variety and
Outcome Satisfaction [Journal] // Journal of Consumer Research. - 2008. - pp. 202-215.
Moore Gordon E Cramming more components onto integrated circuits [Journal] //
Electronics. - 1965.
Moynihan Finbarr Marketing Director, MediaTek [Interview]. - Boston : [s.n.], 04 20,
2011.
Murmann Johann Peter and Frenken Koen Toward a systematic framework for research
on dominant designs, technological innovations, and industrial change [Article] // Research
Policy. - 2006. - Vol. 25. - pp. 925-952.
Niebeck Bob MIT ESD.36 Guest Lecturer - Fall 2009. - 10 2009.
Norman D.A. Things That Make Us Smart [Book]. - [s.l.] : Addison Wesley Publishing
Company, 1993.
Parnas D.L. On the Criteria To Be Used in Decomposing Systems into Modules [Journal] //
Communications of the ACM. - 1972. - pp. 1053 - 1058.
Parsons David Object Oriented Programming with C++ [Book]. - New York : Continuum,
1994.
Patterson David The Trouble with Multicore [Article] // IEEE Spectrum. - July 2010. - pp.
28-32, 52-53.
Rajan Hridesh, Kautz Steven M. and Rowcliffe Wayne Concurrency by Modularity:
Design Patterns, a Case in Point [Conference] // Onward!. - Reno : [s.n.], 2010. - pp. 790-805.
Raymond Eric The Cathedral and the Bazaar [Online] // Eric S. Raymond's Home Page. - 08
02, 2002. - 04 20, 2011. - http://www.catb.org/~esr/writings/homesteading/cathedral-bazaar/.
Ritchie Dennis M The Development of the C Language [Online] // Bell Labs. - 2003. - 04 11,
2011. - http://cm.bell-labs.com/cm/cs/who/dmr/chist.html.
Sandia Labs More chip cores can mean slower supercomputing, Sandia simulation shows
[Online] // Sandia Labs. - 01 12, 2009. - 04 21, 2011. - https://share.sandia.gov/news/resources/newsreleases/more-chip-cores-can-mean-slower-supercomputing-sandia-simulation-shows/.
Simon C.A. and Simon H.A. In search of insight [Journal] // Cognitive Psychology. - 1990. - Vol. 22.
Simon Herbert A The Architecture of Complexity [Journal] // Proceedings of the American
Philosophical Society. - 1962. - pp. 467-482.
Starnes Tom Software Development Tools for Multicore Systems, Approaching Multicore
Conference [Webcast]. - [s.l.] : EE Times, September 24, 2010.
Stein Lynn Challenging the Computational Metaphor: Implications for How We Think
[Report]. - 1996.
Sterman John D. Business Dynamics: Systems Thinking and Modeling for a Complex
World [Book]. - Cambridge : McGraw Hill, 2000.
Sutter Herb and Larus James Software and the concurrency revolution [Article] // ACM
Queue. - 2005. - September. - 7 : Vol. 3.
Tabirca Sabin Introduction to Parallel Computing [Online] // Department of Computer
Science - University College Cork. - 09 06, 2003. - 04 15, 2011. - http://www.cs.ucc.ie/~stabirca/AM6011/llnlpp/index.htm.
Turing A. M. On Computable Numbers, with an Application to the Entscheidungsproblem
[Article] // Proc. London Math. Soc. - May 27, 1936. - pp. 230-265.
Turley Jim Editorial: How to Blow $100 Million [Online]. - 09 23, 2010. - 01 20, 2011. - http://www.mdronline.com/editorial/edit24_34.html.
UBM/EE Times Group 2010 Embedded Market Study [Report]. - [s.l.] : UBM/EE Times
Group, 2010.
UBM/EE-Times 2011 Embedded Market Study [Online] // EE Times. - 04 08, 2011. - 04 08,
2011. - http://www.eetimes.com/electrical-engineers/education-training/webinars/4214387/2011-Embedded-Market-Study.
Uhrig Sascha Evaluation of Different Multithreaded and Multicore Processor Configurations
for SoPC [Conference] // SAMOS '09 Proceedings of the 9th International Workshop on
Embedded Computer Systems: Architectures, Modeling, and Simulation. - Heidelberg :
Springer-Verlag, 2009. - pp. 68-77.
Utterback James Mastering the Dynamics of Innovation [Book]. - Cambridge : Harvard
Business School Press, 1994.
VDC Research Embedded Engineers Experience with Multicore and/or Multiprocessing
Designs [Online] // VDC Research. - 02 15, 2011. - 04 06, 2011. - http://blog.vdcresearch.com/embeddedsw/2011/02/embedded-engineers-experience-with-multicore-andor-multiprocessing-designs.html.
VDC Research Executive Brief: 2010 Embedded Processors, Global Market Demand
Analysis [Online] // VDC Research. - 02 2011. - 04 11, 2011. - http://www.vdcresearch.com/_Documents/tracks/t1v1brief-2637.pdf.
VDC Research VDC Research [Online] // 2010 Service Year Track 2: Embedded System
Engineering Survey Data, Vol 5: Processor Architecture Executive Brief. - 09 2010. - 04 02,
2011. - http://www.vdcresearch.com/_Documents/tracks/t2v5brief-2627.pdf.
Williamson Oliver E The Economic Institutions of Capitalism [Book]. - New York : The
Free Press, 1985.