Platform Leadership in Open Source Software

By
Ken Chi Ho Wong
Bachelor of Science, Computing Science, Simon Fraser University, 2005
SUBMITTED TO THE SYSTEM DESIGN AND MANAGEMENT PROGRAM
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN ENGINEERING AND MANAGEMENT
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2015
©2015 Ken Wong. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly paper and
electronic copies of this thesis document in whole or in part in any medium now known or
hereafter created.
Signature of Author: ___________________________________________________________
Ken Wong
System Design and Management Program
February 2015
Advised by:
___________________________________________________________
Michael Cusumano
SMR Distinguished Professor of Management & Engineering Systems
MIT Sloan School of Management
Certified by:
___________________________________________________________
Patrick Hale
Director, System Design and Management Program
Massachusetts Institute of Technology
Platform Leadership in Open Source Software
By
Ken Chi Ho Wong
Submitted to the System Design and Management Program in February 2015,
in Partial Fulfillment of the Requirements for the degree of
Master of Science in Engineering and Management.
Abstract
Industry platforms in the software sector are increasingly being developed in open source. Firms
seeking to position themselves as platform leaders with such technologies must find ways of
operating within the unique constraints of open source development. This thesis aims to
understand those challenges by analyzing the Android and Hadoop ecosystems through an
augmented version of Porter’s Five Forces framework proposed by Intel’s Andrew Grove.
The analysis finds that platform contenders in open source behave differently depending on
whether they focus on competing against alternative platforms or alternative providers of the
same platform as rivals. This focus informs key decisions that the firm takes, including how it
interacts with complementors and its approach to innovation. Because open source
vendors tend to lack unilateral authority over technology decisions, they can only seek to
influence the behavior of the ecosystem by securing key relationships in the value network. In
particular, they must secure the right engineering talent, access to key complements and superior
paths to the customer.
The research highlights some of the factors and tactics platform contenders in Hadoop and
Android considered in acquiring these relationships. The open nature of FOSS (Free and Open
Source Software) also allows new technologies to emerge and change the definition of the
platform’s boundaries. This creates a further strategic challenge for open source platform
contenders.
Keywords: platform strategy, platform leadership, open source software, Hadoop, Android
Thesis Supervisor: Michael Cusumano
Title: SMR Distinguished Professor of Management & Engineering Systems
MIT Sloan School of Management
Acknowledgement
This thesis was made possible by a number of individuals who generously shared their
time and expertise with me. There are only a few names on the cover of this document, but the
content within contains the wisdom and contributions of so many more.
I would especially like to thank Professor Michael Cusumano for his guidance and advice
throughout the entire journey. The breadth of his knowledge and depth of his insights on all
things related to platform strategy are simultaneously humbling and inspiring.
My understanding of the Hadoop ecosystem was greatly informed by a number of
enlightening conversations I’ve had with the thought leaders of that market. I am tremendously
grateful to Rob Bearden (CEO of Hortonworks), Ron Kasabian (GM of Big Data at Intel) and
Mike Olson (Founder and Chief Strategy Officer of Cloudera) for taking time to indulge the
curiosity of a student. The case study that sits at the heart of this thesis would not have been
possible without their assistance.
The time I spent at MIT was also enabled by the fantastic support I received from my
colleagues in SAP’s Analytics Division. In particular, I would like to thank Jesse Calderon, Don
Wakefield and Michael Reh for their sponsorship and encouragement during the past two years.
Though I am no longer a part of SAP, I will take the many things I’ve learned from these leaders
forward with me. The same goes for Pat Hale and the fantastic staff of the SDM program.
Finally, I would like to thank my family for their unwavering support and many sacrifices
that made it possible for me to complete my studies. To my amazing wife Sharon and the active
bundle of joy that she is currently carrying in her tummy: Completion of this program is made
even sweeter by the knowledge that I now have more time to spend with you. Your love is a true
blessing from God.
Table of Contents
Abstract
Acknowledgement
Table of Contents
Introduction
Approach and Structure
Chapter 1 – Literature Review
  Network Effects
  Product vs. Industry Platforms
  Two-Sided Markets
  Topology of Platform Roles and Openness in a Platform-Mediated Network
  Platform Leadership and the “Four Levers” Framework
    Lever 1: The Scope of the Firm
    Lever 2: Product Technology
    Lever 3: External Relationships
    Lever 4: Internal Organization
  Platform Establishment and Displacement
  Open Source Software
  Commercial Interest in Community-driven Development
  Related works on Commercial Open Source
Chapter 2 – Strategic Considerations for Open Source Leadership
  IBM and Eclipse
  The Definition of Open Source Leadership
  Google and Android
  Rivalry – Inter-network vs. Intra-network Competition
  Suppliers – Securing the Upstream Value Chain
  Complementors – Identifying and Securing Critical Complements
  Buyers – Controlling the Path to the Customer
  Substitutes and New Entrants – The Threat of Shifting Platform Boundaries
Chapter 3 – A Case Study on Hadoop
  History and Origins
  Hadoop and the Big Data Phenomenon
    The Relational Database
    Hadoop to the Rescue
  Architectural Overview
  Distributed Storage
    Job Managers and Coordinators
    Distributed Processing Frameworks
    Scripting Engines, Libraries and SQL on Hadoop
    Administration and Management
  Market Overview
  Strategic Factors affecting Platform Leadership within the Hadoop Ecosystem
    Rivalry - Inter-network vs. Intra-network Competition
    Suppliers - Securing the Upstream Value Chain
    Complementors - Identifying and Securing Critical Complements
    Buyers - Controlling the Path to the Customer
    Substitutes and New Entrants - The Threat of Shifting Platform Boundaries
Chapter 4 – Conclusion
  Areas of Further Research
Appendix
List of Figures
List of Tables
References
Introduction
For the first decade of its existence, the idea of publicly sharing source code appeared
to be fundamentally incompatible with the idea of building software for profit. Richard
Stallman, the founder of the GNU Project and a pioneer of the “free and open source software”
(“FOSS”) movement, framed the decision of developing proprietary versus open source software
as a “stark moral choice” between individual profit and the greater good [1].
Despite subsequent attempts by a multitude of individuals (including Stallman himself) to
clarify that the term ‘free software’ refers to the ability to use or modify a product freely and not
to its price, profit-seeking software firms in the late 1980s and the early 1990s largely opted for
proprietary development models in order to maximize appropriability. Firms such as Microsoft,
Oracle and SAP provided real-world evidence for the profitability of the proprietary model by
becoming some of the most valuable companies in the world. These vendors’ extraordinary
successes can be partially attributed to their ownership of proprietary industry platforms in
operating systems, database management systems and applications respectively. Auto-catalyzed
by powerful network effects, tremendously valuable business networks formed around the
technologies provided by these vendors, and these firms leveraged their exclusive ownership of
the core intellectual property to capture a disproportionally large amount of the value generated
in these ecosystems. In fact, some of these firms leveraged their dominant platform positions so
effectively that they were investigated for antitrust violations [2].
The success of these firms has captured the attention of academics and corporations
alike, and a substantial amount of effort has been put into understanding how aspiring platform
providers can replicate their successes. As a result, concepts such as ‘enveloping’, ‘coring’ and
‘tipping’ entered the business lexicon and the strategic management of ‘platform competition’
became a core concern of vendors competing in diverse technology markets ranging from mature
areas like application middleware to the nascent battlegrounds of mobile operating systems.
In many of these markets, open source technologies compete with the offerings of
commercial vendors, with a prominent example being Linux in the operating system space. The
success of these open source platforms has varied greatly, but as end customers and
complement creators become aware of the powerful bargaining position held by proprietary
platform vendors, they are seeking to increase the substitutability of the platform by embracing
open source technology. This behavior is especially common in markets where the cost of
multi-homing (concurrently adopting more than one platform) is high. In enterprise software, some
industry observers have pointed out that nearly all dominant ‘infrastructure’ technologies that
emerged in the last ten years have been open source [3]. Consequently, commercial platform
vendors are recognizing that the open source model is a powerful and occasionally necessary
means of substantially increasing the probability that a given platform gains widespread adoption.
Table 1 enumerates some recent examples of leading open source platforms in significant
markets created by corporate entities with the intent of commercial value extraction.
Market | Platform Technology (Commercial Founder)
Mobile Operating Systems | Android (Google), Sailfish OS (Jolla), Tizen (Samsung, Intel)
Cloud Platforms | CloudStack (VMOps), OpenStack (Rackspace), Eucalyptus (Eucalyptus), SmartOS (Joyent)
Content Management | Wordpress (Automattic), Drupal (Acquia), Alfresco (Alfresco)
Data Management | MySQL (MySQL AB), MongoDB (MongoDB), BigCouch (Cloudant), Riak (Basho), Redis (VMWare, Pivotal), Impala (Cloudera), Talend (Talend)
Application Middleware / Framework | JBoss (JBoss), SpringSource (Springsource), Zend Framework (Zend)
Table 1 - Open source platforms by commercial firms
Given that platform technologies are increasingly being developed with the open source
model, a firm seeking to establish itself as a platform leader in a given space must find
a way to operate within the unique constraints and operating context of open source
development. Pre-existing frameworks for platform leadership management, such as Cusumano
and Gawer’s “Four Levers”, were predicated on the assumption that key decisions such as the
degree of architectural openness were within the platform provider’s locus of control. These
assumptions are invalidated in the FOSS world, and firms seeking to orchestrate the trajectory of
a given platform-based ecosystem need to find other means of exerting their influence; this
research paper is motivated by that need.
Approach and Structure
This thesis is divided into four chapters. In the first chapter, existing research on
platform strategy and open source business models is reviewed in order to establish the
vocabulary and concepts required for analyzing the topic. Those familiar with concepts around
network effects and existing literature on platform strategies are encouraged to skim through this
section.
The second chapter presents a short description of “open source platform leadership”
for the purpose of framing the discussion to follow, along with a composite framework
for understanding the strategies of open source platform vendors, inspired by Andrew Grove’s
Six Forces Model [4]. This framework is illustrated by a case study of Google’s Android
platform. The same framework is used to structure the case study on Apache Hadoop in the third
chapter. An introduction to the history of Hadoop, its relevance to the modern technology
marketplace and an overview of its architecture are presented there along with the profiles
of key ecosystem players. The ecosystem is then analyzed using the framework established in
Chapter 2.
The selection of Hadoop was motivated by the significant technological and economic
impact of this platform technology as well as the author’s personal interest in the subject matter.
The inputs into the case study analysis include secondary research data from existing works, as
well as original data drawn through direct discussions with key influencers within the industry.
The intent of the Hadoop case study is not to project the future of the marketplace, but rather to
understand the strategies of individual vendors in order to appreciate the logic behind their
behavior.
Chapter 1 – Literature Review
Network Effects
The concept of network effects originated in the telecommunication industry and was
formalized in economic models in the early 1970’s by Bell Labs researchers Roland Artle,
Christian Averous and Jeffrey Rohlfs [5][6]. These economists identified a unique type of
consumption externality that occurs in the telecommunication industry known as network effects
or network externalities. They noted that when a customer chooses to ‘consume’ a
networked product by connecting to its network, that decision brings value not only to that
customer but also to all the other members of the network, who were external to that
consumption decision.
In a paper written approximately a decade later, Michael Katz and Carl Shapiro advanced
the concept to industries beyond telecommunication. The essence of the concept is that the value
of any given product is not always strictly a function of the product’s intrinsic quality but that
there are many markets beyond telecommunication where “the utility that a given user derives
from the good depends upon the number of other users who are in the same ‘network’ as is he or
she.” [7]. The network referenced in the aforementioned quote does not refer only to
connections between end users, but also the connections between interdependent firms offering
compatible, complementary products and services for those end consumers. For many such
networks, existing consumers participating in the network do not directly benefit when new
consumers join the network, but they benefit indirectly as more complementary firms are
attracted to the network by these additional consumers. These new complementary firms offer
additional services or capabilities that increase the value of the network, benefiting the original
consumers. The illustrative example that the researchers used was the personal computer
market; the more users adopt a given computer platform, the more likely software producers will
develop for that platform, bringing more value to the existing users. This phenomenon is also
known as increasing returns to adoption. It is worth noting that the effect of network externality
is bidirectional in that it can catalyze both adoption as well as abandonment of a platform. Users
fleeing a platform reduce its value, increasing the relative attractiveness of alternative platforms
and thereby accelerating defection. Figure 1 provides a simple system dynamics model
illustrating this phenomenon.
[Figure 1: a stock of Platform Participants with an Adoption inflow and an Abandonment outflow, each driven by a reinforcing (R) loop labeled Network Effect (+) and Network Effect (-) respectively]
Figure 1 – A simple system dynamics model illustrating the self-reinforcing behaviors of platform adoption and abandonment
due to network externalities (original creation)
Shapiro and Katz also produced a formal economic model of “network competition”,
which provided a basis for understanding the competitive dynamics of markets where multiple
alternative networks compete for the same customer. In the aforementioned personal computer
market, the Windows and Apple ecosystems compete effectively for the same market of personal
computer users. One major insight captured in Shapiro and Katz’s model was the fact that
consumers base their adoption or purchase decision on the expected size of a given network and
not just the current size of the network [7].
The above phenomena combine to create demand-side economies of scale, resulting
in natural market equilibriums where the dominant winner takes most, if not all, of industry
market share [8]–[10]. This autocatalytic nature of network effects is the reason that firms
compete for the position of being the provider of industry platforms.
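The winner-take-most dynamic described above can be illustrated with a small simulation. The Python sketch below is a toy model and not the formal model of Katz and Shapiro: it assumes two competing platforms, a hypothetical `network_strength` exponent that controls how strongly a platform's attractiveness grows with its installed base, and a hypothetical `switch_rate` that controls how quickly users re-evaluate their choice.

```python
def simulate_share(initial_share_a, network_strength, steps=50, switch_rate=0.2):
    """Return platform A's market share over time in a two-platform market.

    All parameters are illustrative assumptions, not values from the literature:
    - network_strength: exponent applied to each platform's share when users
      judge its attractiveness (0 means no network effect at all)
    - switch_rate: fraction of the gap to the "target" share closed per step
    """
    share = initial_share_a
    history = [share]
    for _ in range(steps):
        # Attractiveness of each platform grows with its current installed base.
        attract_a = share ** network_strength
        attract_b = (1.0 - share) ** network_strength
        # Share that A would hold if every user re-chose based on attractiveness.
        target = attract_a / (attract_a + attract_b)
        # Only part of the market re-evaluates its choice in each step.
        share += switch_rate * (target - share)
        history.append(share)
    return history


if __name__ == "__main__":
    for strength in (0.0, 2.0):
        final_share = simulate_share(0.55, strength)[-1]
        print(f"network_strength={strength}: final share of A = {final_share:.3f}")
```

Under these assumptions, a small initial lead is amplified into near-total dominance when the network effect is strong, while the same lead erodes toward an even split when the network effect is absent. The same loop run from a share below one half shows the abandonment side of the dynamic.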
Product vs. Industry Platforms
The term “platform” is a heavily overloaded word in the context of product development.
The term is often used to reference common componentry that is used to build a portfolio of
related products (or “product family”). The motivation behind the creation of such platforms is
varied, but generally revolves around the idea that efficiencies can be gained by sharing the
common costs, risks and benefits of development and manufacturing across multiple products.
Examples of such platforms can be found abundantly in the automobile industry, where the vast
majority of vendors offer a large number of product variants based on a much smaller number of
base platforms. For the purpose of disambiguation, researchers refer to this concept as “product
platform”.
According to de Weck, Suh and Chang, the design of a product platform is a firm-internal
optimization problem; a firm must search through the space of platform design possibilities in
order to identify a design that maximizes the cost savings of component reuse, while
simultaneously minimizing the compromises associated with component sharing [11]. In
contrast, the search space for the design of an industry platform is far larger by definition, and
the analysis of such platforms is not bounded to a single firm. Industry platforms are the
technological infrastructure that allows independently evolving goods and services from different
firms to be connected together into an interdependent system that creates value [12]. This thesis
is directed at studying the strategies of firms attempting to develop such industry platforms and
consequently all subsequent references to “platforms” are made in this vein.
Two-Sided Markets
Platform-mediated technology ecosystems are often modeled as two-sided markets. On
one side of the platform sit customers (e.g. Personal Computer users), who are trying to
consume the combined solution that consists of the platform (e.g. Windows) and complementary
products (e.g. Application Software) offered by the suppliers residing on the other side. This
model of an ecosystem allows for more precise characterization of the different types of network
externalities that occur within a platform-based ecosystem, allowing scholars to differentiate between
“same-side” and “cross-side” network effects. “Cross-side” effects are generally
reinforcing. The value of the platform increases for a consumer when additional
complementors join the network, and vice versa. In contrast, “same-side” effects are
typically reinforcing on the consumer side and balancing on the complementors’ side.
Additional consumers increase the viability of the platform and therefore its value to other
consumers. However, additional complement suppliers increase the level of competition within
the platform and diminish its value for other suppliers. Figure 2 provides a simple system
dynamics model illustrating these different forces.
[Figure 2: Potential Customers flow into Platform Customers through Adoption, and Potential Complementors flow into Platform Complementors through Complementor Platform Adoption. Loops shown: a reinforcing (R) Same-Side Network Effect, a reinforcing (R) Cross-Side Network Effect, and a balancing (B) Complementor Competition loop]
Figure 2 - A simple system dynamics model illustrating the two different types of network effect at work in a two-sided platform
(original creation)
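The interplay of the reinforcing and balancing loops described above can be sketched numerically. The following Python example is a minimal illustration under assumed parameters, not a calibrated model: the weights `same_side`, `cross_side` and `crowding` are hypothetical, and both populations are expressed as fractions of their potential markets.

```python
def simulate_two_sided(crowding, steps=200, dt=0.1, same_side=0.5, cross_side=1.0):
    """Simulate adoption on both sides of a two-sided platform.

    Returns the final (customers, complementors) adoption fractions.
    All weights are illustrative assumptions:
    - same_side: pull that existing customers exert on potential customers
    - cross_side: pull that each side exerts on the other side
    - crowding: how strongly existing complementors deter new complementors
    """
    customers, complementors = 0.01, 0.01
    for _ in range(steps):
        # Customers: reinforcing same-side and cross-side effects.
        customer_pull = same_side * customers + cross_side * complementors
        # Complementors: reinforcing cross-side effect, balanced by competition
        # from complementors already on the platform (never a negative pull).
        complementor_pull = max(0.0, cross_side * customers - crowding * complementors)
        # Adoption slows as the pool of potential adopters is exhausted.
        customers += dt * customer_pull * (1.0 - customers)
        complementors += dt * complementor_pull * (1.0 - complementors)
    return customers, complementors


if __name__ == "__main__":
    for crowding in (0.0, 2.0):
        cust, comp = simulate_two_sided(crowding)
        print(f"crowding={crowding}: customers={cust:.2f}, complementors={comp:.2f}")
```

In this sketch, customer adoption saturates in both scenarios, while the complementor side levels off well below full adoption once the assumed competition effect is switched on, which mirrors the balancing loop in the figure.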
Due to cross-side network effects, vendors on a given platform may welcome the
entrance of additional competition in the form of other complementary vendors. This can occur
if the entrance of additional vendors increases the viability and attractiveness of the platform and
these gains sufficiently offset the effects of additional competition. This is especially true in
cases where there are barriers preventing complementors from “multi-homing” on multiple
platforms and the intensity of network-level competition between platforms exceeds that of
individual complementors. As an illustrative example, software vendors invested in developing
natively on Blackberry’s OS10 mobile operating system are likely to welcome additional
vendors to develop apps for that platform, as a more vibrant app ecosystem is expected to offset
additional competition that they would face within the Blackberry App Store. Consistent with
this observation, Kevin Boudreau found that an increase in the variety of software application
producers within a mobile application ecosystem “increases innovation incentives” due to
network effects [13]. This phenomenon echoes Shapiro and Katz’s earlier research, which
showed that under network competition, a monopolist complement supplier within a given
network would counter-intuitively benefit from the entry of additional complement suppliers.
However, Boudreau noted that an increase in similar types of application producers actually
diminishes the motivation of developers as they become “crowded out” of the market.
The two-sided model also illustrates that platform providers must find ways of attracting
participants to both sides of the platform in order for the ecosystem to become viable. The study
of how platform vendors manage this has been a subject of great interest. One strategy is
cross-side subsidies. A two-sided platform provider may focus its monetization strategy on a single
side and opt to offer “free” or heavily subsidized goods and services on the other. Van Alstyne
and Parker showed that by lowering prices on one side of the network, platform providers can
change the shape of the demand curve on the other side, resulting in a net increase in overall firm
profits. As each “side” of the platform represents markets in their own right, this results in an
interesting phenomenon where an effective monopolist in a given market may volunteer to lower
its price below its marginal cost in order to maximize profits. For example, video game platform
vendors like Sony or Microsoft often choose to offer their software development toolkits to video
game producers for free (or close to free), despite the fact that they are effectively the only
supplier for that essential ingredient to video game production [14]. Conversely, price increases
on one side of the network, even in a price-inelastic market, may have the counter-intuitive
effect of lowering organizational profit due to their negative impact on demand on the other side;
the cross-side implications of price changes in a two-sided network make pricing a complicated
matter.
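The subsidy logic above can be made concrete with a small worked example. The Python sketch below is not the model of Parker and Van Alstyne; it assumes linear demand on both sides, zero marginal cost, and hypothetical parameters (`base_dev`, `base_con`, `cross_effect`), and it searches a price grid for the profit-maximizing pair of prices. Because marginal cost is assumed to be zero, any negative developer price is a price below marginal cost.

```python
def platform_profit(price_dev, price_con, cross_effect, base_dev=10.0, base_con=16.0):
    """Profit of a platform that charges developers and consumers separately.

    Illustrative assumptions: linear demand, zero marginal cost, and a
    one-directional cross-side effect in which each developer on the platform
    raises consumer demand by `cross_effect` units.
    """
    developers = max(0.0, base_dev - price_dev)
    consumers = max(0.0, base_con + cross_effect * developers - price_con)
    return price_dev * developers + price_con * consumers


def best_prices(cross_effect):
    """Grid-search the (developer price, consumer price, profit) that maximizes profit."""
    dev_prices = [i * 0.5 for i in range(-20, 21)]  # -10.0 to 10.0, subsidies allowed
    con_prices = [i * 0.5 for i in range(0, 81)]    # 0.0 to 40.0
    profit, p_dev, p_con = max(
        (platform_profit(p_dev, p_con, cross_effect), p_dev, p_con)
        for p_dev in dev_prices
        for p_con in con_prices
    )
    return p_dev, p_con, profit


if __name__ == "__main__":
    for effect in (0.0, 1.0):
        p_dev, p_con, profit = best_prices(effect)
        print(f"cross_effect={effect}: developer price={p_dev}, "
              f"consumer price={p_con}, profit={profit}")
```

With no cross-side effect, the platform in this sketch prices each side as an ordinary monopolist would. Once developers are assumed to raise consumer demand, the profit-maximizing developer price turns negative: the platform pays developers to join and recovers more than the subsidy from consumers.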
Topology of Platform Roles and Openness in a Platform-Mediated Network
Eisenmann, Parker and Van Alstyne identified four distinct roles that participants
can play in a platform-mediated network [15]. Beyond identifying “demand-side”
and “supply-side” platform users, which refer to consumers and complement providers
respectively, the trio further differentiated between “platform providers” and “platform sponsors”
(Figure 3). The platform provider acts as the “primary point of contact” for users on both sides
of the platform while the platform sponsor is responsible for determining which parties may
participate in the network. For example, banks such as Citi or Barclays act as platform providers
for credit payment networks, whereas Visa itself acts as the platform sponsor. The trio asserted
that platforms differ in their degree of openness to these different roles. Based on the
categorization provided by this group, the “sponsor” role of Linux is occupied by the open
source community and is therefore highly open. Table 2 enumerates a few select computing
platforms and the openness of their various platform roles as identified by Eisenmann, Parker
and Van Alstyne.
Role | Linux | Windows | Macintosh | iPhone
Demand-side Platform User | Open | Open | Open | Open
Supply-side Platform User | Open | Open | Open | Closed
Platform Provider | Open | Open | Closed | Closed
Platform Sponsor | Open | Closed | Closed | Closed
Table 2 - Comparison of openness by role in platform-mediated networks. Reproduced. [15]
Figure 3 – Roles and Relationships in a Platform-Mediated Network according to Parker, Eisenmann and Van Alstyne
(Reproduction of Figure 2) [15]. An open source platform market can be viewed as a market where the role of the platform
sponsor is played by an open source community.
Platform Leadership and the “Four Levers” Framework
During the explosive growth phase of the personal computer (PC) market in the nineties, Microsoft
and Intel jointly led the platform powering that market. Annabelle Gawer hypothesized in her
doctoral thesis that Intel’s continued success in a highly fragmented and vertically disintegrated
market stemmed from a highly evolved practice of fostering and managing the creation of
complementary products in the personal computer ecosystem. This sophisticated platform
management practice enabled Intel to establish itself as one of the primary beneficiaries of the
growth in the PC ecosystem. She observed that in a rapidly changing technological landscape
like that of the personal computer market, platform providers cannot simply seek to leverage
cross-side network effects by maximizing the supply of complementary products, but must rather
ensure that complementors are “innovating in ways that are favorable” to their platform. For this
reason, Gawer defined platform leadership as “a firm’s ability to influence the development of a
large number of complementary products by almost all other firms in their industry”[16]. This
definition of platform leadership is used throughout this paper. It is worth noting that this
definition is not restricted to a specific platform “role”; a platform leader can play any of the four
roles identified by Eisenmann, Van Alstyne and Parker.
In 2002, Gawer further validated and elaborated this work with her thesis supervisor
Michael Cusumano, by categorizing Intel’s activities into four aspects of platform leadership
management. The pair called this framework “the Four Levers of Platform Leadership” [10]. An
overview of the four levers identified is presented in the sections below.
Lever 1: The Scope of the Firm
A platform firm must continuously decide which portions of the overall system to deliver
itself and which to leave for complementary vendors in the ecosystem. This is a continuous
process as the platform vendor must assess its own capabilities and the dynamics of the
marketplace (including the behavior of platform competitors) and adjust its approach. As an
example, Microsoft has always been willing to directly compete with its software application
partners due to its immense software development capability, but had left hardware largely to its
partners. However, the company made a drastic change to this approach in 2013 by acquiring
Nokia’s mobile phone business for $7.2 Billion USD. Understanding the motivations and
decision-making process of this strategic change is beyond the scope of this paper, but it
illustrates the dynamic nature of firm-scope management.
Lever 2: Product Technology
A platform leader’s decisions regarding the design of its technology’s architecture,
interfaces and intellectual property management significantly affect the nature of innovation that
participants of its ecosystem are able to contribute. A modular architecture enables contributions
from complementors and is generally preferred over “integrated” architectures with low
substitutability of components from the perspective of innovation enablement. However,
platform firms must determine the openness of their platform interfaces, balancing the
competitive advantages offered by exclusive proprietary access to ‘core’ platform functionality
with the need to encourage complementors. A similar balancing act occurs with regards to the
management of intellectual property. Generally speaking, the more open a platform leader is
with its intellectual property, the more vibrant is its ecosystem. As with the previous lever, the
management of product technology is also a continuous process, though it is worth noting that it
is typically more difficult for a firm to restrict an open policy than the converse.
Lever 3: External Relationships
Beyond making internal decisions regarding the scope and nature of its technologies,
platform leaders must also orchestrate the actions of complementary vendors in a manner that is
favorable to the platform. Gawer and Cusumano found that Intel was especially mature in this
aspect of platform leadership, acquiring organizational capabilities and making substantial
investments to build consensus and control of platform decision making, as well as encouraging
the right balance of collaboration and competition between complementors. As an example,
Intel shared with Gawer that it employs a unique strategy when attempting to drive the definition
of new interfaces for the PC platform. It attempts to create “momentum” behind new interface
standards by initiating the design process with a small interest group of the most influential
players within the ecosystem before involving the larger ecosystem. This approach helps to
relieve the “design by committee” challenges of completely open and democratic processes
while maintaining the benefits of having external contribution and validation.
Lever 4: Internal Organization
The final lever Cusumano and Gawer identified pertains to the internal organizational
structure and processes that a platform leader puts in place to manage the inherent tension of
managing collaboration and competition. At Intel, this began with the identification and
differentiation of the competing objectives of the company: “Job 1” refers to the core
organizational objective to sell more microprocessors, “Job 2” refers to the desire to
compete directly in complementary businesses, and “Job 3” refers to the task of growing new
lines of business that may not be directly related to the core microprocessor business. By
acknowledging the inherent conflicts that these objectives create and by providing a vocabulary
for discussing them, Intel enables its management team to manage this tension proactively and
consciously. Beyond this, Intel also created organizational groups that were dedicated to these
different objectives. This not only focused the internal groups but also created a level of
organizational separation that alleviated the conflicts of interest that external partners perceive.
Platform Establishment and Displacement
While Gawer and Cusumano focused their studies on the activities required to manage
and sustain the position of a platform leader, others focused their attention on the process of
establishing and displacing industry platforms. The early works of Rohlfs had established that
even potentially viable networks will naturally be attracted to a stable equilibrium of zero
participants unless a critical mass of participants is reached [5]. Evans and Schmalensee
extended the model to two-sided platform businesses and illustrated the challenge of reaching the
critical threshold on both sides of the platform. Depending on whether it is easier for potential
participants on a given side to join the platform or for existing participants to drop off, firms
aiming to launch two-sided platforms may find themselves in a position to subsidize
participation on both sides of their network in order to reach critical mass. The pair also found
that design decisions that reduce the resistance to participation on both sides of the network not
only lower the critical mass required, but also increase the equilibrium adoption of the platform
once established.
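The critical-mass argument can be illustrated with a minimal numerical sketch. It is a one-sided toy model in the spirit of Rohlfs, not the two-sided model of Evans and Schmalensee, and every name and parameter in it is an assumption introduced for illustration. Potential users have a taste parameter theta spread uniformly over [0, 1]; a user joins when theta * x >= p, where x is the current adoption share and p stands for the resistance to participation (a price or joining cost).

```python
import math


def next_adoption(x, p):
    """Share of users who join next period, given the current share x.

    Assumption: tastes are uniform on [0, 1] and a user with taste theta
    joins when theta * x >= p, so the joining share is 1 - p / x.
    """
    if x <= p:
        return 0.0  # nobody values the network enough to pay p
    return 1.0 - p / x


def simulate(x0, p, steps=200):
    """Iterate the toy dynamic from an initial adoption share x0."""
    x = x0
    for _ in range(steps):
        x = next_adoption(x, p)
    return x


def fixed_points(p):
    """Solve x = 1 - p / x, i.e. x**2 - x + p = 0, for p < 0.25.

    Returns (critical mass, stable equilibrium adoption).
    """
    root = math.sqrt(1.0 - 4.0 * p)
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0


if __name__ == "__main__":
    for p in (0.16, 0.09):
        critical, equilibrium = fixed_points(p)
        print(f"p={p}: critical mass={critical:.2f}, equilibrium={equilibrium:.2f}")
        print("  start just below critical mass ->", simulate(critical - 0.01, p))
        print("  start just above critical mass ->", round(simulate(critical + 0.05, p), 4))
```

In this sketch a network that starts below the critical mass unravels to zero participants, while one that starts above it grows to the higher equilibrium. Lowering p reduces the critical mass and raises the equilibrium at the same time, which mirrors the qualitative finding described above.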
The topic of platform displacement closely relates to the research on architectural
innovation and its impact on incumbent leaders completed by Henderson and Clark [17]. Prior to
their research, the prevailing understanding was that while all innovations create opportunity for
new entrants to enter the market and displace the incumbent, only ‘radical’ innovations that
eradicated the technological competencies of incumbents create conditions favoring the new
entrant. In studying the then-nascent semiconductor industry, Henderson and Clark observed
that evaluating the disruptive nature of an innovation based on the degree of technological
change was inadequate, as there were numerous occasions where seemingly incremental
technological changes resulted in the displacement of industry leaders. Instead, Henderson and
Clark found that architectural innovations – innovations that impacted the manner in which
components of a product system connected together – tended to be significantly more disruptive
to incumbent firms than technological innovations at the component level, regardless of the
degree of ‘radicalness’. The pair hypothesized that architectural innovations tend to be more
disruptive to the established firm as architectural knowledge tends to be captured in a firm’s
“structure and information-processing processes”, which is difficult for a successful firm to
recognize and address. Similarly, platform displacement is also more likely when an innovation
allows for a reconfiguration of how the participants interact with one another.
An alternative path to platform displacement identified by Eisenmann, Parker and Van
Alstyne is the strategy of platform envelopment. Envelopment is interesting as it “provides a
mechanism for platform leadership change that does not require breakthrough innovation or
Schumpeterian creative destruction” [18]. The authors found that while network effects make it
difficult for a new platform entrant to displace an established platform, the incumbent may be
displaced through envelopment when the capabilities of its platform become bundled as a part of
an enveloping platform serving an adjacent market. In other words, the new entrant or attacker
can expand the scope of the competition and leverage economies of scope and scale on the
demand or supply side in order to create leverage for displacing an incumbent vendor.
Eisenmann, Parker and Van Alstyne cite the example of Microsoft’s successful “attack” on
RealNetworks’ dominant media streaming platform in the late 1990s by bundling streaming
services into its Windows NT server offering. The authors established a taxonomy for
envelopment attacks consisting of three major categories: conglomeration, intermodal, and
foreclosure (Table 3). In reaction to such attacks, a platform leader can seek to match the new
bundle, pursue legal protection, or exit the market if it cannot match the new entrant on the new
basis of competition.
Attack Type | Description and Example
Conglomeration | The attacker joins functionally unrelated platforms together to create a new bundle in order to leverage the economies of scope and scale to its competitive advantage. E.g., cable television and telephone firms reciprocally bundling TV, phone and internet services to attack each other.
Intermodal | The attacker bundles two weak substitutes that deliver the same functionality through different modalities, offering a single composite platform that negates the user’s need to choose between different modes. E.g., Netflix bundling DVD-by-mail and streaming delivery as a single product.
Foreclosure | The attacker bundles two complementary capabilities together to create a synergistic offering that is superior to the individual products. E.g., LinkedIn’s bundling of social network and job matching into a single platform.
Table 3 - Taxonomy of Envelopment Attacks
Open Source Software
Most scholars trace the origins of open source to the Free Software Movement started by
Richard Stallman when he launched the GNU Project [1]. Responding to a perceived increase in
the limitations imposed by proprietary software vendors, the movement sought to restore the four
“essential freedoms” of computer users through the creation of “free software”. According to
Stallman, these freedoms are:
0. The freedom to run the program as you wish, for any purpose.
1. The freedom to study how the program works, and change it so it does your
computing as you wish. Access to the source code is a precondition for this.
2. The freedom to redistribute copies so you can help your neighbor.
3. The freedom to distribute copies of your modified versions to others. By doing this
you can give the whole community a chance to benefit from your changes. Access to
the source code is a precondition for this.
In the beginning of the movement’s history, the development of such software was
primarily of interest to academics and sophisticated individuals whose freedoms were infringed.
Despite Stallman’s repeated reminders to think of free software as “‘free speech,’ not ‘free
beer,’” few commercial enterprises engaged in the creation of “free” software. It was widely
perceived that the act of sharing intellectual property was counterproductive to the objective of
profit extraction. In his widely cited paper “Profiting from Innovation”, David Teece argues that
innovators are best positioned to capture the value of their inventions if their intellectual property
is legally protected or “the nature of the product is such that trade secrets effectively deny
imitators access to the relevant knowledge”[19]; this condition is known as a tight
appropriability regime. The act of sharing source code increases access to potentially
differentiating knowledge, and therefore reduces appropriability. Given that the majority of
profit-seeking software firms view themselves as innovators trying to capitalize on their
creations, such firms tended to view the “free software” world with skepticism.
Commercial Interest in Community-driven Development
Commercial interest in community-driven development began to shift in the mid-nineties
with the emergence of Linux. University of Helsinki student Linus Torvalds began developing
Linux in 1991 as a personal project to maximize the capabilities of his specific hardware at the
time. The popularity of the project exploded shortly after Torvalds shared his work and adopted
the GNU General Public License in 1992 (the original license that shipped with Linux explicitly
forbade commercialization). Today, Linux not only powers the personal computers of
dedicated hobbyists like Torvalds, but also competes successfully against proprietary
commercial systems in devices ranging from mobile handhelds to the most powerful
supercomputers in the world (Figure 4).
[Figure 4: three charts titled “Worldwide Server Operating Environments, 2012”, “Mobile Operating Environment Shipments, 2012” and “OS of the World's Top 500 Supercomputers, 2014”]
Figure 4 - Linux (in Orange) market share in various computing segments [20], [21]
Commercial interest in Linux and open source was initially motivated by the surprising quality
and effectiveness of the community-driven development model. As DiBona, Ockman and Stone
observed, the Linux volunteer community “produced a piece of software that would otherwise
require the might and resources of someone like Microsoft to create”[22]. Aided by the rapid
advance in internet collaboration technologies, the open source model managed to reconfigure
the requirements of production so substantially that it eliminated what had previously been
regarded as a natural monopoly.
In his popular 1999 essay “The Cathedral and the Bazaar”, Eric Raymond observed that
the development style employed in the development of Linux was unique even within the open
source community and hypothesized that this development style was a major reason for the
project’s success. Chief amongst Raymond’s findings was that Linus Torvalds focused his
attention on keeping his contributors engaged and worked to ensure constant activity within the
community (i.e. operating a bazaar) rather than enforcing consistency and architectural elegance
(i.e. building a cathedral). Moreover, Raymond asserted that the project’s frequent release cycle
and large, active user base assure the project’s quality more effectively than traditional software
engineering practices.
Raymond attributed much of Linux’s success to the rapid progress enabled by this
“Bazaar” model and attempted to validate the effectiveness of this approach by consciously
developing his own project in the same manner. Raymond’s findings influenced Netscape to
release the code for its then-popular web browser Communicator under an open source license
with the hope of leveraging the development capabilities of the open source community as a
competitive advantage over Microsoft. Unfortunately, Netscape’s effort failed to garner
sufficient attention from the community, and the company was acquired by AOL before
eventually being disbanded. Raymond attributed Netscape’s failure to engage the community to
its lackluster efforts in removing the barriers to entry for contributors. As an example,
contributors to the product needed a license for a third-party UI library (Motif) just to work on
the product during its first year, which created a significant barrier for participation. While
Netscape failed to leverage the open source community as an engine for commercial success, the
challenges of their initial open source project led to the creation of the Mozilla project, which did
manage to attract significant community attention and resulted in the popular browser Mozilla
Firefox.
Despite the critical role of the community development model in progressing a number of
foundational technologies in the modern technology landscape, commercial interest in free and
open source development remained modest until the emergence of the Open Source Initiative
(OSI) and the highly publicized growth of Red Hat, Inc. in the late 90’s. Up until the formation of
the OSI, the commercial software development world did not delineate between the moral and
philosophical ideals of “Free Software” and the more pragmatic motivations behind community-driven development. Founded by Eric Raymond and Bruce Perens in late February 1998, the
Open Source Initiative was created shortly after Netscape announced in January 1998 that it
would release its proprietary source code. The pair wanted to use the highly publicized event to advocate for “the superiority
of an open development model” [23]. The term “open source” was coined at that time in order to
avoid “the philosophically- and politically-focused label of ‘free software’” [23]. The OSI
created a pragmatic definition of “open source” software free of the constraints and judgmental
ideology of the Free Software Foundation, instead focusing on practically capturing the
requirements of licenses that should be considered “open” (Table 4). In particular, the definition
of open source software explicitly accommodates licenses that allow proprietary, “paid” software to be
derived from open source software, which copyleft “Free Software” licenses such as the GPL do not permit.
Criteria | Description
Free Redistribution | The license must allow for free or paid redistribution of the software without royalties or other fees.
Source Code | Source code for the software must be reasonably available and the license must allow for its redistribution.
Derived Works | “The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.”
Integrity of the Author’s Source Code | “The license must explicitly permit distribution of software built from modified source code.”
No Discrimination Against Persons or Groups | “The license must not discriminate against any person or group of persons.”
No Discrimination Against Fields of Endeavor | “The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.”
Distribution of License | “The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.”
License Must Not Be Specific to a Product | “The rights attached to the program must not depend on the program's being part of a particular software distribution.”
License Must Not Restrict Other Software | “The license must not place restrictions on other software that is distributed along with the licensed software.”
License Must Be Technology-Neutral | “No provision of the license may be predicated on any individual technology or style of interface.”
Table 4 – A description of the ten criteria of open source software as defined by the Open Source Initiative. Modified and
Adapted from http://opensource.org/osd
This “rebranding” effort and the pragmatic, business-case driven approach of the OSI can
partially be credited for the increased interest in commercial open source development over the
past decade, though the success of Linux (and Red Hat in particular) likely also played a pivotal
role. The study of open source business models within the research community has
correspondingly increased.
Related works on Commercial Open Source
In his 2005 paper, Sandeep Krishnamurthy attempted to categorize the business models
of open source firms into four distinct categories: (1) software distributors, (2) software
producers following the GPL model (firms leveraging open source components to create
derivative products that were also open source), (3) software producers not following the GPL
model (firms leveraging open source components to create derivative products that were
proprietary), and (4) third-party service providers [24]. Krishnamurthy further summarized that
the primary appeal of open source products to corporations stems from the perception of superior
performance, lowered adoption risk and lower total cost of ownership. Finally, he identified
community support, presence of proprietary or open source competition, relative competitiveness
and marketing as the key factors affecting the profitability of open source firms.
In a Research Policy article on the topic of “melding proprietary and open source
platform strategies”, Joel West analyzed three major software platform vendors’ explorations
with community-driven development in order to understand the strategies and motivations for
participation in open source [25]. West observed that the modern computer industry evolved
from a vertically integrated market dominated by vendors who held end-to-end control of the
“stack” to a market comprised of horizontally dominant platform firms, exemplified by
Microsoft and Intel. Focusing on the operating system as the platform of study, he observed that
industry interest in open source was motivated by a desire of the leading contenders in the
industry to challenge Microsoft’s dominant Windows platform. West chronicled the differing
efforts of IBM, Apple and Sun Microsystems as they engaged the open source movement and
identified that their participation in open source was motivated by different intentions. Table 5
summarizes West’s discussions on the motivations of these different vendors.
Vendor | Open Source Projects | Intentions
Apple | FreeBSD, OpenDarwin | Apple’s primary intention for participating in open source was to leverage and reuse some of the market-leading components that were being built in the open source community (in particular the FreeBSD project).
IBM | Apache, Eclipse, Linux | As a part of shifting its business focus towards applications and application platforms, IBM tried to reduce the control that Microsoft held as the platform leader by pushing the computing industry towards open standards while positioning itself as the leading integrator of technology. Its involvement in projects like Apache and Eclipse was a means of accelerating the development of its proprietary products.
Sun Microsystems | Java, OpenOffice, Linux | Sun’s motivation for engaging in open source was primarily to leverage the “horsepower” of the open source community to accelerate the development of alternate platforms to challenge Microsoft’s leadership in the market of application frameworks (i.e. .NET vs. Java) and office productivity (i.e. Office vs. OpenOffice).
Table 5 - Apple, IBM and Sun Microsystems’ involvement in open source and their motivations for participating. Summarized
from the contents of [25].
West summarizes that there are two broad approaches to blending proprietary and open
source strategies. Firstly, a firm can choose to concede the more “commoditized” layers of the
platform to the open source community in order to focus their investments in differentiating
layers. Apple’s decision to adopt a variant of the open source FreeBSD kernel for its operating
system while continuing to develop a unique user interface shell exemplifies this approach. The
second approach is to disclose technologies and intellectual property to promote indirect network
effects. Both IBM and Sun’s experimentations with open source reflect this latter approach.
West’s observations regarding IBM’s motivations behind its open source strategy are
supported by Capek, Frank et al. in their 2005 article in the firm’s own IBM Systems Journal
entitled “A history of IBM’s open source involvement and strategy” [26]. The paper, written by
IBM employees who were involved in the various efforts, highlighted IBM’s recognition
that the open source movement was a business reality within its industry and recalled its efforts in
harnessing the movement for its own strategic intentions. The authors summarize IBM’s
strategic intentions for involvement in open source as: (1) encouraging the use of “open source
implementations of open standard”, (2) fostering greater variety and choice and (3) enhancing
IBM’s mindshare.
Nicholas Economides and Evangelos Katsamakas developed economic models of
the competition between proprietary and open source platforms in their 2006 article in
Management Science [27]. The pair observed that a proprietary platform vendor in a multi-sided
market could find it profitable to set a price below its marginal cost on one side in order to
maximize profits on all sides. The models found that markets supported by proprietary platforms
are more profitable than those supported by open source platforms, though the variety of
available complements is generally higher in open source platforms. However, this finding was
made with the assumption that open source platform profits (the profits derived from the selling
of the platform itself) are always zero.
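The below-cost pricing result can be made concrete with a small worked example. The sketch below is not the Economides and Katsamakas model; it is a deliberately simple two-sided toy whose functional forms and parameter names (`cost`, `beta`) are assumptions made for illustration. Users buy the platform according to a linear demand curve, each user generates up to `beta` units of complement demand on the developer side, the platform charges developers a per-unit fee, and only the user side carries a marginal cost.

```python
def platform_profit(p_user, p_dev, cost, beta):
    """Profit of a hypothetical proprietary platform selling to two sides.

    Assumptions: user demand is 1 - p_user; complement sales are
    beta * users * (1 - p_dev); serving developers costs nothing.
    """
    users = max(0.0, 1.0 - p_user)
    dev_sales = beta * users * max(0.0, 1.0 - p_dev)
    return (p_user - cost) * users + p_dev * dev_sales


def best_prices(cost, beta, grid=200):
    """Brute-force search over a price grid on [0, 1] x [0, 1].

    Returns (profit, p_user, p_dev) for the most profitable grid point.
    """
    best = None
    for i in range(grid + 1):
        for j in range(grid + 1):
            p_user, p_dev = i / grid, j / grid
            profit = platform_profit(p_user, p_dev, cost, beta)
            if best is None or profit > best[0]:
                best = (profit, p_user, p_dev)
    return best


if __name__ == "__main__":
    cost = 0.5
    for beta in (0.0, 3.0):
        profit, p_user, p_dev = best_prices(cost, beta)
        print(f"beta={beta}: user price={p_user}, developer fee={p_dev}, "
              f"profit={profit:.4f}, below cost on user side: {p_user < cost}")
```

Under these assumptions, when the cross-side effect `beta` is zero the platform prices above its marginal cost on the user side. When `beta` is large enough, the most profitable user price falls below marginal cost because each subsidized user raises the revenue collected from the developer side.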
Chapter 2 – Strategic Considerations for Open Source Leadership
What does it mean to be an “open source platform leader”? A naïve definition would be a
simple literal interpretation of the two clauses from the phrase’s bisection – a “platform leader”
that leverages “open source” technologies. However, the use of open source technology is so
prevalent today that there are scarcely any commercial software firms that do not leverage open
source in some form, and consequently this definition is too broad to be useful. Even Microsoft,
the canonical proprietary platform vendor, utilized open source libraries in the delivery of its
Windows NT operating system [28]. Similarly, Cisco Systems delivers a variety of hardware
devices that run embedded versions of Linux and is generally considered a “platform leader” by
scholars like Cusumano and Gawer. Clearly, the moniker of “open source platform leader”
appears to be ill-fitting for these firms that appear to epitomize proprietary platform leadership.
A more restrictive definition of the term “open source platform leader” would be “a
leading provider of open source platforms”. In other words, candidacy for “open source platform
leadership” is restricted to platform providers who specifically create open source products. This
definition roughly aligns with Krishnamurthy’s second category of commercial software
vendors, which he called “software producers following the GPL model”, broadened to include
other types of open source licenses, but with the additional constraint of the software being a
platform product. At first glance, this appears to be a more appropriate definition, as it would
remove the above counterexamples such as Microsoft or Cisco from candidacy without
disqualifying obvious candidates like Red Hat. However, this definition is actually too narrow as
it excludes many firms that opt to utilize open source strategically in order to drive the adoption
of their platform products, even proprietary ones. One excellent example of such a firm is IBM
and its inception of the Eclipse open source project in the software application platform space.
IBM and Eclipse
In 2001, IBM open sourced its Eclipse technology and founded the Eclipse consortium in
order to drive the adoption of the Java-based application platform and integrated development
environment (IDE) as an alternative to Microsoft’s .NET and Visual Studio stack [29]. After
utilizing the resources it had obtained from its acquisition of Object Technology International
(OTI) to develop an internal product platform to improve the development efficiency and
consistency of its own application development, IBM decided to leverage OTI’s technology to
drive the adoption of its WebSphere suite of application platform technologies. Recognizing it
was a latecomer to the application platform market competing against a powerful incumbent with
an established ecosystem in Microsoft, IBM theorized that “in order to build momentum around
[Eclipse] and to get more vendors to build their products on top of it, [it] had to make it open
source”.
IBM’s hypothesis appeared to be correct, as Eclipse became one of the most popular
open source projects in the world and established itself as the dominant player within the market
of Java Integrated Development Environments. Only four years after IBM made Eclipse open
source, a survey conducted by the publishers of SD Times (a popular software development trade
magazine at the time) found that approximately two-thirds of all its readers utilized Eclipse in
their workplace (Figure 5), followed by IBM’s proprietary WebSphere Studio Application Developer (WSAD) at 21%. Eclipse’s substantive
user base as well as vibrant ecosystem of complementary vendors provided a substantial
competitive advantage to WSAD.
[Figure 5 chart: percentage axis from 0% to 70%, time axis from Jan-02 to Jan-05. Legend: Eclipse; IBM WebSphere Studio App. Developer*; Borland JBuilder; Sun NetBeans; Oracle JDeveloper; JetBrains IntelliJ IDEA*; BEA WebLogic Workshop; Microsoft Visual J++]
Figure 5 - Results from the "Java Use and Awareness Study" from BZ Research, 2005
The specific license that IBM put in place when establishing Eclipse was a derivative of
IBM’s Common Public License. The license ensured that IBM was able to commercialize
Eclipse technology without having to release the derivative product as open source. As a
consequence, IBM was able to leverage Eclipse’s success to bootstrap the ecosystem of its own
proprietary WebSphere Studio Application Developer (WSAD) platform product, which
provided additional functionality while being fully compatible with complements produced by
the Eclipse ecosystem.
Although IBM spun off the responsibility of managing Eclipse to an independent nonprofit organization, the Eclipse Foundation, it continues to be a major force driving the continued
evolution of the Eclipse platform. One way this is made evident is in the fact that IBM continues
to be the single largest contributor to the project (Figure 6). Given IBM’s ability to influence the
activities of others to enhance its own offering, it clearly exhibits the qualities of a “platform
leader” in this context. Moreover, IBM’s strategic and intentional use of the open source model
as a means of bootstrapping a proxy ecosystem for its own proprietary offering suggests that it
should be considered an open source platform leader.
[Figure 6 chart: “Eclipse Project Committers by Company”. Legend entries recovered: IBM (32%), Oracle, itemis AG]
Figure 6 - Eclipse Project Committer by Company (excluding committers without and with unknown corporate affiliations).
Taken from http://dash.eclipse.org on August 4th, 2014
The Definition of Open Source Leadership
IBM’s utilization of the open source community to accelerate the adoption of its platform
technology against an incumbent platform rival is not a particularly unique or novel tactic. As
mentioned during the introduction to this paper, Google’s Android, Samsung’s Tizen and
Nokia’s Maemo are all open source mobile operating system efforts that sought to displace
Apple’s dominant iOS platform. It is unlikely that these commercial firms opted to open source
their platform technology out of altruistic charity. Rather, they adopted the open source model as
a strategy intended to accelerate the development of a critical mass of users and complementors
in markets marked by network-level competition. However, IBM’s ability to use open source to
improve the competitiveness of its proprietary software product is illustrative in demonstrating
that open source platform leadership is not limited to a specific type of software license or even
business model. It is entirely possible for a proprietary software vendor or even a service
provider to be an “open source platform leader”.
As mentioned in the literature review, a “platform leader” as defined by Cusumano and
Gawer is a firm that is able to influence the activities of other industry participants in order to
create complementary products and solutions that enhance its offering. This definition has broad
applicability as the leader can play any number of roles within the ecosystem so long as it
“drive[s] industry wide innovation for an evolving system of separately developed pieces of
technology” [30]. Building upon this definition, “open source platform leadership” is therefore
best described as “a firm’s ability to influence the development of a large number of
complementary products” through engagement in open source. The unique characteristic of
open source platform leaders is that they participate in open source development with the
specific purpose and intention of gaining a platform advantage.
While the use of an open source model may help a platform vendor accelerate the
adoption of its platform, the model also comes with its own unique set of challenges.
These challenges are systematically analyzed in the sections to follow in order to provide a
holistic framework for understanding the strategic considerations for open source leadership.
One could argue that it is more important for open source platform leaders to be forward-looking
in the management of their platforms when compared to their proprietary counterparts.
The additional involvement of community contributors and the irreversible nature of “open
sourcing” intellectual property mean that decisions that are relatively easy for proprietary firms
to make require more planning and lead time for an open source platform leader to effect. For
example, despite IBM’s significant investments and participation in Eclipse, changes to the core
platform of Eclipse are governed by the independent Eclipse Foundation (which IBM helped
establish) consisting of members of the Eclipse ecosystem, many of whom compete with IBM in
different markets. This governance structure significantly increases the friction and latency of
manipulating Lever 2 (product technology) and consequently IBM must be more proactive and
forward-looking if it wishes to deploy that lever effectively.
In order for a platform leader to proactively manage its platform strategy, it must have a
holistic understanding of the forces that shape the dynamics of competition in its given market
and establish a strategy to manage these forces. The formulation of such a strategy requires a
hypothesis on how these forces will shift as the platform evolves over time. A traditional
analysis framework used for understanding the forces that affect a given market is Porter’s Five
Forces model of industry analysis (Figure 7) [31].
Figure 7 - A reproduction of Porter's Five Forces Model. The horizontal forces represent the critical factors arising from the
value chain of the market, while the vertical forces and the center circle represent competitive forces.
While Porter’s model provides a useful outline for analysis, it was created with the
intention of analyzing the overall attractiveness of any given industry and does not specifically
consider the unique dynamics of platform-driven markets. In particular, Porter did not consider
the critical role that platform complementors play in affecting the competitive balance within a
platform market. To address this, industry practitioners such as Intel’s Andrew Grove have
augmented Porter’s framework with an additional factor capturing the influence of partners
(Figure 8).
Figure 8 - Six Forces Diagram, taken from Only the Paranoid Survive [4]
Grove’s variant serves as useful scaffolding for the discussions of different factors that
open source platform leaders need to understand and manage. In the sections to follow, different
candidate factors are identified and reasoned, structured by the categorization provided by
Grove’s Six Forces model. For the purpose of this discussion, the considerations brought on by
the emergence of new entrants and substitutes are evaluated together. Table 6 presents a
summary of these different considerations.
An overview of Google’s Android project is presented preceding this discussion to serve
as a lighthouse reference for describing these different factors. The relevance of these identified
factors to the actual behavior of aspiring open source platform leaders is further validated in the
case study on Hadoop in the chapter to follow.
Chapter 2 – Strategic Considerations for Open Source Leadership
| Considerations | Description |
|---|---|
| Rivalry | The relative intensity of inter-network and intra-network competition shapes and governs the behavior of the open source platform vendor. Vendors must continually adjust their behavior as this will change over time. |
| Suppliers | Qualified engineering talent is the primary constraining resource that an open source software vendor requires. A platform contender must understand the specific organization structure of the open source community in order to access the right talent. |
| Complementors | Complementors in the open source world can come in the form of commercial allies as well as community contributors. Vendors must form a hypothesis on what the key complements are in order to secure superior or exclusive access. |
| Buyers | By establishing a clear understanding of the purchasing process of the platform, platform contenders can establish superior or exclusive access to key intermediaries to secure a competitive advantage in intra-platform competition. They can also develop a platform advantage by injecting themselves in the purchasing process of complements in order to exert greater influence in the operations of the ecosystem. |
| New Entrants and Substitutes | The fact that open source platform vendors do not possess exclusive authority to define how the technology is packaged and reused means that alternative modes of platform consumption can emerge from unexpected sources. Platform boundaries can shift without the vendor’s involvement. As a consequence, emergent threats in the form of new direct competition or substitutes are arguably more prevalent in open source businesses. |
Table 6 - Summary of Strategic Considerations for Open Source Platform Vendors
Google and Android
Android is a Linux-based open source mobile operating system created by Google using
the assets it acquired when the search giant purchased Android Inc. in 2005. Although the ‘core’
aspects of Android are open sourced to the community under the Android Open Source Project
(AOSP), Google has been and continues to be the primary engineering force behind the
continued evolution of the Android software platform. The firm deploys its considerable
engineering resources to work on the ‘next version’ in private before releasing the source to the
community. The software typically makes its way to the hands of customers through devices
created by hardware device partners such as HTC, LG and Samsung. Google collaborates with
these partners through an industry alliance known as the Open Handset Alliance (OHA).
Participation in the OHA provides partners with unique access to Google’s resources and is
generally perceived as a requirement for gaining the license to deliver Google Mobile Services
(GMS). Google Mobile Services are complementary (but proprietary) services and components
that greatly enhance the value of the system, including applications like Gmail, Google Now,
Google Calendar and the Google Play Store.
Although many hardware device partners opt to heavily alter or ‘enhance’ the versions
of Android that ship with their devices to create a unique experience that differentiates their
offerings to end users, the vast majority of these changes are cosmetic in nature and do not
fundamentally change the definition of the platform. Moreover, hardware partners who are
members of the OHA are contractually prevented from creating “forked” or “derived” versions
of the platform, and instead collaborate with Google and others on the continued evolution of
Android [32]. In other words, while Google cedes some control of Android’s interface from the
perspective of end users to its hardware partners, this arrangement allows Google to remain the
definitive authority over the platform’s evolution from the perspective of software
complementors (Figure 9). At a conceptual level, this structure is not vastly different from how
Microsoft operated its Windows franchise over the years. However, the fact that Android is open
source means that Google’s role in defining and providing the platform is displaceable. A firm
with sufficient engineering resources and ability to deliver complementary services can
theoretically displace Google entirely and propose a different design trajectory for the platform.
Figure 9 - The Android Platform and the roles of Google, hardware partners and other complementors
This theoretical scenario unfolded when online retailer Amazon released the Kindle Fire
in 2011. Amazon elected not to collaborate with Google as a participant in the Open Handset
Alliance, but rather to create its own variant of the system based on what was available through
the Android Open Source Project. The Kindle Fire ran a derived or “forked” version of Google’s
Android that was later rebranded “Fire OS” with subsequent releases. The Fire OS was largely
compatible with applications built for the version of Android from which it was derived, but
Amazon replaced all of Google’s cloud and content services with alternatives from Amazon and
its partners. Amazon even provided an alternative “App Market” to connect users to
applications, offering its own Digital Rights Management (DRM) and payment infrastructure for
software vendors in the Android ecosystem. By choosing not to participate in the Open Handset
Alliance and “forking” its own version of the platform, Amazon put itself in a position
where it can theoretically choose to evolve Fire OS independently of Google’s influence.
Since the release of the Kindle Fire, a number of companies have followed Amazon’s
path of creating Android-derived platforms without participating in the OHA. While the
majority of these firms do not have the engineering resources that would allow them to
realistically challenge Google’s dominion over the architectural trajectory of the Android
platform, a number have sizable presence in specific geographic markets such as China and are
more than capable of displacing Google as the de facto provider of complementary services
within their markets. It is also interesting to note that even Microsoft has gotten into the game
by forking Android as a solution for developing markets by way of its Nokia acquisition [33].
While the actions of these vendors may have actually contributed to the Android platform’s
dominant market share in the mobile platform space, they have clearly been detrimental to
Google’s ability to benefit from that dominance.
Google’s management of Android is a useful reference for discussing the different factors
affecting open source platform strategy. Although the structure of the ecosystem bound together
by mobile platforms closely resembles that of the personal computing industry with which many
are already familiar, the outcome has been quite different. In particular, the battle for the mobile
industry is interesting in that a previously dominant incumbent platform leader (Apple) has been
successfully challenged by a new entrant (Google) that has opted to release its platform as open
source. Google’s changing behavior as this unfolds is a useful illustration of the dynamic nature
of open source platform strategy.
| Company | Platform | Description |
|---|---|---|
| AliCloud (China) | Yun OS | According to Alibaba (AliCloud’s parent company), the Yun OS is a Linux-based operating system that utilizes components and tools from the Android Open Source Project to deliver Android app compatibility. |
| Amazon | Fire OS | Fire OS features an optimized UI for consuming Amazon’s content and services. Application Programming Interfaces (APIs) have also been extended to promote the unique capabilities of Amazon’s hardware. |
| Baidu (China) | Baidu Yi | Yi OS displaces Google’s GMS services with Baidu’s implementations. |
| Microsoft | Nokia X | Nokia X re-skins Android with a look and feel approximating Microsoft’s Windows platform and replaces Google’s services with Microsoft’s own. It was originally conceived as a low-cost solution for developing markets. |
Table 7 – AOSP-derived products by Google competitors [34]–[36]
Rivalry – Inter-network vs. Intra-network Competition
Figure 10 - The competitive threat to a proprietary software platform vendor comes in the form of alternative platforms; open
source platform vendors must additionally contend with alternate providers, including the community, for their specific platform.
A platform leader typically possesses unique knowledge of the technologies that serve as the
technical foundation for its ecosystem; the extent to which it shares this knowledge is one of the
decisions that the firm can take (“Lever 3 – Relationship with Complementors”). For a proprietary
software platform vendor, this unique knowledge is often encapsulated in the proprietary intellectual
property that source code represents. Leveraging this unique asset, the firm is able to act as an
effective monopoly within the sub-markets that its network participants represent, as no
competitors are capable of displacing its dual roles as the platform provider and sponsor.
Competition comes exclusively in the form of alternative platforms or “inter-network”
competition. For example, as the exclusive provider of the iOS, Apple Inc. does not need to
worry about another firm supplanting its role as the dominant distributor of iOS applications to
customers. It is also secure in its position as the dominant provider of development tools to iOS
application developers (complementors in this ecosystem). Apple’s competitive concerns stem
purely from alternative ecosystems and the possibility of customers or application vendors
abandoning the iOS platform for alternatives such as Microsoft’s Windows or Google’s Android
platforms. In other words, the dominant strategic concern for proprietary platform vendors is the
establishment and sustenance of the platform itself as an industry standard.
Compared to their proprietary counterparts, open source platform vendors face an
additional dimension of complexity affecting their competitive strategy. Beyond the challenges of
establishing their platforms as the dominant industry standard, open source vendors must
additionally work to establish themselves as the primary providers of that standard. This challenge is
clearly evident in Google’s Android ecosystem. Like Apple, Google strives to establish Android
as the dominant platform against alternatives like iOS and Windows Phone in the mobile
computing space. However, due to its decision to open source the development of Android,
Google additionally faces competition in its role as the provider of platform technologies to users
and complementors within the Android ecosystem, as exemplified by its struggles with “platform
wannabes” such as Amazon and Alibaba. This struggle illustrates a fundamental tension that an
open source platform leader faces: balancing the occasionally conflicting needs of inter-network
competition with those of intra-network competition (Figure 10). The relative intensity of these
two different types of competition waxes and wanes over the course of platform evolution, and
the open source platform vendor is likely to find itself adjusting its position on the “Four Levers
of Platform Leadership” as a consequence.
In order to ensure that the Android ecosystem would attract the maximum number of
software and hardware complementors away from the incumbent leader, Google took pains in
the inception of the Android platform to ensure that it was architected in an open and modular
manner (Lever 2). It collaborated openly with hardware partners, software vendors and the open
source community (Lever 3). As Android has established itself as the de facto leader within the
mobile space (nearly 85% of all smartphones shipped in Q2 2014 were Android-based [37]),
Google’s primary strategic concern has arguably shifted from winning against competitive
platforms to sustaining its position as the primary beneficiary of Android’s success. While
Google’s decisions to share its technology have clearly contributed to Android’s dramatic growth
in the marketplace, they have also lowered the unique competitive advantages of Google as the
Android platform provider. As Google’s focus shifts away from alternative platforms to
alternative providers of the Android platform, its behavior also correspondingly changes.
One shift that Google has been slowly making pertains to its decisions around the
functionality that is delivered as a part of the open source “core” as opposed to the proprietary
services and extensions that it exclusively offers (Lever 2). In October of 2013, Ron Amadeo of
Ars Technica outlined the various functionality that Google delivers through proprietary channels
that it had previously included as part of core Android [38]. Much of this functionality was
aided by Google’s unique proprietary cloud capabilities (such as Gmail or enhanced search) and
therefore Google’s decision to encapsulate it as proprietary extensions can be justified on a
technical basis. However, the decisions to deliver self-contained enhancements, such as the
enhancements to the basic keyboard, appear to be deliberate choices intended to further
differentiate the capabilities of Google’s Android versus alternatives offered by the community
or competing providers.
| Capability | Open Source Version | Proprietary (Date Introduced) |
|---|---|---|
| Search | AOSP Search | Google Search (August 2010) |
| Music Player | AOSP Music | Google Play Music (May 2010) |
| Calendar | Calendar | Google Calendar (October 2012) |
| Keyboard | AOSP Keyboard | Google Keyboard (June 2013) |
| Camera | AOSP Camera | Google Camera (April 2014) |
| Messaging | AOSP Messaging | Google Hangouts (May 2013) |
Table 8 - Google's shift of investment into proprietary capabilities. Content adapted from Ars Technica [38].
Google’s approach in interacting with external partners has also shifted as it seeks to
leverage its relationship in order to lock out other Android platform contenders (Lever 3).
Recognizing that its contributions to AOSP do not provide it with any legal means to minimize
the ‘forking’ of its core, Google created the Open Handset Alliance at the inception of Android
precisely to provide this means. As mentioned earlier, while ‘forking’ is a fairly normal and
desired phenomenon in open source projects, the OHA’s anti-forking restriction explicitly
prevents this from happening. There is a broad understanding in the industry that participation in
the OHA is a prerequisite for meaningful collaboration with Google on Android, and
consequently the majority of hardware device manufacturers are members of the Open Handset
Alliance. By putting in place this agreement, Google significantly limits the channels through
which alternative software platform vendors can create Android-derived products, as leading
hardware vendors participating in the OHA are restricted from collaborating with them. Amazon
experienced this when it searched for hardware partners to help build its Fire line of devices,
ultimately settling on an original equipment manufacturer with minimal prior exposure to the
mobile industry as a result. In the recent past, Google has been aggressive in the enforcement of
this agreement, going as far as threatening OHA member Acer Computers with the termination
of its Google Mobile Service license to prevent the hardware manufacturer from shipping a
device with Alibaba’s Yun OS despite some controversy about whether the Yun OS should
technically be considered a “fork” of Android [39].
Google’s changes in behavior illustrate the dynamic nature of managing the tension
between inter-network and intra-network competition. As mentioned earlier, open source
platform leaders must proactively hypothesize how the dynamics of competition will play out, in
order to put in place mechanisms that offer competitive leverage later. In the Android example
above, had Google failed to anticipate the emergence of “forks” such as those created by
Amazon, it would not have put in place the anti-forking clause that provided it with one of its
few means to deter a well-resourced and capable competitor from challenging it as Android’s
platform leader.
Suppliers – Securing the Upstream Value Chain
The primary constraining ‘supply’ of the software industry is engineering talent.
Depending on the specific domain of software, the talent required may be highly specialized.
For example, IBM’s inception of the Eclipse project was made possible by the unique and highly
specialized competencies that it received through its acquisition of Object Technology
International. If the required engineering talent is scarce, the possession of such human resources
is a significant barrier of protection for open source vendors even if their software is highly open
from a licensing perspective. Moreover, even in fields where capable talent is not scarce, the
structure of many open source projects imposes limits on the supply of engineers who can
materially affect the design of a given project. In many open source projects of scale, access to
the main code-line is governed by a relatively small group of individuals who have demonstrated
competence with that project. Depending on the project structure, this group may be known as
“committers”, “reviewers” or “maintainers”. More importantly, there are often official or de
facto technical leaders in most FOSS projects of scale who are responsible for making the major
design decisions; the authority of granting “committer” status to individual contributors is
sometimes also held by this group. Typically, this leadership group is kept fairly small. Securing
access to this group is therefore a critical determinant in an open source platform contender’s
ability to influence its upstream value chain.
• PMC Chair: The Chair of a Project Management Committee (PMC) is appointed by the Board from the PMC Members. The PMC as a whole is the entity that controls and leads the project. The Chair is the interface between the Board and the Project.
• PMC Member: A PMC member is a developer or a committer that was elected due to merit for the evolution of the project and demonstration of commitment. They have write access to the code repository, an apache.org mail address, the right to vote for the community-related decisions and the right to propose an active user for committership. The PMC as a whole is the entity that controls the project, nobody else.
• Committer: A committer is a developer that was given write access to the code repository and has a signed Contributor License Agreement (CLA) on file. They have an apache.org mail address. Not needing to depend on other people for the patches, they are actually making short-term decisions for the project. The PMC can (even tacitly) agree and approve it into permanency, or they can reject it. Remember that the PMC makes the decisions, not the individual people.
• Developer: A developer is a user who contributes to a project in the form of code or documentation. They take extra steps to participate in a project, are active on the developer mailing list, participate in discussions, provide patches, documentation, suggestions, and criticism. Developers are also known as contributors.
Figure 11 - Hierarchy of influence within an Apache Software Foundation project. Adapted from the ASF [40]
There is tremendous variety with regards to the contribution model and distribution of
decision-making authority amongst open source projects (Table 9). Aspiring open source
platform leaders must understand the decision-making structure for their prospective community
in order to secure the resources required to affect the technological trajectory of their platform
(Lever 2). For example, for projects governed by the Apache Software Foundation, “committer”
status is relatively scarce and is granted by the Project Management Committee (PMC), which is
also responsible for resolving the major design decisions affecting the project (Figure 11). As a
consequence of this, aspiring platform firms that wish to affect the technological trajectory of the
project must secure some critical mass of individual committers as well as adequate
representation within the PMC. Given that PMC members “are participating as individuals…
affiliations do not cloud the personal contributions”, this means that Apache-based platform
firms must retain the services of the specific individuals who already reside on the PMC if the
firm wishes to influence the design of the technology. In contrast, the Linux development
process operates on a much more open “bazaar” style basis, with the majority of design decisions
being made via publicly accessible mailing lists and the only ‘governance’ process being the
actual mechanics by which a “maintainer” reviews and integrates individual submitted patches to
the mainstream code-line. It is entirely possible for a firm to hugely affect the design trajectory
of Linux without employing anyone who is an official “maintainer” of a Linux module. This
open decision-making process results in a significantly larger supply of engineering resources
who can make substantial contributions when compared to the more constrained pool of
committers in an Apache-governed project.
Regardless of their specific organizational structure, most open source communities
define themselves as being transparent meritocracies. This means that authority and influence
within the community arise as a consequence of demonstrated contributions within the
community rather than role or rank assigned by some “higher” authority. This leads to the
emergence of de facto technology leaders in most open source communities. Paradoxically, the
“openness” of this meritocratic philosophy actually greatly restricts the supply of talent that an
aspiring platform leader can acquire in order to lead the design trajectory of an open source
platform at any given time. While a proprietary platform leader can bestow any capable
candidate with the authority to lead the technical direction of a proprietary platform, an aspiring
open source platform leader must look to employ an established leader of the community if it wishes to
secure influence over the trajectory of the platform. Therefore, the attraction and retention of
highly-visible community leaders is a critical aspect of establishing and maintaining platform
leadership in the open source world.
| Community | Authority | Official Description |
|---|---|---|
| Apache Software Foundation | Project Management Committee (PMC) | “The PMC is the vehicle through which decision making power and responsibility for oversight is devolved to developers.” [41] |
| Apache Software Foundation | PMC Chair | “[The] chair is a facilitator and their role within the PMC is to ensure that everyone has a chance to be heard and to enable meetings to flow smoothly.” [41] |
| Eclipse Foundation | Project Management Committee | “ensure that their Project is operating effectively by guiding the overall direction and by removing obstacles, solving problems, and resolving conflicts;” |
| Eclipse Foundation | Project Lead | “ultimately responsible for ensuring that the Eclipse Development Process is understood and followed by their project” [42] |
| Eclipse Foundation | Architectural Council | “responsible for… monitoring, guiding, and influencing the software architectures used by Projects” [42] |
| Eclipse Foundation | Planning Council | “The Planning Council is further responsible for cross-project planning, architectural issues, user interface conflicts, and all other coordination and integration issues.” |
| Linux Foundation | Maintainers | “determines whether the code should be accepted into the development tree, or returned for revision” [43] |
| Mozilla Foundation | Module Owners | “responsible for leading the development of a module of code or a community activity” [44] |
| Mozilla Foundation | Release Drivers | “provide guidance to developers as to which bug fixes are important for a given release and also make a range of tree management decisions.” [44] |
| Mozilla Foundation | Super-Reviewers | “approval of a super-reviewer is generally required to check in code” [44] |
| Mozilla Foundation | Ultimate Decision-Makers | “The ultimate decision-maker(s) are trusted members of the community who have the final say in the case of disputes. This is a model followed by many successful open source projects, although most of those communities only have one person in this role, and they are sometimes called the ‘benevolent dictator’” [44]. |
Table 9 - Decision Making Authorities in different Open Source communities
It is worth noting that Google sidestepped many of these issues in its establishment of the
Android Open Source Project. Although the source code of Android is publicly published and
contributions from the community are welcome, the fact that each new version of Android is
designed behind closed doors at Google means that some of the communal and meritocratic
nature of open source development is absent from Android’s development. As a consequence,
although Google does not benefit from the power of community development that motivates
most open source projects, it is also not hindered by the constraints that community development
imposes.
As most open source platform projects are complex efforts composed of a hierarchy of
dependent sub-projects, platform vendors must thoroughly understand the architecture of the
platform and formulate their position on which sub-projects are strategic in order to secure the
right talent for affecting the platform. For example, at the time of writing, the Eclipse Platform
consists of twelve top-level projects, which are in turn composed of 243 sub-projects [45]. A
firm wishing to become a platform leader based on Eclipse must decide which of these 243
projects materially affect the platform from the perspective that matters to it and invest its
engineering resources appropriately. Each platform provider within the same ecosystem might
hold a different perspective on which modules are most critical depending on its hypothesis of
which sides of the network it wishes to focus on. In other words, aspiring platform leaders are
most likely interested in the projects that represent external interfaces of platform complements
that they see as most strategic, or modules that represent user interfaces if they see driving user
adoption as most critical.
The extent to which a firm may find it necessary to invest in the internal ‘core’ of a
platform significantly depends on the maturity of the platform and the level of inter-platform
competition. If the platform is relatively immature and unstable, and the level of inter-network
competition is intense, a platform vendor would find it necessary to focus its efforts on acquiring the
talent needed to stabilize the core and make the platform more viable. As a platform reaches
maturity and the focus of competition shifts from inter-network to intra-network competition,
platform vendors may find it less important to invest in the ‘core’ technologies but rather focus
their energies in affecting the peripheries that act as interfaces into the platform or in delivering
capabilities that differentiate their specific versions of the platform. Generalizing from all of
this, it becomes clear that how a firm chooses to collaborate with the open source community
significantly impacts whether (and how) it can secure the critical resources necessary for success.
Complementors – Identifying and Securing Critical Complements
Beyond engineering talent, platform builders require the supply of key complements in
order to make their platform viable. While aspiring platform leaders seek to attract a large
number of industry participants to their platforms to provide complementary products or
services, it is sometimes the case that providers of specific types of key complements are few or
even nonexistent. In such cases, the platform leader may choose to intervene by either providing
extraordinary support for those complementors or by directly participating in the complements
ecosystems itself to boost the supply of the required complements. For example, Cusumano and
Gawer documented Intel’s creation of the Content Group in order to help spur the creation of
scarce multimedia software at the time. The pair also documented Intel’s venture into the chipset
and motherboard business after finding that the existing vendors in the business were not keeping
up with the needs of the platform [10]. However, beyond reinforcing the ecosystem to enable
network effects, controlling key complements through intervention and involvement in
complement creation can also arm a platform leader with competitive leverage against alternative
platform vendors.
The management of key complements is a tactic that is well recognized and overtly
managed in certain markets such as video game consoles. In their survey of the various
strategies of video game console makers from 2005 through to 2007, Daidj and Isckia found that
Microsoft relied heavily on the advantage gained by exclusives such as the Halo franchise to
drive the adoption of its platform, the Xbox 360 [46]. Perhaps more interestingly, James
Prieger and Wei-Min Hu pointed out in a separate paper on the same industry that possessing
exclusive complements is only effective in driving platform adoption if the majority of
complements available are non-exclusive. Moreover, the pair found that a “small amount of
exclusivity… would be enough to foreclose competitors from all the important sources of supply
of the complementary good”[47].
The leverage afforded by the exclusive access to key complements is particularly relevant
in the intra-network competition for open source platforms, as there is generally a lowered level of
differentiation amongst vendors of the same platform and also no technical barriers creating
differences in ecosystems. In other words, open source platform vendors often compete within
the conditions for effective complement-exclusivity advantages that Prieger and Hu identified.
This fact has also manifested itself in the Android case, where Google chose to withhold
its complementary mobile services (e.g. Gmail, Maps, Google Now etc.) from Android variants.
Given that Google makes the complementary services within its Google Mobile Services
portfolio available to even competing platforms such as iOS, it may initially appear odd that
Google would refuse to provide these capabilities to other Android systems. However, Google’s
decision reflects the different competitive dynamics of inter-network and intra-network
competition. Since iOS and Android are fundamentally different platforms with significant
differences in capabilities and ecosystems, the availability of Google Mobile Services is less
likely to materially affect a customer’s relative preference for Google’s platform in comparison
to Apple’s. In contrast, Amazon’s Fire OS and Google’s Android are very similar
technical platforms with a much smaller difference in capabilities, and their application ecosystems
overlap significantly (applications that are available on Android can run on Amazon’s Fire
devices if Google’s proprietary extensions are not used, or if the developers substitute Google’s
services with Amazon’s offering). Therefore, the availability or absence of Google’s class-leading
services may materially affect consumer preferences for one vendor’s Android variant
over another. In light of this understanding, Google’s decision to withhold its Google Mobile
Services (e.g. Gmail, Google Maps, Search) from users of alternative Android platforms appears
to be a sensible and strategic means of creating greater differentiation between its platform and
those of its intra-network rivals.
Unlike Google, most firms do not have the luxury of possessing exclusive ownership of
key platform complements and often have to invest in cultivating relationships with partners just
to secure access for their platforms. In order to do so in a cost-effective and timely manner,
aspiring open source platform leaders must form clear hypotheses on the critical types of
complements in order to secure superior access either through internal development or by
developing partner relationships.
Buyers – Controlling the Path to the Customer
As the right-hand side of Figure 10 illustrated, an open source platform facilitates the
technical connection between customers and complement creators without consideration of who
is providing the underlying platform. For example, Android application developers can largely
be assured that their product can technically be sold to users of Google’s Android as well as
Amazon’s Fire OS with only a modest amount of additional investment. In other words, the
technical platform is often undifferentiated from the perspective of the complement creator, even
if the provider manages to differentiate its platform variant to end consumers. In fact, the
complement creator prefers a greater level of commonality across different platform variants in
order to minimize the amount of customization for its products. As a result of this, a platform
provider needs to find other means for differentiating itself from the other providers for the same
platform. One way that a platform provider can differentiate itself from its rivals is by
controlling the complementor’s path to platform customers.
Depending on the market that the platform serves, the relationship between the
platform provider and the end customer may vary greatly in nature and intensity. For example,
in enterprise software, vendor-customer relationships tend to be highly intense as vendors tend to
have relatively few customers, each representing a non-trivial fraction of a vendor’s revenues. As a
result, each customer holds substantive bargaining power. While such a structure may appear to
weaken the bargaining position of platform vendors, such a structure actually provides important
leverage for platform vendors to affect the behaviors of complementors. If an open source
platform provider is able to forge strong and exclusive relationships with key customers, and
there are relatively few customers on the market, it can act as an effective monopoly on the
ecosystem from the perspective of the complementor even if it is not the exclusive provider of
the platform technology.
Figure 12 – The Purchase Process of Complements – While an open source platform enables a complement provider to provide
its products to a given customer, the onus is still on the complement provider and customer to discover each other. If a platform
provider can facilitate this connection in a superior manner, it can differentiate itself from alternative platform providers.
Beyond establishing exclusive relationships to key customers, a platform provider can
also assert influence over complementors by positioning itself as a facilitator in the purchase
process of complements (Figure 12). Although an open source platform provides a unified
technical infrastructure that binds an ecosystem together, the existence of a unified platform does
not imply the existence of a unified purchase process. In other words, while a complementor's
product can be delivered to the customer thanks to the technical infrastructure provided by the
platform, the business process of discovering, evaluating and purchasing the solution is not a
problem resolved by the existence of an open source platform. Consequently, if a platform
provider can facilitate this process better than other platform providers and better than network
participants can accomplish on their own, that platform provider is able to create a preference for
its platform variant over others.
As it turns out, this has also been an important tactic in the intra-network competition
between platform providers in the Android ecosystem. Each of the different Android platform
providers has invested in its own proprietary application marketplace in order to facilitate
the purchase of applications by customers. While application vendors are able to create and
distribute applications on their own, these application marketplaces provide customers with a
simpler and faster means of discovering and purchasing new complementary “apps”. Although
alternative third-party marketplaces exist for the same purpose, the marketplace offered by a
platform provider has the distinct advantage of being pre-installed on the devices that ship with
its platform. In order to secure superior access to these channels, the application vendors are
compelled to establish sometimes exclusive relationships with a specific platform provider, even
if there is no technical reason for it.
While some may argue that the primary motivation of creating an application store is to
monetize the activities within a platform ecosystem, Google has made decisions that appear to
contradict such an objective. For example, the company limits access to its electronic
application and content marketplace (“Google Play”) to devices produced by manufacturers that
it approves (effectively manufacturers that participate in the OHA) [48]. If the company’s
objective were to capitalize on the transactions that occur within the Android ecosystem, it would
likely have taken a similar approach to Amazon; Amazon not only makes its App Market
available on its Fire devices, but also on Google-approved Android devices as well as Blackberry 10
devices [49]. Despite Amazon’s efforts, Google’s “Play” store is the largest marketplace for
Android-compatible applications with an estimated 1.3 million applications compared to
Amazon’s 240,000 as of June 2014 [50]. Having exclusive access to such a vibrant marketplace
affords Google’s Android offering a substantial competitive advantage over its intra-network
competitors. Amazon’s response, its own App Market, was not purely a means of
matching Google’s channel for delivering complements to its platform consumers, but also a
means of giving the company leverage to influence the architecture and technical interfaces of a
platform it otherwise has relatively little influence over. Developers who choose to sell their
products on Amazon’s App Market need to ensure that their apps work with Amazon’s devices,
which in turn requires the substitution of Google’s proprietary services and APIs (e.g. Google
Maps) with Amazon’s version.
The aforementioned mechanisms for platform leverage rely upon the platform provider’s
involvement in the value chain between a complement producer and the customer beyond
supplying the pure technical infrastructure offered by the platform. However, a platform
provider can also deter competitors by studying the value chain between the platform itself and
the customer. It is often the case that the path between the platform provider and the customer is
actually an indirect route controlled by a few intermediaries. Consequently, an open source
platform provider can attempt to recreate the effective monopoly of a proprietary platform
vendor by establishing exclusive relationships with those intermediaries. This was clearly a
tactic utilized by Google in an attempt to limit the fragmentation of the platform through the
Open Handset Alliance. As mentioned earlier, a customer of the mobile operating system
typically adopts a given platform by purchasing a device from one of several major device
manufacturers. By securing effectively exclusive relationships with those hardware device
manufacturers through its Open Handset Alliance program, Google greatly restricts the extent to
which alternative platform providers can displace it as Android’s platform leader.
Google’s ability to enforce its platform leadership through the OHA program hinged
upon the company’s identification of hardware vendors as the critical nodes on the value chain to
customers in the mobile industry. In the highly intertwined marketplace that many open source
platform vendors compete in, the nodes and relationships connecting the platform vendors to the
customers can be complicated, oftentimes resembling a network rather than a chain. In the case
of Android, Google additionally identified network providers and implementation partners (such
as Accenture or Wipro) as nodes on the path to the customer and has consequently included
such firms in its Open Handset Alliance program. The preemptive identification of these critical
nodes allowed Google to establish superior relationships with them, and put in place legal
agreements that ensure exclusivity (i.e. the “anti-forking” clause of the OHA).
Much like how different firms may come to different perspectives on what modules
within the platform are most strategic, different firms may also hold different hypotheses on
which relationships and complements are most critical to establishing control over the
ecosystem. The hypothesis held by the firm on which types of relationships (or even which
specific relationship) are most critical to manage may even significantly impact the firm’s “scope
of the firm” (Lever 1) decisions as firms seek to avoid conflict with key members of the business
network. The relationships that affect the purchasing process of the platform and of complements
are amongst those most critical to establishing ecosystem influence. While the strategic
management of these external relationships is also important to proprietary platform vendors,
this dimension of platform management is especially critical to the open source platform vendor
in light of intra-platform competition with alternative vendors. Given that open source platform
vendors are technically replaceable, the correct identification and possession of those key
relationships serve as a critical means of asserting platform leadership for open source firms.
Substitutes and New Entrants – The Threat of Shifting Platform Boundaries
Beyond competing with alternative platforms and rival providers of the same platform,
platform vendors must consider the threat of substitute technologies that can be considered
alternatives to adopting a platform altogether. While all product firms – platform or otherwise,
proprietary or not – face the same threat of substitution, platform firms in general and open
source vendors in particular need to be specifically aware of alternatives that can emerge from
changes in the definition of platform boundaries.
As mentioned earlier in the literature review, scholars have found that “platform
envelopment” is one of the most effective strategies for displacing an entrenched platform. In
particular, the “foreclosure attack” can be viewed as a redefinition of the platform boundaries to
a vastly larger scope, substituting the need for a specific platform with capabilities integrated into
a broader platform with known demand. One can reason that open source platforms appear to be
more susceptible to this type of substitution as there are lowered technical barriers for an attacker
to integrate the capabilities of an open source platform into the context of a broader one.
While all platform vendors face the threat of envelopment, open source platforms are
perhaps uniquely susceptible to the threat posed by shifting platform boundaries. An open source
platform is typically a complex system of subsystems loosely connected through a network of
related projects. As a consequence, enterprising vendors or members of the community can
choose to re-interpret platform boundaries and create new offerings bundling different subprojects together. Although proprietary platforms are also complex compositions of smaller
subsystems, proprietary intellectual property holders possess the unique ability to determine how
these internal subsystems are bundled together. For example, a customer cannot choose to adopt
just one aspect of the Apple iOS operating system without purchasing the entire platform. The
entire definition of what it means to consume the platform is at the discretion of Apple,
motivated at least partially by the business objectives it faces at any time. No other participant
within the ecosystem holds the power to define platform boundaries.
Figure 13 - The ability for vendors to create distributions can have the undesirable effect of fragmenting the ecosystem if there is
variation between platform products that impacts platform users. In the above example, distribution 1 and 2 are variants of the
platform produced by different open source vendors.
In the open source world, members of the community not only have the ability to
substitute one implementation of a subsystem with another, but also to define the boundaries of
the platform differently. If a member of the community believes a specific subsystem is useful
and “core”, it can choose to bundle it as its own “distribution” of the platform. As mentioned in
the literature review of this paper, a license that enables distribution creation is a fundamental
criterion for software to be considered “open source”, and distribution creation was the original business model upon
which the largest open source business in history (Red Hat, Inc.) was based [24]. Distribution
creation creates variations in the definition of the platform which can blur platform boundaries and can
fragment the ecosystem if the components being varied touch interface points with platform
consumers.
Figure 13 illustrates this with a hypothetical open source platform with two distributions.
Distribution 2 varies from distribution 1 in that subsystems 2 and 4 have been substituted with
subsystems A and B respectively. While the replacement of subsystem 4 with subsystem B is
simply a technical implementation decision that does not impact platform users, substituting
subsystem 2 with A can mean that the Type A complements created for distribution 1 do not work
with distribution 2, fragmenting the ecosystem and compromising the strength of indirect
network effects. Similar fragmentation can occur if the distributions introduce or omit
interfaces differently, creating a fuzzy platform boundary. As an example, the two major variants
of the desktop operating environment (Gnome and KDE) distributed by major distribution
creators fragmented the interfaces that desktop application developers used to create desktop
applications for Linux until a common interface was established through Project Portland by the
Desktop Linux Working Group [51]. This type of platform fragmentation reduces network
effects and harms the platform’s ability to compete with alternative platforms.
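The fragmentation scenario of Figure 13 can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the subsystem and distribution names follow the hypothetical example in the text, while the mapping of Type B complements to subsystem 1 and the helper names (`Distribution`, `supported`, `fragmented_types`) are invented here for illustration and do not come from any real platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Distribution:
    """A hypothetical platform distribution: a named bundle of subsystems."""
    name: str
    subsystems: frozenset


# Assumption: a complement type works on a distribution only if every
# subsystem it interfaces with is present. Type A -> subsystem 2 follows the
# text; Type B -> subsystem 1 is an invented placeholder.
COMPLEMENT_INTERFACES = {
    "Type A": frozenset({"subsystem 2"}),
    "Type B": frozenset({"subsystem 1"}),
}


def supported(dist, complement_type):
    """True if the distribution ships every interface the complement needs."""
    return COMPLEMENT_INTERFACES[complement_type] <= dist.subsystems


def fragmented_types(d1, d2):
    """Complement types that work on exactly one of the two distributions."""
    return sorted(
        t for t in COMPLEMENT_INTERFACES
        if supported(d1, t) != supported(d2, t)
    )


dist1 = Distribution(
    "distribution 1",
    frozenset({"subsystem 1", "subsystem 2", "subsystem 3", "subsystem 4"}),
)
# Distribution 2 swaps subsystem 2 for A and subsystem 4 for B.
dist2 = Distribution(
    "distribution 2",
    frozenset({"subsystem 1", "subsystem A", "subsystem 3", "subsystem B"}),
)

print(fragmented_types(dist1, dist2))
```

In this toy model, swapping subsystem 4 for B has no effect on complements because no complement type interfaces with it, whereas swapping subsystem 2 for A strands the Type A complements on distribution 1.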
Beyond the challenge of fragmentation, open source platform leaders also face the
possibility for key subsystems of the platform to be reused in other contexts by other firms or
individuals as a means of “hijacking” the platform’s ecosystem. Amongst the large number of
subsystems and modules within a software platform, it is often the case that only a fraction of
those subsystems are materially involved in enabling interactions with a given type of
complement. If a firm is able to isolate these core subsystems, it can reuse these modules to
allow those same complements to interact with another product or even platform. While such an
approach is theoretically possible with proprietary platforms with open interfaces, the technical
barriers to executing such a tactic are extraordinarily high. Competing vendors who want to
leverage complements built for a specific proprietary platform must reverse engineer the
implementation underlying the platform based on the interface definition and replicate the
behaviors of the base platform. Depending on the implementation technology and the degree of
coupling between the complement and platform, this is a task that ranges from difficult to
effectively impossible. However, within the open source world, a firm seeking to ‘hijack’ a
platform’s complements does not need to perfectly replicate the behavior of some unknown
black box, but rather directly modify and integrate the core components required.
As an example, Figure 14 shows a hypothetical platform comprising subsystems one
through five. Suppose that Type B complements are desirable to an alternative platform product
competing in an adjacent platform market or an inter-network competitor. Given that subsystem
2 is the only interface that Type B complements interact with, a competitor can simply integrate
that subsystem into its own platform and offer compatibility and support with Type B
complements. Since subsystem 2 requires supporting capabilities from subsystems 3 and 4, the
competing vendor can choose to also integrate those components into its product, or replace
those subsystems with its own.
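The reasoning in this example amounts to computing the set of subsystems reachable from a complement's interface through dependency links. The Python sketch below illustrates that idea under stated assumptions: the dependency of subsystem 2 on subsystems 3 and 4 follows the text, but the empty dependency sets for the other subsystems and the function name `subsystems_to_lift` are hypothetical choices made only for this illustration.

```python
# Hypothetical dependency graph for the five-subsystem platform of Figure 14.
# Only the dependencies of subsystem 2 are stated in the text; the empty sets
# for the remaining subsystems are assumptions.
DEPENDS_ON = {
    "subsystem 1": set(),
    "subsystem 2": {"subsystem 3", "subsystem 4"},
    "subsystem 3": set(),
    "subsystem 4": set(),
    "subsystem 5": set(),
}

# Type B complements interact only with subsystem 2 (as in the text).
INTERFACES = {"Type B": {"subsystem 2"}}


def subsystems_to_lift(complement_type, replaced=frozenset()):
    """Subsystems a competitor must integrate to support a complement type.

    `replaced` lists subsystems the competitor substitutes with its own
    implementation, so they (and anything only they depend on) are skipped.
    """
    needed = set()
    stack = list(INTERFACES[complement_type])
    while stack:
        subsystem = stack.pop()
        if subsystem in needed or subsystem in replaced:
            continue
        needed.add(subsystem)
        stack.extend(DEPENDS_ON[subsystem])
    return needed


# Integrate everything Type B complements rely on.
print(sorted(subsystems_to_lift("Type B")))
# Integrate subsystem 2 but supply an in-house replacement for subsystem 4.
print(sorted(subsystems_to_lift("Type B", replaced={"subsystem 4"})))
```

The point of the sketch is that the competitor never needs subsystems 1 or 5: with source code available, the cost of supporting Type B complements is bounded by the size of this dependency closure rather than by the size of the whole platform.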
Figure 14 – Ecosystem hijacking - competing vendors can isolate the interfacing and enabling subsystems for a given complement
type and choose to redeploy them in another context to expedite their own objectives.
Google experienced this hijacking phenomenon with the Android platform. In the case of the
Android platform, the application framework represented by the Android Software Development
Kit (SDK) and the execution environment known as the Dalvik Runtime are the primary
components involved in supporting the consumption of Android applications on the platform
(left-hand side of Figure 15). While the application framework does depend on capabilities
provided by other core libraries within the Android platform, the Dalvik module represented the
bulk of the required complexity. Blackberry was able to integrate Dalvik into its own proprietary
Blackberry 10 OS based on the QNX kernel it had acquired. In doing so, Blackberry was able to
bootstrap its own ecosystem by supporting Android applications and vastly reducing the cost of
multi-homing for application vendors to support its platform [52].
Figure 15 – High-level System architecture of Android and Blackberry OS 10. Adapted from
http://developer.android.com/images/system-architecture.jpg
While numerous other factors have prevented Blackberry from meaningfully contending
for a platform leadership position, the fact that Blackberry was able to leverage Android’s success to
substantially enhance the competitiveness of its own proprietary platform illustrates the viability
of this hijacking tactic for the aggressor, and the threat it poses to the platform incumbent. It
should be noted that Blackberry’s tactic would have been nearly impossible or illegal with a
proprietary platform such as iOS.
The lack of control over an open source platform’s design and architecture (Lever 2)
creates significant risks for a platform leader. These risks require constant monitoring and
management. The common open source business model of distribution creation has the potential
of fragmenting the platform and reducing network effects. Distribution creation can also shift
platform boundaries, rendering key assets and assumptions held by the firm invalid by including
or excluding modules. The availability of source code and the ease with which a platform can be
decomposed into individual parts also means that open source platforms can be “hijacked” as
competitors can repurpose key subsystems for competing purposes. The open source platform
leader needs to remain vigilant in identifying such threats and ensuring that it has
countermeasures to defend against them.
Chapter Summary
The mobile operating system space is a highly competitive market involving some of the
technology industry’s most powerful players. The fact that the leading platform (by market
share) is an open source late entrant is a testament to the increasing relevance of the open source
model in the modern computing industry. However, Google’s Android project is an
unconventional open source project in that the contributions of the open source community are
largely limited; Google explicitly chose not to take advantage of the talent within the open source
community in order to maintain greater control over the trajectory of the platform. This decision
clearly indicates that Google does not perceive the primary reason for participating in open
source to be the ability to leverage the resources of the open source community, but rather some
other attribute. With the reasonable assumption that the profit-seeking corporate entity known as
Google is not releasing the intellectual property behind Android for purely altruistic reasons, one
can reasonably infer that the decision to adopt the open source model stems from a desire to
accelerate adoption on both sides of the platform and to catalyze network effects.
While the use of an open source model can remove adoption barriers for platform users,
particularly on the side of complement producers, the forfeiture of intellectual property rights
and design authority significantly limits the means that a platform leader can use to direct the
trajectory of the ecosystem for its own benefit. As a result, platform contenders must find
alternative means of exerting their influence. In order to do so, contenders must first determine
whether inter-network competition (winning against alternative platforms) or intra-network
competition (winning against alternative providers) is the more immediate need and then form a
perspective on how this may shift over time. In addition, the firm must stay abreast of shifts in
the perceived platform boundary, which can expand or contract without its approval, and
ensure that it has the means to remain the primary beneficiary of the platform’s continued growth.
An understanding of the above two factors will shape the behavior of the vendor with regards to
how it interacts with the key suppliers (engineering talent and complement providers) and the
extent to which it intervenes in the purchasing process of complements and the platform itself.
Google’s experience with Android and its challenges in maintaining control of the
platform that it sponsored illustrate the many challenges of managing an open source platform.
Despite the fact that Google has chosen to adopt a relatively closed development model for
advancing Android, competing vendors are fracturing the platform in a manner that is
incompatible with Google’s business objectives. Given that these competing efforts are
completely legal from a licensing perspective, one of Google’s few means of influencing the
ecosystem comes from its control over the key complements within its Google Mobile
Services portfolio. By controlling that key asset and leveraging its initial exclusivity of Android
development expertise, Google is able to strike critical agreements with members of the value
chain in an effort to block out alternative platform providers. These agreements help Google
remain the de facto platform leader for Android, despite the fact that powerful rivals have
emerged.
Perhaps the most surprising lesson from the Google case study is that open source
platform leadership may require access to substantial complement assets and capabilities. Given
that Google is largely staffing the development of Android with its own employees with little
contribution from the community, it appears fair to assert that the decision to open source
Android has not reduced the amount of effort for Google to launch its own mobile platform
offering. However, it is unlikely that Android would have experienced its level of success if it
had been launched as a proprietary offering; therein lies the double-edged sword of an open source
platform strategy.
Chapter 3 – A Case Study on Hadoop
History and Origins
Doug Cutting and Mike Cafarella were struggling to solve major scalability problems with
their open source web search engine project, Apache Nutch, when Google engineers Jeffrey
Dean and Sanjay Ghemawat published their paper on MapReduce in December 2004 [53]. Their
implementation of the MapReduce idea ultimately led to the creation of the ‘big data’ platform
now known as Apache Hadoop.
Search engines such as Nutch need to traverse billions of pages in order to generate a
lookup data structure known as a search index, and this is a computationally expensive endeavor
that requires the storage and processing of an enormous amount of data. Given modern hardware,
such a challenge could only be reasonably tackled if the work were massively parallelized
between hundreds or even thousands of computers (also known as nodes) working in a
coordinated manner. The complexity of managing this type of large-scale distributed computing
was beyond what Cutting and Cafarella were able to tackle as part-time open source software
developers. The MapReduce paper represented an elegant solution to this problem by offering a
simple programming model for describing parallelizable processing algorithms and a framework
for executing them. This paper, in combination with a previous paper on the Google File System
[54], described a robust and general-purpose distributed data processing platform for exactly the
type of batch processing that Nutch was doing. Recognizing this, Cutting and Cafarella
implemented the ideas described in the papers using the Java programming language and ported
the major algorithms in Nutch to this framework. This effort allowed Nutch to scale
significantly beyond what the pair had been able to achieve with their previous homegrown
efforts.
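To make the programming model concrete, the following is a minimal single-process sketch of the MapReduce idea in Python, using the familiar word-count example. It is not Hadoop's API (which is Java-based), nor Cutting and Cafarella's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would distribute each phase across many nodes and handle storage and failures.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the user-supplied mapper to every input record."""
    for doc_id, text in documents.items():
        yield from mapper(doc_id, text)


def shuffle(pairs):
    """Group intermediate values by key (done across the network in a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# The developer only writes these two small functions; the framework is
# responsible for parallelizing and coordinating everything else.
def word_mapper(doc_id, text):
    for word in text.lower().split():
        yield word, 1


def count_reducer(word, counts):
    return sum(counts)


def word_count(documents):
    pairs = map_phase(documents, word_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)


print(word_count({"page1": "big data big", "page2": "data"}))
```

Because each mapper call depends only on its own input record and each reducer call only on one key's values, both phases can in principle be split across many machines, which is the property that made the model attractive for a workload like Nutch's index construction.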
Around the same time, internet search provider Yahoo! was prototyping a redesign of its
own distributed processing infrastructure called “Dreadnaught” based on the same MapReduce
and GFS papers under the leadership of Eric Baldeschwieler. After discovering Cutting and
Cafarella’s effort with Nutch, the firm decided to abandon its internal development and adopt the
pair’s work. According to Owen O’Malley, a founding member of Hadoop-vendor Hortonworks
and a member of the original Yahoo! team, there were two main reasons that the team abandoned
its efforts in favor of what was being done in Nutch. Firstly, Cutting and Cafarella’s
implementation was already proven to scale out to dozens of machines in Nutch, while Yahoo!’s
efforts were less mature and unproven. Adopting Hadoop would allow the Yahoo! team to roll
out a cluster of machines for its research staff to experiment with immediately. Secondly, the
individual developers on the Yahoo! team had a preference for working in open source and they
had an easier time convincing the firm’s legal department to do that with Nutch than with
Dreadnaught, since Nutch was already available to the open source community [55]. Yahoo!’s
decision to embrace the Nutch framework was aided by a trio of supportive executive sponsors
in Qi Lu, Jan Pedersen and Raymie Stata, who were leading the search division at Yahoo! in
different capacities at the time. In particular, Stata was a director on the board of the Nutch
Foundation and was familiar with both Hadoop and the team behind it.
Yahoo! hired Doug Cutting in January 2006 and spun Nutch’s distributed processing
framework into its own Apache open source project a month later. Cutting arbitrarily named the
project after his young son’s toy elephant and Hadoop was born.
Chapter 3 – A Case Study on Hadoop
Hadoop and the Big Data Phenomenon
Today, Hadoop is associated with the phenomenon known as “Big Data”. The term “Big
Data” is attributed to John Mashey of Silicon Graphics and is used to refer to both the
opportunities of, and the challenges with, the rapid growth of available data [56][57]. Doug
Laney of the META Group published a research note in 2001 entitled “3D Data Management:
Controlling Data Volume, Variety and Velocity” which identifies three primary dimensions that
drive the complexities of managing “big data” [58]. In the note, Laney points out that
conventional approaches to data management have limits along each of these dimensions. Data
that exceeds these limits requires novel techniques to be employed. In 2012, Laney (then at the
Gartner Group) published an often-cited definition of Big Data: “Big data is high volume, high
velocity, and/or high variety information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process optimization.” [59].
Characteristic | Description
Volume | The amount of data that needs to be stored, processed and analyzed. When this quantity is increased dramatically, conventional storage and processing techniques either fail disastrously or perform unacceptably.
Variety | The different types of data that need to be stored and processed. Conventional data management has been focused on the management of structured, tabular data generated by transactional systems.
Velocity | The speed at which data needs to be stored, processed or retrieved.
Table 10 - The Three V's of Big Data
With the advent of the internet and various connected, sensor-enabled machines, new data
sources have emerged that significantly increase the requirements on all three dimensions.
Moreover, the value captured within these new sources of data is unknown until they are
analyzed. While there is often general acceptance that there is value locked within these ‘big
data’ sources, the specific means through which that value is unlocked is often unknown at the
time of data collection. With conventional data management technologies, the upfront cost of
collecting such data is often too high to justify while its value remains unknown. In its 2012
survey, Forrester Research found that 88% of data collected by enterprises was discarded
because the organizations could not justify the costs of collecting it. Hadoop addresses this
problem by providing a cost-effective, flexible and scalable platform for collecting and analyzing
data with minimal upfront costs. Understanding how Hadoop resolves this problem requires a
rudimentary understanding of how a Hadoop-based data management approach fundamentally
differs from conventional data management technologies such as the relational database.
The Relational Database
Since the late 1970s, the dominant design for conventional database management systems
has been the relational model. Invented by IBM computer scientist Edgar Codd in the early
1970s, the relational model offers a flexible means of storing information by handling all data as
tuples (i.e. rows in a table) and relations [60]. This model succeeded earlier designs such as the
hierarchical model or the network model and offered a more flexible and efficient means of
representing a wide variety of different data structures. The industry standard for interfacing with
relational database management systems is the Structured Query Language (SQL). SQL is a
declarative programming language, which means that it allows applications or users connecting
to the database to describe what they would like to store, read or write from the database without
having to tell the system exactly how to complete that operation. Systems built on the relational
model are known as Relational Database Management Systems or RDBMS.
In order to store data within an RDBMS, users must first define the structure of the
tables, specifying the types of data that will be stored and describing their relationships. This data
model is also known as a “schema”. The existence of a defined schema helps a relational system
enforce data integrity and enable the efficient storage and processing of data. This requirement
to have a defined schema prior to the storage of data is known as “Schema-on-Write”. In
general, defining a robust data model that meets the needs of the application and the business is a
time-consuming task for a sophisticated database designer. This process creates an upfront cost
for businesses that must be paid before the first bit of data is collected and processed by the
system.
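The schema-on-write requirement can be illustrated with a minimal sketch using the sqlite3 module from the Python standard library. The table and column names (author, book) are hypothetical and chosen purely for illustration; SQLite stands in here for any relational system. The structure must be declared before any row is stored, a row that violates the declared relationships is rejected at write time, and the final query is declarative in that it states what is wanted without describing how to compute it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Schema-on-write: structure, types and relationships are declared up front.
conn.executescript("""
    CREATE TABLE author (
        author_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE book (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(author_id)
    );
""")

conn.execute("INSERT INTO author (author_id, name) VALUES (1, 'Edgar Codd')")
conn.execute("INSERT INTO book (title, author_id) VALUES ('A Relational Model', 1)")

# A row that refers to a non-existent author is rejected when it is written.
try:
    conn.execute("INSERT INTO book (title, author_id) VALUES ('Orphan', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Declarative query: describes the desired result, not the access path.
rows = conn.execute(
    "SELECT a.name, COUNT(*) FROM book b "
    "JOIN author a USING (author_id) GROUP BY a.name"
).fetchall()
print(rejected, rows)
```

The integrity check comes at the price described above: none of the inserts could be issued until the two tables had been designed.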
Beyond designing the schema, businesses implementing RDBMSs must also estimate the
quantity of data to be collected, along with the pace at which data will be read or written in order
to determine the combination of hardware and software that is needed to handle that load. This
is known as “system sizing”. While it is possible to “scale out” RDBMSs retroactively after a
system has been implemented and rolled out, this is generally a costly and difficult proposition
for conventional relational systems due to the fact that they are optimized for vertical and not
horizontal scalability.
According to Nikita Shamgunov, CTO of MemSQL, “enterprise-class database systems
run well on powerful hardware, and there are many forces within the industry aligned to make
this happen — not just software vendors, but also hardware manufacturers who want to show
how many more transactions per second they can push on new hardware.” [61]. As a result of
this, significant engineering efforts have been invested into ensuring that RDBMSs possess the
ability to leverage increased capacity on a single machine (“vertical scalability”). However,
increasing the storage or processing power of an individual machine is not always a feasible
option due to the limits in hardware. Expanding the capacity of the overall system by
introducing additional machines (i.e. “horizontal scale out”) is challenging for conventionally
designed database systems. Typically, such efforts require the repartitioning and redistribution of
data in order to account for the new machines. This can introduce lengthy and costly
interruptions to the operations of the system. As a consequence, conservative sizing practices are
often adopted for relational databases, which further increase their upfront costs. Moreover,
even conservative sizing is a difficult proposition for new and emergent “big data” sources
whose eventual volume and velocity are unknown.
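The cost of repartitioning can be made concrete with a small sketch. It assumes a naive placement rule in which each row is assigned to a machine by hashing its key modulo the number of machines; this rule, the key names and the cluster sizes are illustrative assumptions and do not describe any particular product. Under that assumption, growing a cluster from four machines to five changes the home machine of most rows.

```python
import zlib

def node_for(key: str, num_nodes: int) -> int:
    """Naive placement: hash the key and take it modulo the node count."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def moved_fraction(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys whose home node changes when the cluster is resized."""
    moved = sum(
        1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes)
    )
    return moved / len(keys)

keys = [f"customer-{i}" for i in range(10_000)]  # hypothetical row keys
fraction = moved_fraction(keys, old_nodes=4, new_nodes=5)
print(f"{fraction:.0%} of rows must be redistributed")
```

With a uniform hash, a row keeps its place only when both modulo results agree, which happens for about one key in five in this configuration, so roughly four fifths of the data would have to move.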
Hadoop to the Rescue
Hadoop offers a fundamentally different approach to the data management problem than
conventional RDBMSs. While the Hadoop platform itself is a generic distributed storage and
computing framework, various database management systems have been built on top of this
framework (e.g. Apache HBase, Accumulo). These systems generally fall in the class of
“NoSQL” (“Not Only SQL”) designs and are not relational in nature. Hadoop usage for “Big
Data” also involves the persistence and processing of data through raw files, which computer
scientists would not classically consider to be database systems. Neither of these usages of
Hadoop requires any significant pre-emptive modelling. Instead, the data can be persisted “raw”
in the native output format of the data producer, and the “schema” provided at the time that the
data is retrieved. This approach is known as “schema-on-read” or “late-binding”. As the
structure of the data is not provided at the time of persistence, a schema-on-read system is unable
to enforce the consistency or integrity of the data, nor can it optimize the storage for retrieval in
the manner that “schema-on-write” systems do. However, this approach allows for the deferral
of the significant upfront modelling costs associated with onboarding a new source of data with
relational systems.
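A minimal sketch of schema-on-read follows. The record layout, the field names and the read_with_schema helper are hypothetical and exist only for this illustration. The raw lines are stored exactly as the producer emitted them, including a malformed one, and the schema is supplied only when the data is read.

```python
import csv

# Raw records persisted exactly as the (hypothetical) producer emitted them.
raw_store = [
    "2014-07-01,sensor-a,21.5",
    "2014-07-01,sensor-b,not-a-number",  # nothing rejected this at write time
    "2014-07-02,sensor-a,22.0",
]

def read_with_schema(lines, schema):
    """Apply (name, parser) pairs to each raw line at read time."""
    for fields in csv.reader(lines):
        try:
            yield {
                name: parse(value)
                for (name, parse), value in zip(schema, fields)
            }
        except ValueError:
            # Integrity problems surface only now, when the data is read.
            continue

schema = [("day", str), ("sensor", str), ("temp_c", float)]
records = list(read_with_schema(raw_store, schema))
print(records)
```

A different schema could be applied to the same raw lines later without rewriting them, which is the deferral of modelling cost described above; the malformed line shows the corresponding loss of write-time integrity checks.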
Due to the fact that the workload required by Google grew at a rapid and unpredictable
pace, MapReduce and the Google File System were designed to allow the company to add
capacity in a cost-effective and flexible manner. The systems were built to run on hundreds or
thousands of inexpensive ‘commodity’ machines, rather than a few expensive but powerful
‘server’ computers, to enable cost-effective and incremental capacity increases. Moreover, the
frameworks were designed so that additional machines could be introduced to the system with
little to no interruption to system operations and minimal human intervention. This property of
“easy scalability” was inherited by Hadoop.
Hadoop’s easy scalability, in combination with the low upfront onboarding costs enabled
by its “schema-on-read” approach, makes Hadoop an attractive option for the persistence of new
sources of “big data” whose value and magnitude are yet to be understood. Firms can
cost-effectively persist data inside Hadoop without being immediately concerned about how they
would use it and be reasonably assured that Hadoop would scale with their needs. Hadoop also
equips them with the flexible processing framework needed to extract the value from the data
when the time comes. This approach to the management of “Big Data” has become so popular
that the slang term “hadump” has recently been coined by some industry observers to mock
the fact that many Hadoop systems have become “dumping ground(s)” of unused data.
Despite these criticisms, the fact that Hadoop offers a cost-effective solution to the
problem of new emergent sources of data is genuinely valuable in light of the explosion of
available data. Though other distributed and NoSQL technologies exist, Hadoop has become the
leading platform in the big data management space, especially for analytical use cases. In their
2014 research report, Forrester Research characterized Hadoop as “a must-have data platform for
large enterprises, forming the cornerstone of any flexible future data management platform” [62].
By 2015, the Gartner Group estimates that roughly two-thirds of analytical applications will have
integrated Hadoop capabilities [63].
Incumbent enterprise software vendors such as Oracle, IBM and Teradata have taken
notice as Hadoop is increasingly challenging their products as the “center of data gravity” in
enterprise datacenters. According to a 2014 survey by Wikibon, Hadoop had displaced
traditional data warehouses for some workloads for 61% of those surveyed; another 34%
expected to shift some workloads over to Hadoop within the next six months [64]. While it is
unclear if Wikibon’s sample is representative of the industry at large, the response does seem to
echo the increasing interest among corporations in utilizing technologies beyond conventional
relational data warehouses to manage the rising tide of new “big data” sources they face. Google
Search Trends data back up this sentiment as the popularity of the term “Data Warehouse”
declined while “Hadoop” and “Big Data” rose over the past decade (Figure 16).
[Figure 16: chart titled “Google Trends (2004 - 2014)”; series: hadoop, big data, data warehouse; vertical axis 0 to 120]
Figure 16 – Google search popularity of “Hadoop” and “Big Data” vs. “Data Warehouse” [65]
Architectural Overview
While Hadoop originally referred to the open source implementation of the Google File
System and MapReduce framework, the term “Hadoop” is now used to refer to the collection of
technologies that have coalesced around those two original technologies. A diagram of the various
components (along with some common open source or proprietary implementations) that were
commonly found in the prototypical Hadoop application stacks at the time of writing is presented
in Figure 17.
Figure 17 – Major building blocks within a Hadoop application stack (popular proprietary / open source project fulfilling a
given role in parentheses).
In the sections below, some of these key components are introduced to provide an
understanding of how these individual components helped Hadoop become the de facto platform
for Big Data. This understanding is necessary as a foundation for discussing the strategies of
different platform competitors and their complementors.
Distributed Storage
The distributed storage layer within the Hadoop stack is responsible for managing the
reliable and efficient persistence of data managed by the system. As Hadoop was designed to run
on low-cost “commodity” hardware that can be prone to failure, the distributed storage layer is
responsible for providing resilience in the face of hardware failures. It does so by managing
redundant copies of the data across different machines transparently. Due to the fact that data
managed by Hadoop tends to be extremely large (i.e. measured in terabytes or petabytes),
Hadoop assumes that it is more efficient to move “computation to the data” rather than the
reverse and provides mechanisms to do this.
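The idea of surviving hardware failure through redundant copies can be sketched as follows. This is a toy model rather than the placement policy of HDFS or any other system: the round-robin rule, the node names and the replication factor of three are assumptions made for illustration.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement

def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a healthy node."""
    return [b for b, replicas in placement.items() if replicas - failed]

nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
placement = place_blocks(10, nodes)

# Two machines fail at once; every block still has a surviving copy.
survivors = readable_blocks(placement, failed={"node2", "node4"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

In this sketch any two simultaneous failures are tolerated because each block lives on three distinct machines; losing three machines that happen to hold all copies of a block would make that block unreadable.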
The Hadoop Distributed File System (HDFS) was the component that was originally built
to meet the needs of this layer and it remains the most popular component for storage within the
extended Hadoop stack today. However, other options exist, including MapR’s proprietary Distributed
File System, IBM’s General Parallel File System (GPFS), Amazon’s Simple Storage Service
(S3) and UC Berkeley’s Tachyon [66]. Moreover, many non-Hadoop distributed NoSQL systems
such as Apache Cassandra and MongoDB with their own storage subsystems have also been
adapted to interoperate with the rest of the Hadoop stack.
Job Managers and Coordinators
The role of the Job Manager or Coordinator is to orchestrate the execution of
computation across the many computing nodes within a Hadoop cluster. Originally, Hadoop was
designed to handle only batch-based MapReduce jobs for “embarrassingly parallel” tasks
(computing problems that are trivial to break apart and parallelize), such as the page-indexing
operation required by Apache Nutch. As a result, the original component for managing
that computation was integrated directly into the MapReduce processing
framework itself. This component was known as the Job Tracker.
As more data is deposited within the Hadoop file system, the desire to run different types
of non-MapReduce programming models and interactive workloads has correspondingly
increased. This desire required an ability for the framework to manage computing resources for
these different use cases accordingly. As a consequence, a more sophisticated coordinator,
Apache YARN (“Yet Another Resource Negotiator”) was created to handle different types of
computing workloads [67].
While the job manager is arguably one of the most central and critical components within
the Hadoop stack, there is not much competition in this space. YARN’s only notable alternatives
at the time of writing are the open source Mesos framework created by Berkeley’s AMPLab and
the framework-specific job manager originally integrated into MapReduce.
Distributed Processing Frameworks
The MapReduce distributed processing framework that Cutting replicated in Hadoop was
designed to offer a simple programming model for software developers writing highly
parallelizable programs. Structuring a computational problem so that it can be reliably processed
by a large number of computers in parallel had previously been a complex task. MapReduce
solved this problem by requiring developers to break their algorithms into its two eponymous
steps: “Map” and “Reduce”. The “Map” step partitions the required data into groups and the
“Reduce” step processes the data within that group and summarizes it. For example, in order to
find how many books any given author wrote in a large unsorted library of books, a map
function can partition the library based on the author’s name and the reduce function can count
how many books are in each partition. MapReduce’s unique value is that it can execute this
process for a library that is spread over thousands of computers and efficiently deliver the result.
Figure 18 illustrates this process conceptually.
Figure 18 – Diagram of basic MapReduce execution, taken directly from Jeff Dean and Sanjay Ghemawat’s original article on
MapReduce [53].
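The book-counting example can be sketched in a few lines of single-process Python. This is a toy illustration of the programming model, not Hadoop's API: the run_mapreduce driver, the function names and the sample library are hypothetical, and the grouping of values by key (which a real framework performs across machines between the two steps) is done here with an in-memory dictionary.

```python
from collections import defaultdict

# A small, unsorted "library" of (author, title) records; sample data only.
library = [
    ("Author A", "Book 1"),
    ("Author B", "Book 2"),
    ("Author A", "Book 3"),
    ("Author C", "Book 4"),
    ("Author A", "Book 5"),
]

def map_fn(record):
    """Map: emit the author as the key and 1 as the value for each book."""
    author, _title = record
    yield author, 1

def reduce_fn(author, values):
    """Reduce: summarize all values that share a key."""
    return author, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # group the emitted values by key
    return dict(reduce_fn(key, values) for key, values in groups.items())

counts = run_mapreduce(library, map_fn, reduce_fn)
print(counts)
```

Because map_fn looks at one record at a time and reduce_fn looks at one key at a time, both can in principle run on many machines simultaneously, which is the property the framework exploits.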
As long as a given computing algorithm can be structured this way, MapReduce is able
to ensure that it can be reliably distributed across massive clusters of computers. As it turns out,
many complex algorithms can be decomposed into a series of MapReduce steps, making Hadoop
a versatile tool for tackling all sorts of Big Data problems.
However, the single-framework approach in Hadoop is suboptimal for a number of
reasons. Firstly, as MapReduce was implemented as a batch-processing framework, it had
significant overhead and inefficiencies that made it unusable for interactive end user computing.
Secondly, MapReduce stores all intermediate results back into the Distributed File System
(partially as a means of ensuring failure resilience), which makes the framework inefficient for
algorithms that iterate over the same dataset over and over again. Many popular
machine learning algorithms useful for extracting insight out of “Big Data” fall into this
category of “iterative” algorithms. Finally, MapReduce was an inefficient programming model
for most software developers. The framework forced developers to formulate their problems in
an unintuitive way, which significantly diminished developer productivity [68].
The Hadoop community resolved this third issue of developer efficiency by developing
new abstractions that sat on top of MapReduce. This included engines such as Pig and Hive,
which enabled programmers to develop in PigLatin (a procedural programming language for data
transformation) and SQL respectively. The community also developed libraries such as Mahout,
which offered a repository of ready-made Machine Learning algorithms so that individual
developers did not have to wrestle with MapReduce directly themselves. However, these efforts
did not address the fundamental deficiencies that MapReduce had as a framework for interactive
computing or iterative processing.
Spark, a framework originating from Berkeley’s Algorithms, Machines and People
(AMP) Laboratories, attempts to address most of these problems. Originally created as a part of
the AMPLab’s Berkeley Data Analytics Stack (BDAS), Spark was primarily developed by
Berkeley researchers independent of the Hadoop community. However, they worked to integrate
their technology into Apache Hadoop and it has since been embraced by the Hadoop community.
The AMPLab submitted Spark as an Apache Incubator project in June of 2013 and it was
accepted as a top-level Apache project in February 2014 [69].
In addition to Spark, Apache Tez was also created to address the limitations in
computational complexity of the original MapReduce framework. Both Tez and Spark were
influenced by a Microsoft Research paper on a system called Dryad [70]. Although the original
MapReduce was able to process complex algorithms by connecting one job with another, the
framework was designed to handle a relatively simple two-stage processing pipeline with a
single input and output at each stage. Dryad (and therefore Tez and Spark) offers an arbitrary
number of inputs and outputs at each stage, enabling the expression of a complex processing
graph. This fundamental improvement of the processing framework, along with YARN’s
management framework improvements discussed in the previous section, are considered the core
parts of the Apache community’s “Hadoop 2” efforts, which seek to make Hadoop a general
purpose distributed processing framework rather than one used for batch processing [71].
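The difference between a two-stage pipeline and a general processing graph can be illustrated with a small single-process sketch. It is not based on the Dryad, Tez or Spark APIs; the stage names, the sample data and the run_graph helper are hypothetical. Each stage declares an arbitrary number of input stages, and the stages are executed in dependency order using graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def load_orders():
    return [("alice", 30), ("bob", 20), ("alice", 5)]

def load_users():
    return {"alice": "CA", "bob": "US"}

def join(orders, users):
    # A stage with two inputs, which a single map/reduce pair cannot express.
    return [(users[user], amount) for user, amount in orders]

def total_by_country(rows):
    totals = {}
    for country, amount in rows:
        totals[country] = totals.get(country, 0) + amount
    return totals

# Each stage maps to (list of input stages, function).
stages = {
    "load_orders": ([], load_orders),
    "load_users": ([], load_users),
    "join": (["load_orders", "load_users"], join),
    "total_by_country": (["join"], total_by_country),
}

def run_graph(stages):
    graph = {name: deps for name, (deps, _fn) in stages.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = stages[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

results = run_graph(stages)
print(results["total_by_country"])
```

Expressing the same computation as chained MapReduce jobs would require writing the intermediate results out between jobs, which is the overhead the graph-based frameworks aim to avoid.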
Beyond offering the ability to execute more complex processing graphs, Spark also
introduced “In-Memory” computing concepts to distributed processing through a novel
abstraction known as the “Resilient Distributed Dataset” (RDD). The abstraction allows Spark
to reliably avoid persisting intermediate results back to disk, enabling the framework to execute
iterative workloads orders of magnitude faster than MapReduce or Tez. As a result of this
advantage, Spark has attracted significant attention in both academia and industry. All major
distributions of Hadoop now include Spark. Additionally, a growing set of applications,
scripting engines and libraries have ported their MapReduce algorithms over to Spark. In July of
2014, Cloudera, Databricks, IBM, Intel and MapR announced a partnership to help the Hadoop
community standardize on Spark as the “framework of choice” by porting popular components
such as Hive and Pig to Spark [72]. At the time of writing, Spark appears to be positioned to
succeed MapReduce as the de facto processing framework for Hadoop systems.
Scripting Engines, Libraries and SQL on Hadoop
As mentioned in the previous section, one of MapReduce’s major drawbacks is the fact
that it forces software developers to rethink their algorithms using a framework structured for
efficient processing by computers. In response to this, the Hadoop community created
abstractions in the form of Domain-Specific Languages and libraries to enable developers to be
more effective. For example, the Pig Scripting Engine offers an imperative programming language
(“Pig Latin”) that provides developers with operations and data structures useful for
manipulating structured datasets. Developers are able to write their data transformation
programs using this developer-friendly language and the Pig Scripting engine internally
translates these operations into machine-friendly MapReduce jobs, thereby offering the massive
parallelism and efficiency of Hadoop without imposing the burden of understanding the
associated complexity on developers [73]. Similarly, the Apache Mahout project seeks to
accelerate the development of machine learning programs on Hadoop by offering a library of
common machine learning algorithms that developers can leverage.
Unlike the other layers in the stack mentioned in the previous sections, this layer is optional:
it is entirely possible for a fully functional Hadoop system to exist without it. Therefore, these
components should not be considered part of the core technical platform. However, from the
perspective of evaluating Hadoop as an industry platform, these components are absolutely
critical. Many of these libraries are so commonly used within their respective domains that they
have become the de facto interface into the Hadoop platform. A large number of key
complements that create value for the Hadoop ecosystem depend on the interfaces presented by
these components. Due to the fact that some complementary applications depend exclusively on
the interfaces presented by these components, these layers actually make the underlying core
platform-level components substitutable.
One notable class of such components comprises those enabling SQL (Structured Query
Language) connectivity to Hadoop data. As discussed in the previous section on the relation
between Hadoop and the Big Data phenomenon, SQL is the industry standard used to interface
with conventional relational databases and has been in heavy utilization since it was first
commercially implemented in Oracle V2 in 1979 [50]. SQL and the emergence of middleware
standards such as ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity)
have made possible a large ecosystem of analytical software. Analytical software vendors
are able to create tools that support a vast variety of databases from different vendors by simply
obeying the SQL standard. The vibrancy of the analytics software market has created a large
number of sophisticated tools for business users, analysts and data scientists to extract insights
out of the data deposited within relational data sources. SQL-on-Hadoop allows these same tools
to be connected to data stored within Hadoop.
The ability to use SQL to connect to Hadoop also makes it substantially easier to combine
and integrate data between a traditional relational data store and data stored in Hadoop. This
appeals to both users who seek to gain insight with data split across these two different types of
systems as well as vendors of traditional data warehouses (e.g. IBM, Microsoft, Oracle,
Teradata). Traditional vendors can maintain their positions as the “centers of data gravity” by
allowing Hadoop data to be federated or “managed” through their relational data platforms.
Table 11 enumerates some of the SQL on Hadoop offerings that GigaOm Research evaluated in
2013.
Vendor / Community | Product / Project Name
Cloudera | Impala
Hadapt | Adaptive Analytical Platform
Teradata | SQL-H
EMC Greenplum | HAWQ
Citus Data | Citus DB
Splice Machine | Splice Machine
Apache | Drill, Hive, Stinger
JethroData | JethroData
Concurrent | Lingual
Table 11 - A selection of SQL on Hadoop offerings as identified by GigaOm Research in 2013 [63]
Administration and Management
Hadoop’s approach to scalability and resilience is fundamentally different from the
strategy typically employed in traditional enterprise data centers. In order to offer cost-effective
scalability, Hadoop was built to run on “commodity” hardware that is more prone to failure
than the types of dedicated servers traditionally found within data centers. Moreover, as the
number of computing nodes within a given cluster increases, the probability that there is a failure
within the system at some point also increases. Given that it was built to operate clusters
containing thousands of computing nodes, Hadoop (and its Google predecessor) was designed to
treat “failures as the norm rather than the exception” [54].
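The relationship between cluster size and the likelihood of a failure can be worked through numerically. The sketch assumes that nodes fail independently and that each node has the same failure probability over some period; the value of 0.001 used here is hypothetical and is not taken from the thesis or its sources. Under those assumptions, the probability that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def p_any_failure(num_nodes: int, p_node: float) -> float:
    """Probability that at least one of num_nodes independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** num_nodes

P_NODE = 0.001  # hypothetical per-node failure probability over some period

for n in (1, 10, 100, 1000, 5000):
    print(f"{n:>5} nodes: {p_any_failure(n, P_NODE):.3f}")
```

With these assumed numbers, a failure somewhere in a 1,000-node cluster is more likely than not in any given period (roughly 63%), even though each individual machine is very reliable.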
This approach fundamentally affects the work of the datacenter operator, who must
become continuously involved by proactively maintaining the health of the cluster. While such a
mode of operation is familiar to operators in internet / cloud services companies such as Yahoo!
and Google, this represents a novel challenge for enterprise IT departments. This challenge is
exacerbated by the fact that the community of developers who contribute to Hadoop tend to be
employed by internet / cloud service companies. As a result, the administrative consoles in the
community-delivered versions of Hadoop were originally designed to their preferences. For
example, Hadoop originally offered only a rudimentary graphical user interface for managing the
basic operations of the cluster, leaving the majority of configuration and management tasks for
scripts, configuration files and API calls. This minimalistic approach was sufficient for the more
technical developer-operators employed by internet / cloud service companies, but enterprise IT
administrators tend to rely on graphical management consoles to simplify their work and have
come to expect this functionality in most software that resides in their datacenters.
Commercial vendors such as MapR and Cloudera have filled this gap with their own
proprietary solutions and use these solutions as a means of differentiating their offerings from
the free community version. A free and open source management console was not created until
commercial vendor Hortonworks initiated the Apache Ambari project as a part of delivering its
own distribution of Hadoop in 2011 [74].
Market Overview
In its 2014 research report, Forrester Research characterized the Hadoop market as a
fragmented market where there were “lots of leaders, but none dominate” [62]. The researchers
divided the market into the six major types of players described in Table 12.
Name | Description
Apache Open Source | Users can directly deploy what is made available by the open source community without engaging commercial firms.
Pure play Hadoop Vendors | A number of start-ups have emerged to profit with a “focus on developing, supporting and marketing unique Hadoop distributions, add-on innovations and services”. The “Big 3” of this group are Cloudera, Hortonworks and MapR Technologies.
Enterprise Software Vendors | Enterprise software vendors such as IBM, Oracle, Pivotal, SAP and Teradata offer Hadoop as a part of their own data management solutions. They do so either by creating their own Hadoop distributions or by supporting an existing distribution through partnership.
Hadoop in the Cloud | Cloud computing vendors such as Amazon and Microsoft have begun to offer “on-demand” Hadoop services. This allows enterprises to purchase Hadoop as a service and scale their Hadoop clusters up or down at a moment’s notice.
Big Data Solution Providers | Solution providers are system integrators that design solutions using technologies from others within the Hadoop ecosystem.
Hadoop Accessories | Forrester uses this group to refer to the tools and services that complement the core Hadoop platform.
Table 12 – Breakdown of Hadoop-market according to Forrester Research [62]
Amongst these six categories, the group that receives the most media attention is the pure
play Hadoop Vendors. This group is led by three startups that have combined to raise over $1.6
billion USD in private equity and venture capital in the five years between March of 2009 and
July 2014 (Figure 19). It is worth noting that the $1.6 billion is only the amount of funding
that has been put into the three firms. The actual valuation of the three companies is significantly
higher. For example, Intel’s $740 million investment into Cloudera in March of 2014 was
exchanged for 18% of Cloudera’s equity, effectively valuing Cloudera at over $4 billion.
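The implied valuation follows from dividing the amount invested by the share of equity received. The short calculation below uses only the two figures quoted above and makes the simplifying assumption that the 18% stake was priced uniformly, with no other terms attached to the deal.

```python
investment_musd = 740  # Intel's investment, in millions of USD (from the text)
stake = 0.18           # share of Cloudera's equity received (from the text)

implied_valuation_musd = investment_musd / stake
print(f"Implied valuation: ${implied_valuation_musd:,.0f} million")
```

The result is about $4.1 billion, consistent with the statement that the investment valued Cloudera at over $4 billion.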
Cloudera, Hortonworks and MapR were all founded with the intent of bringing the
Hadoop platform from the realm of specialized internet companies to the IT departments of
enterprises. As such, these firms should be considered platform providers in the typology of
Eisenmann, Parker and Van Alstyne.
[Figure 19: chart titled “Total Funding of Pureplay Hadoop Vendors (in Millions of USD)”; series: Cloudera, Hortonworks, MapR Technologies, Total; vertical axis $0 to $1,800 million]
Figure 19 – Cumulative Investments in Pureplay Hadoop Vendors according to CrunchBase in 2014 [75]
Year | Cloudera | Hortonworks | MapR Technologies | Total
2009 | $11 | $- | $9 | $20
2010 | $36 | $- | $9 | $45
2011 | $76 | $48 | $29 | $153
2012 | $141 | $48 | $29 | $218
2013 | $141 | $98 | $64 | $303
2014 | $1,201 | $248 | $174 | $1,623
Table 13 – Cumulative Investments in Pureplay Hadoop Vendors according to CrunchBase in 2014 (in Millions of USD) [75]
While pure play Hadoop vendors tend to be the focus of analyst attention, they currently
trail significantly behind enterprise software vendors in capturing value from the Big Data
market. According to estimates provided by the research firm Wikibon, the combined revenue of
the three major Hadoop vendors totaled approximately $163 million USD in 2013. As all three
firms are privately held, Wikibon “triangulated” these numbers through discussions with various
industry observers, company insiders and other sources. Consequently, these numbers must be
used with caution. In fact, the November 2014 Form S-1 provided by Hortonworks as part of its
initial public offering (IPO) application reveals that the gross billings of the company were
substantially less than what Wikibon had estimated [76]. Nevertheless, these numbers are useful
for illustrating the relative scale of the “Big 3” Hadoop pure play vendors compared to the scale
of enterprise software vendors within the Big Data space (Figure 20)
.
[Figure: Big Data Related Software and Services Revenue 2013; series: Big Data Services Revenue, Big Data Software Revenue]
Figure 20 - Big Data-related Software and Services Revenue of the Top 3 Enterprise Software Firms vs. Pureplay Hadoop
Vendors in millions of USD (from Wikibon, processed data in Table 18 of Appendix) [77]
While the bulk of big data revenue for enterprise software firms comes from their
traditional data warehousing products based on relational database technology, enterprise
vendors are also making sizable investments into the Hadoop world. This investment is
necessary as market interest in Hadoop increases and as the attention of their customers shifts
towards solving the types of problems that Hadoop is well-equipped to solve (i.e. unstructured or
semi-structured data, large data of unknown value and usage). IBM created its own Hadoop
distribution in 2011 as a part of its IBM BigInsights analytical offering and has since built out a
number of proprietary tools and technologies to work on top of Hadoop. EMC-spinoff Pivotal
created a comparable offering in its Pivotal HD product line. Others like Microsoft, HP, Oracle
and Teradata have partnership arrangements with Cloudera, Hortonworks and MapR to bundle,
resell or redistribute their Hadoop distributions. This creates an interesting tension for enterprise
software vendors as they need to rationalize and position Hadoop alongside their existing
offerings. With the notable exception of Pivotal, the majority of these firms position Hadoop as
a complementary component within a larger big data platform, rather than a platform itself
(Table 14).
Vendor      Sample External Positioning of Hadoop

IBM         “New data management and analytic technologies are being implemented to
            complement rather than replace traditional approaches to data management and
            analytics. Thus Apache Hadoop does not replace the data warehouse and NoSQL
            databases do not replace transactional relational databases” [78]

SAP         “SAP customers can incorporate enterprise Hadoop as a complement within a data
            architecture that includes SAP HANA and SAP BusinessObjects enabling a broad
            range of new analytic application” [79]

Oracle      “New big data technologies, such as Hadoop and Oracle NoSQL database, run
            alongside your Oracle data warehouse to deliver business value and address your
            big data requirements” [80]

Teradata    “Teradata Unified Data Architecture is the only truly integrated analytics solution
            that unifies multiple technologies into a cohesive and transparent architecture that
            leverages the best-of-breed complementary value of Teradata, Teradata Aster and
            open source Hadoop” [81]

Microsoft   “The Microsoft Analytics Platform System is a no-compromise modern
            data warehouse solution that seamlessly combines a best-in-class
            relational database management system, in-memory technologies,
            Hadoop, and cloud integration in a turnkey package built for Big Data
            Analytics” [82]

Table 14 – Sample Hadoop positioning statements by Enterprise Software vendors
In addition to being some of their most formidable inter-network and intra-network
platform competitors, enterprise software vendors are also some of the most valuable partners for
pure play Hadoop vendors. Mega vendors such as IBM, SAP and Oracle are frequently the
providers of the products and services that are critical complements for Hadoop. In fact, every
enterprise software vendor listed above has a partnership arrangement with at least one of
MapR, Cloudera or Hortonworks (Table 15).
Cloudera
Hortonworks
MapR
IBM
X
X
X
SAP
X
X
X
Oracle
X
X
Teradata
X
X
X
Microsoft
X
X
X
Table 15 - Partnership matrix between pure play vendors and enterprise software vendors [83]–[85]
Of course, enterprise software vendors are not the only providers of key complements for
the Hadoop platform. “Hadoop in the Cloud” providers as well as independent software vendors
(ISVs) creating “Hadoop Accessories” also play a critical role in completing the ecosystem. In
the former category, two vendors are especially worth highlighting.
According to the 2014 version of the Gartner group’s Magic Quadrant for Cloud
Infrastructure as a Service, Amazon is the leading provider of cloud Infrastructure-as-a-Service,
leading its only competitor (Microsoft) within the “leaders” quadrant by a significant margin in
both “Completeness of Vision” and “Ability to Execute” [86]. Amazon built upon this leadership
position to establish a significant presence in the Hadoop market with its Hadoop-as-a-Service
(HaaS) offering, Elastic MapReduce (EMR). From a technical perspective, EMR differs
substantially from the canonical Apache-based Hadoop stack in that key components (e.g. the
distributed storage layer) are substituted with Amazon’s proprietary web services (e.g. Amazon
S3). According to a study by Accenture in 2013, cloud-delivered Hadoop services are superior to
on-premise “bare-metal” deployments in price, performance as well as flexibility [87]. As a
result of these advantages, adoption of cloud-delivered Hadoop services, and consequently,
Amazon’s influence over the Hadoop market, is expected to grow.
The other “Hadoop in the Cloud” vendor worth mentioning is Berkeley startup
Databricks. Databricks offers a cloud-delivered version of its Hadoop platform variant, featuring
the open source Apache Spark technology. As discussed in the architectural overview section on
distributed processing frameworks, key Hadoop players such as Cloudera, MapR and IBM have
embraced Apache Spark as a likely successor for MapReduce and the framework is rapidly
becoming the standard execution engine for Hadoop complements. At present, the firm’s only
product offering (Databricks Cloud) is a nascent and nominal entry in the emerging
Hadoop-as-a-Service market. However, beyond employing Spark creator Matei Zaharia as its CTO,
Databricks also employs 30% of all approved “committers” for Apache Spark, the most of any
organization (Figure 21). This unique competency affords Databricks a disproportionately large
reach and influence over the Hadoop ecosystem relative to its scale and size.
[Figure: Committers to Apache Spark; organizations shown: Databricks, UC Berkeley, Yahoo!, Quantifind, Mxit, ClearStory Data, Groupon, National University of Singapore, Webtrends, Bizo, Alibaba, and “Imaginea, Pramati, Databricks”]
Figure 21 - Official Committers to Apache Spark by Organization [88]
Strategic Factors affecting Platform Leadership within the Hadoop Ecosystem
In the following section, an analysis of the critical factors impacting a firm’s ability to
direct the trajectory of the Hadoop ecosystem is presented using the framework outlined in
Chapter 2. This analysis will focus on the three “pure play” Hadoop vendors as their long term
success is most directly affected by their ability to harness the growth of the Hadoop platform.
It is worth reiterating at this point that the objective of this thesis is not to assess which
Hadoop firm is most likely to succeed. Instead, this analysis is intended to identify the firms’
assessments of the market forces and how their assessments affect their strategies. Given the
rapid changes that are occurring daily within the Hadoop ecosystem, it is entirely possible that a
firm’s perspective of the market will have changed by the time this thesis is published or
consumed, rendering the detailed analysis content outdated. However, the observed behavior of
each of the firms is consistent with its perspective at the time of writing, so the analysis is
nevertheless useful for illustrating how a firm’s assessment of these market forces materially
affects its strategies.
Rivalry - Inter-network vs. Intra-network Competition
One of the fundamental questions that a pure play vendor must answer for itself is
whether its primary competitors are other Hadoop providers (intra-network competition) or
alternative platforms (inter-network competition) such as those offered by Teradata, Oracle or
IBM. This assessment affects all aspects of the firm’s strategy, including how it positions
its products in the market, the technical areas within the platform that it chooses to invest in,
the partnerships that it chooses to pursue and its interactions with the open source
community.
Current market share leaders Cloudera and Hortonworks appear to differ in their
assessments of which competitive battle is most critical to long-term success. In October 2013,
Cloudera began marketing its commercial Hadoop distribution as an “Enterprise Data Hub” and
began articulating a vision that describes its Hadoop-based platform as a “unified data
management platform” capable of addressing all data management needs of the enterprise [89].
The firm posits that Hadoop’s superior cost-effectiveness and flexibility make it the natural
“center of data centers as the first place data goes when it enters the enterprise, rather than at the
side of the data center to solve a few, ancillary problems” [90].
While Cloudera has since been careful to clarify that it does not intend to position its
product as an immediate alternative to specialized solutions like the traditional Enterprise Data
Warehouses that its large partners offer, it has also been clear on its perspective that “workloads
that belong in high-end enterprise data warehousing systems today, won’t in the future – and
even high-performance, interactive analytic workloads will run in Hadoop” [91]. Cloudera
describes its “Enterprise Data Hub” distribution of Hadoop as a complete data management
platform for companies with a multitude of data management needs, and not as a point solution
used to fill the gaps left by traditional relational technologies. In early 2014, Cloudera’s director
of marketing Alan Saldich was quoted as saying that Cloudera has “many, many customers that
are substituting an enterprise data hub built on Hadoop for incremental purchases of a whole
range of data management infrastructure, including relational databases, enterprise data
warehouses, storage, and mainframes”. In the same article, he also asserted that Cloudera’s
customers are not comparing its product to alternatives from Hadoop vendors like Hortonworks,
but rather to solutions from IBM or Teradata [92]. In other words, Cloudera’s ambition is not
to become the leading provider of Hadoop distributions but to become a leading provider of data
platforms, Hadoop or otherwise (inter-network competition).
Implications on Lever 3 (External Relationships)
Cloudera’s view of Hadoop as a platform that can eventually displace traditional data
platforms like the relational database or the enterprise data warehouse clearly differs from the
public position of its rival Hortonworks. Rob Bearden, CEO of Hortonworks, describes Hadoop
as a “rock solid” platform for processing unstructured data and expresses no desire to “reinvent
the wheel” and compete with relational database vendors such as IBM, Oracle or Teradata.
Bearden and Hortonworks espouse a “coexistence” view of Hadoop that more narrowly positions
the technology as a complement to traditional data management technologies. Hortonworks has
correspondingly invested in having its distribution “adopted and integrated seamlessly into
[these] environments”. Bearden argues that Hadoop can extend traditional data platforms such
as Teradata and “let it manage a much bigger data set, a 10 to 20 times bigger data set and have
Hadoop as an extension of its architecture” [93]. In other words, Hortonworks does not see a
need to position Hadoop as a viable alternative to traditional data management platforms but
rather focuses on competing effectively against intra-network Hadoop competitors by offering
superior integration into traditional data management environments.
The two firms’ differing emphasis on inter-network versus intra-network competition
affects the ease with which each is able to engage with other members of the
ecosystem. In some respects, Hortonworks’ positioning of Hadoop as a complement to existing
data platforms is more compatible with the perspectives of larger enterprise software vendors and
has likely assisted the firm in striking lucrative reseller arrangements with some of these giants.
HP, Microsoft, SAP and Teradata all have “strategic reseller” arrangements with Hortonworks
[91]. In 2013, Hortonworks’ arrangement with Microsoft was responsible for over 55% of
Hortonworks’ total revenue [76] and the Redmond giant actually embeds a variant of the
Hortonworks Data Platform in its HDInsight offering on its Azure cloud platform. Similarly, the
Teradata Portfolio of Hadoop redistributes a variant of Hortonworks’ HDP branded as “Teradata
Open Distribution for Hadoop” [94].
One could argue that Cloudera’s relationships with these enterprise vendors are
less intense. For example, while Cloudera recently announced a significant go-to-market
partnership with Teradata (as of 2014), Teradata still does not pay Cloudera for its technologies
in the way it pays Hortonworks; Teradata trails only Microsoft and Yahoo! in terms of
contribution to Hortonworks’ top line [76]. While Cloudera’s distribution is supported and has
been sold by HP as part of its AppSystem solution since 2012, HP made a $50 million USD
investment in Hortonworks in 2014 that appears to reflect its partnership preference [95].
However, some industry analysts have observed that enterprise software vendors are finding that
it is advantageous “to be polygamous in their relationships with Hadoop distro providers” [96].
Consequently, it is unclear if Hortonworks’ positional advantage in collaborating with enterprise
technology vendors is material or sustainable.
Implications on Lever 1 (Scope of the Firm)
While Cloudera’s intention of competing with platforms beyond Hadoop may have
impeded its collaboration with enterprise partners, this ambition appears to provide the firm with
a vision for the future of the Hadoop platform that has moved the platform forward. This
vision has also affected the firm’s “Scope of the Firm” (Lever 1) decision making.
One example of this is Cloudera’s decision to invest in the creation of a “Fast SQL”
engine for Hadoop called Impala. In a private interview completed for this thesis, the firm’s
chairman Mike Olson shared that it was obvious “that many [existing proprietary SQL engines]
would eventually be ported to Hadoop, because Hadoop matters” [97]. Given such a
perspective, a firm focused on competing effectively with other Hadoop vendors would likely
choose to partner with an incumbent SQL vendor, rather than build yet another competitor in the
crowded space and compete with its own SQL entry. However, as SQL is a crucial interface
that connects a significant number of pre-existing complementary applications, Cloudera
believed that it was crucial for a fast SQL engine to be part of Hadoop’s “open core”, rather than
be a proprietary component external to the core platform. As a result, the firm invested in
developing Impala as a Cloudera-governed open source project. The firm believed that the
inclusion of Impala as part of Cloudera’s distribution of Hadoop was necessary to bolster its
competitiveness against traditional data management solutions.
Cloudera’s decision reset expectations of what should be available “out-of-the-box”
within a Hadoop distribution and sparked the development of additional fast-SQL-in-Hadoop
projects like Apache Stinger, which have further bolstered the viability of the Hadoop platform.
Olson cites Cloudera’s integration of Apache Solr-based search and its embrace of Apache Spark
as other examples of the firm’s continued thought leadership in making Hadoop the leading big
data platform. One can argue that Cloudera’s focus on competing effectively against alternative
platforms has caused it to push the boundaries of the Hadoop platform extent, benefiting the
growth of the ecosystem as a whole.
While Hortonworks and MapR have also developed a number of significant new
technologies that improve Hadoop’s viability, they have been less aggressive in changing the
platform extent of Hadoop by introducing new capabilities to the platform. Rather, the firms
have focused their efforts on offering improved implementations of capabilities that were already
available in the market. This differing investment philosophy reflects the focus of the pair on
competing effectively within the Hadoop ecosystem rather than competing beyond it. For
example, while Hortonworks has been responsible for the engineering horsepower behind
substantial projects such as Ambari and Stinger, these projects were not breaking new ground
by offering new functionality to the Hadoop market, but were rather Apache-community
implementations of capabilities that were already available to the ecosystem at large through
proprietary extensions developed by companies such as Pivotal and Cloudera. Similarly, MapR
has focused its proprietary engineering effort on offering superior implementations of core
components such as the distributed file system, as it strives to become the most enterprise-ready
distribution of Hadoop available. While this allows MapR to differentiate itself from the likes of
Cloudera and Hortonworks, its investments in this area also do not introduce new platform
functionality that equips Hadoop to compete against alternative platforms.
As the largest pure play vendor in the Hadoop market and the “incumbent leader” due to
its first-mover advantage, Cloudera is unlikely to grow at the expense of its intra-network
rivals [77]. It is unsurprising that its search for growth has led it to look to the broader
data management market and engage in inter-network competition. Conversely, MapR and
Hortonworks recognize that maximizing the growth potential of the Hadoop market requires them
to capture a greater portion of the market. Consequently, it is also natural for these firms to focus
on intra-network competition. Generalizing from this case study, one can infer that a platform
firm’s focus on inter-network vs. intra-network competition is heavily influenced by the extent to
which it is currently positioned to capture the growth in the platform. If a firm is already
positioned to capture the majority of growth in a given platform, it will focus on inter-network
competition and winning against alternative platforms. Firms that are not the primary
beneficiaries of a platform’s growth will instead focus on growing their market share and on
intra-network competition.
Suppliers - Securing the Upstream Value Chain
As briefly mentioned in the market overview section, Intel made a substantial investment
of $740 million USD in Cloudera at the beginning of 2014, acquiring 18% of the company [98].
This investment was not only noteworthy in terms of its magnitude, but also because it
represented a rapid and surprising shift in Intel’s approach to the Hadoop market. Only a year
earlier, Intel had entered the Hadoop market by developing and bringing to market its own Intel
Distribution of Hadoop, which was optimized for its microprocessors [99]. In a private
interview conducted for the purpose of this thesis, Intel’s Big Data GM Ron Kasabian explained
that Intel’s initial motivation for getting into the Hadoop market stemmed from a desire to accelerate
the adoption of Hadoop in the enterprise. It believed it could do so by introducing the
“enterprise-hardening” features to the platform that Intel believed were critical for mass adoption
amongst enterprises.
Intel understood the challenges and opportunities of Big Data itself as it faced an
explosion of data in its own operations; Kasabian shared that Intel’s own factories generate as
much as “five terabytes of data every hour”. With an internal estimate of 94% market share in
the datacenter microprocessor market, Intel believed that it would be one of the primary
beneficiaries of mass enterprise adoption of the computationally intensive Hadoop technology.
At the time of Intel’s investment into the Hadoop space in early 2011, there was no industry
leader within the Hadoop ecosystem that Intel felt was equipped to drive adoption of Hadoop
within the enterprise. Intel decided to invest in the technology, initially believing that it could
not only accelerate the adoption of Hadoop, but also that it could become the market leader in the
space given the firm’s unique complementary assets. The firm believed that by optimizing
Hadoop for its own microprocessors, it could not only reinforce its dominant position in the
growing Hadoop market, but also compete effectively with other Hadoop vendors on the virtues
of superior performance.
Despite some initial market success, particularly within China where Intel’s distribution
of Hadoop was number one in market share, Intel decided that its Hadoop objectives were better
met by investing in Cloudera rather than continuing with its own distribution. Beyond taking
an 18% stake in the pure play vendor, Intel also agreed to cease development of its Intel
Distribution and have its engineers bring its optimizations into Cloudera’s distribution.
Relation to Lever 2 (Product Technology)
According to Kasabian, one of the reasons that Intel decided to abandon its own
distribution in favor of partnering with a Hadoop pure play vendor is the fact that it wanted to
drive its optimizations back into the core of the Apache governed projects in order to affect the
ecosystem in the manner it desired. Although it had a number of Apache-approved committers
on staff, Intel had far fewer committers than either Hortonworks or Cloudera. By partnering with
a pure play Hadoop vendor, Intel was much more likely to get its patches contributed back into
the Apache core projects and adopted by the broader community, including other Hadoop
vendors and their customers.
[Figure: # of Committers to Hadoop-related Apache Projects by Company; projects shown: Zookeeper, Tez, Spark, Pig, Hive, Hbase, Hadoop, Accumulo]
Figure 22 – Hadoop Contributors by Organization - Hortonworks and Cloudera employ the most Hadoop committers out of all
Hadoop - Yahoo and Facebook are Hadoop users and not vendors – Data extracted and analyzed from various projects website
at www.apache.org.
Intel’s assessment reflects the importance of securing access to critical
resources (i.e. influential members of the open source community) in ensuring that a firm is able
to influence the architectural trajectory of an open source platform. While the majority of open
source projects operate as meritocracies with a distributed center of authority, the truth is that a
small subset of contributors (“committers” in the vernacular of the Apache Foundation) is
responsible for the majority of technical decisions within a project at any given time. This core
group also tends to remain relatively stable for a given project. If a firm is to be a leader of an
open source platform, it must have access to such individuals for key projects; this access is a
prerequisite for wielding “Lever 2” (product technology) of platform leadership.
Beyond gaining access to the committers that affect technical decision making, a software
firm must also staff itself with individuals who deeply understand the technology used by its
customers. Moreover, a firm must convince the market at large that it has done so. In a market
such as enterprise software, the perceived pedigree of a firm’s engineering staff can be a major
consideration in the deliberate and scrutinized purchasing process. As a consequence, one can
argue that the competitive advantage of employing technical contributors in the community
stems as much from its marketing value as the actual engineering capability gained.
Being active and visible in the open source community is one way for open source firms
to convince customers of their competency in a particular open source technology. Cloudera and
MapR have engaged in very public debates regarding which firm has contributed more to the
open source development of Hadoop for this reason [100], [101]. In fact, when Hortonworks
was spun out of Yahoo! in 2011, one of its primary marketing messages was that it employed
more experienced Hadoop contributors by virtue of its Yahoo! lineage than any other company
and thus, was the best equipped to support Hadoop in the wild [102]. Open source platform
contenders in such markets also tend to employ highly visible community leaders as a part of
their management team for a similar reason; MapR, Cloudera and Hortonworks all employ
highly visible members of the Hadoop community as part of their senior leadership teams.
Collaboration with Open Source Community
Like all open source vendors, pure play Hadoop vendors face a number of options for
managing new intellectual property. There is an expectation for open source software vendors to
contribute some of their innovations back to the community. Such contributions can be
strategically advantageous to the firm, as they can be a means of influencing the trajectory of the
technology in the firm’s favor. However, it may also make sense for a firm to withhold an
innovation for itself as a means of differentiating its platform variant. If a firm opts to develop in
open source, it also needs to decide whether to do so under the governance of an
independent community such as the Apache Software Foundation, or to drive the process itself.
Within the Hadoop market, Hortonworks is the only vendor that has publicly committed
to maintaining a 100% open source development model. The firm not only does all of its
development in open source but also commits to developing “exclusively via the Apache
Software Foundation process” [103]. While this decision has implications for the company’s
business model (the company has no unique product to license and therefore, can only be a
strictly services company), it does help the firm win the minds and wallets of its customers in a
number of ways. The company’s contributions and commitment to open development create
considerable brand equity for Hortonworks within the Hadoop community itself, and this equity
can translate into influence within the community and credibility with customers in the
marketplace. The firm also heavily markets the danger of vendor lock-in that can occur with a
Hadoop distribution that is not fully open and capitalizes on this message by publicizing its
unique position as the only firm at scale to sell enterprise support for a truly open distribution
[104].
Although developing purely in the Apache model allows Hortonworks to differentiate
itself from the other vendors, it also means that the firm is limited with regard to how it can
influence the ecosystem to its exclusive benefit. Despite the fact that Hortonworks is the
plurality leader in terms of employed Apache committers in Hadoop-related projects, its
workforce still represents only a fraction of the community. Therefore, Hortonworks cannot
make technological decisions for the platform unilaterally. Moreover, while Hortonworks shares
all of its technology with its competitors, they do not necessarily reciprocate. Consequently,
Hortonworks may find itself occasionally trailing its competitors when it comes to the features
and functions that are included with its platform variant.
Unlike Hortonworks, MapR and Cloudera engage in both proprietary and open source
development. Both firms employ an “Open-Core” model, offering a for-profit commercial
product by extending open source technologies with proprietary extensions. Cloudera’s Mike
Olson and MapR’s John Schroeder have both written public articles explaining why the
development and possession of proprietary intellectual property are necessary for creating
sustainable businesses that their enterprise customers can rely on [3], [105].
Despite asserting that the possession of proprietary intellectual property is a necessity for
sustained profitability, both Cloudera and MapR still engage in open source development
for some of their new technology initiatives. This relates to the observation made at the
beginning of this thesis, which is that establishing new proprietary platform standards has
become tremendously difficult, and open source development is a tactic that can be deployed to
accelerate industry adoption. If having a standard implementation or sharing engineering
resources across the entire ecosystem is in the interest of the individual vendors, then it is
typically best for that project to be governed by an independent authority such as the Apache
Software Foundation. In Hadoop, an example of such a technology would be the
distributed processing framework itself. Processing frameworks such as Tez or Spark are
simultaneously too complex for an individual firm to develop and too important to the ecosystem
to be fragmented by proprietary development. The firms are better off collectively contributing
to the improvement of this unifying platform component than to risk fracturing the ecosystem in
the hopes of differentiation.
If a vendor believes that a given innovation is likely to help differentiate its platform
variant, then it may prefer to keep the technology proprietary and its source closed. This allows
the company to differentiate its solution and maintain control over its technology. However, if
this differentiating technology occurs in an interface component that sits between the platform and
a type of complement, the firm may need to develop it in open source in order to encourage
broader adoption. In such a scenario, the firm may opt to do so without involving an
independent authority such as the Apache Software Foundation. This allows the firm to maintain
maximum control over the project while avoiding the stigma of being a proprietary or closed
technology. Of course, such a project structure is unlikely to benefit from the resource pooling
of an independently governed project as competing platform vendors are unlikely to contribute.
Cloudera’s Impala is a prime example of such a project.
Figure 23 attempts to summarize the considerations for development model selection
discussed above into a simple decision tree for a given area of innovation.
Figure 23 – Proposed decision-tree for selecting between a proprietary, sponsored open source and community governed open
source model for a given innovation (original creation).
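The considerations discussed in this section can also be expressed as a short Python sketch. It is one possible reading of the preceding paragraphs rather than a transcription of Figure 23; the function name, the parameter names and the fallback branch for innovations that are neither differentiating nor in need of a shared standard are assumptions introduced here for illustration.

```python
from enum import Enum


class DevelopmentModel(Enum):
    PROPRIETARY = "proprietary"
    SPONSORED_OPEN_SOURCE = "sponsored (vendor-governed) open source"
    COMMUNITY_GOVERNED = "community-governed open source"


def select_development_model(
    shared_standard_benefits_vendors: bool,
    differentiates_platform_variant: bool,
    is_interface_to_complements: bool,
) -> DevelopmentModel:
    """Pick a development model for one area of innovation (hypothetical helper)."""
    # A standard implementation or pooled engineering is in every vendor's interest:
    # let an independent authority such as the Apache Software Foundation govern it.
    if shared_standard_benefits_vendors:
        return DevelopmentModel.COMMUNITY_GOVERNED
    if differentiates_platform_variant:
        # Interfaces between the platform and its complements need broad adoption,
        # so the firm opens the source but keeps governance in-house.
        if is_interface_to_complements:
            return DevelopmentModel.SPONSORED_OPEN_SOURCE
        return DevelopmentModel.PROPRIETARY
    # Assumption: the text does not cover this case; default to contributing back.
    return DevelopmentModel.COMMUNITY_GOVERNED


# Examples drawn from the text:
spark = select_development_model(True, False, False)        # processing framework
impala = select_development_model(False, True, True)        # SQL interface
mapr_storage = select_development_model(False, True, False)  # proprietary file system
print(spark.value, impala.value, mapr_storage.value, sep="\n")
```

The order of the checks reflects the argument that ecosystem-wide components should not be fragmented even when a vendor might be tempted to differentiate on them.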
Complementors - Identifying and Securing Critical Complements
As mentioned in Chapter 2, both proprietary and open source platform contenders need to
be proactive in managing critical platform complements. However, open source platform
vendors face a unique challenge in that a community organization such as the Apache Software
Foundation may act as a broker connecting the platform to its complements. In the topology
of Eisenmann, Parker and Van Alstyne, the Apache organization effectively acts as the platform
sponsor. Consequently, open source vendors cannot use some of the levers afforded to
proprietary firms (such as developing exclusive interfaces with Lever 1) for securing unique
complements to the open platform. Instead, these firms must rely on alternative techniques, such
as the development of partnership arrangements or the introduction of proprietary interface
components, to regain that leverage.
Partnership Programs
All three pure play Hadoop vendors boast robust partnership programs for key
complements. For Hadoop, one primary type of complement that enhances the value of the
platform is the application created by an independent software vendor (ISV) that specializes
Hadoop for a specific market or function. As all three firms share a significant number of public
interfaces governed by the Apache Software Foundation process, applications that work on one
vendor’s Hadoop distribution tend to work on another’s as well. However, in the enterprise
software market, “officially supported” software and “technically compatible” components are
sharply differentiated, and most enterprise information technology departments are only willing to
adopt the former. Consequently, all three firms offer partnership or
certification programs to assure potential customers that solutions created by
independent software vendors are fully supported on their platforms.
Given the common architectural foundation and interfaces of the three vendors, it is
difficult to curate a fully differentiated partner ecosystem on the basis of applications alone; a
software vendor that has built software for Hadoop faces very low barriers to multi-homing
across the different distributions. Table 20 of the appendix shows a composite matrix listing the
independent software vendors and technology partners that Cloudera, Hortonworks and MapR
claim on their respective company websites as of November 2014. The three firms not only
boast a very similar number of such partners (Cloudera: 164, Hortonworks: 156 and
MapR: 159), but also have a substantial number of partners in common. The majority of
Hortonworks’ and Cloudera’s ISV partners have a relationship with at least one of the other pure
play Hadoop vendors.
Vendor        Exclusive Partnership   Non-exclusive Partnership
Cloudera      76                      88
Hortonworks   66                      90
MapR          100                     59
Figure 24 - Analysis of Exclusivity of Partnership Arrangements for Hadoop ISVs
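The degree of overlap can be checked with simple arithmetic. The following Python sketch computes, for each vendor, the share of partners that also work with a rival distribution. The partner counts are the data labels of Figure 24; the assignment of each label to the exclusive or non-exclusive series is an assumption of this sketch, chosen so that the totals match the counts quoted in the text. The names `partners` and `non_exclusive_share` are illustrative only.

```python
# Partner counts per vendor as (exclusive, non_exclusive), taken from the data
# labels of Figure 24. Which label belongs to which series is an assumption.
partners = {
    "Cloudera": (76, 88),
    "Hortonworks": (66, 90),
    "MapR": (100, 59),
}


def non_exclusive_share(exclusive, non_exclusive):
    """Return the fraction of a vendor's partners shared with a rival."""
    return non_exclusive / (exclusive + non_exclusive)


for vendor, (exclusive, non_exclusive) in partners.items():
    total = exclusive + non_exclusive
    share = non_exclusive_share(exclusive, non_exclusive)
    print(f"{vendor}: {total} partners, {share:.0%} non-exclusive")
```

Under these assumptions the totals reproduce the partner counts cited above (164, 156 and 159), and slightly more than half of Cloudera's and Hortonworks' partners are shared with at least one rival.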
Given that it is difficult to differentiate a given open source platform variant from its
intra-network rivals based on the primary types of platform complements (i.e. applications),
aspiring platform vendors may need to look to attract other types of complements for
differentiation. Firms may form different opinions about which types of complements are most
valuable beyond the primary complement types.
In a private interview, Mike Olson shared that Cloudera actively pursued the Intel
partnership because it recognized the unique competitive advantage offered by visibility into
Intel’s roadmap and access to Intel’s unique engineering talent [97]. Olson believed that
customers would greatly value the superior performance that an Intel-optimized Hadoop
distribution would hypothetically offer. Ron Kasabian of Intel later corroborated this account,
stating that one of the primary reasons for the partnership was that the Cloudera leadership team
appeared to appreciate the value proposition that Intel was bringing to the table more than its
competitors did. This anecdote
illustrates the different emphasis that each firm may place on different complement types.
Buyers - Controlling the Path to the Customer
Mediating the Purchasing Process of Complements
Enterprise software products are inherently complex systems. New products brought
onto a customer’s landscape must integrate with numerous systems that already reside there.
These existing systems are used, administered and developed by different individuals and
organizations using different technologies from different eras. Consequently, implementations of
enterprise software systems are often expensive and lengthy endeavors that may require
millions of dollars and hundreds of man-years to complete. Because of the significant
investment that they represent, the purchasing process for enterprise software systems also tends to
be long and complicated. Enterprises often enlist the help of multiple parties, including
consultants such as IBM or Accenture, to help them make the best possible decision when selecting
their vendors.
The complexity of this purchasing process is simultaneously an opportunity and a
challenge for Hadoop vendors. As mentioned in the corresponding section in Chapter 2, aspiring
open source platform contenders can differentiate their platform by mediating the purchase
process of complements for customers. Android platform providers attempt to do this by
providing electronic application marketplaces in order to simplify the user-driven acquisition
process of complementary applications for their platforms. The platform providers’ involvement
in complement delivery also provides them with a channel to influence and govern the behavior
of complement creators. Unfortunately, due to the more elaborate purchasing process of
enterprise software, Hadoop vendors cannot take complete ownership of the application
purchasing process in the manner that mobile platform vendors have attempted to. However,
Hadoop vendors still attempt to actively participate in that process to help expedite it and to exert
influence. One specific way they attempt to do that is through the creation of partnership or
certification programs for their platform.
As mentioned in the previous section on securing critical complements, Cloudera,
Hortonworks and MapR all offer programs to help assure the customer that a given complement
provider’s product is compatible with their platform variants. However, certification programs
serve additional purposes. They are also intended to simplify
the application selection process by helping customers identify the complement vendors
available to them. Cloudera describes this intent on its website in the following manner: “The
Cloudera Certified Technology program is designed to make choosing the right technology
easier. When you see the Cloudera Certified Technology logo, you can trust that the product has
been tested and validated to work with CDH, our 100% open source and enterprise-ready
distribution of Apache Hadoop and related projects.” [84]. To this end, each of the firms
dedicates prominent sections of its company website to helping customers find potential
complement vendors.
Beyond helping customers identify the right partners, the certification process also
provides an opportunity for aspiring platform contenders to exert influence over the behavior of
complement creators. For example, Hortonworks explains that technologies certified through its
partnership program “are reviewed for architectural best practices”, while Cloudera states that it
verifies that its partners “comply with Cloudera development guidelines for integration with
Hadoop”. These review processes give the firms an opportunity to guide a partner organization
towards integrating with platform components and interfaces in a manner favorable to them. For
example, Hortonworks and Cloudera offer very different administrative environments for their
platform variants, with Cloudera offering its proprietary Cloudera Enterprise Manager and
Hortonworks offering a similar environment in Apache Ambari. Though neither firm imposes
this today, it would be possible and reasonable for the firms to require that a complement
provider integrate into their specific administration consoles in order to achieve certification. As
complement producers are often smaller vendors that depend on the endorsement of the platform
providers to reach potential clients, the certification process acts as a powerful bargaining chip
for the platform contenders to influence their activities. Even enterprise software vendors that
exceed the pure play Hadoop vendors in scope and scale lack the expertise and credibility of the
pure play vendors within the Hadoop market and may need to look to these certification
programs to gain credibility with their customers. This offers the smaller pure play vendors
additional leverage when bargaining with their more powerful competitors.
Beyond participating in the purchasing process of complements, Hadoop vendors also
seek to influence the purchasing process for the platform itself by partnering with key
stakeholders in that process. In enterprise software, major influencers in the
purchasing process include system integrators and IT consultancies such as Accenture, Infosys
and IBM Global Services, as well as full-stack mega vendors such as Oracle, IBM and SAP. It
is not uncommon for some of these firms to become so embedded within
the operations of a large enterprise that their endorsement and approval can
determine whether or not a smaller vendor will be considered for a deal. All three of the existing
Hadoop pure play vendors recognize this and have partnered with these firms as a means of
ensuring that their platform variants are considered in the selection process. One may infer from
the prior discussion on Hortonworks’ close product and reseller partnerships with enterprise
mega vendors that it holds a distinct advantage over its competitors in this regard. However, due
to the organizational separation between the product and services organizations that exists in most
of these vendors, the impact of those partnerships on vendor selection appears to be minimal.
Vendor        Amazon                            Google                         Microsoft
Cloudera      N/A                               N/A                            CDH Available via Azure Marketplace
Hortonworks   N/A                               N/A                            Directly integrated into HDInsights
MapR          Directly available as EMR Option  Exclusive Distribution on GCE  N/A
Table 16 - Partnerships between pure play Hadoop vendors and leading cloud IaaS vendors – sourced from company websites
One interesting group of partners for Hadoop platform vendors is the cloud Infrastructure-as-a-Service
(IaaS) vendors, such as Amazon, Google and Microsoft. These three software giants each
offer their own Big Data solutions as a service-based offering (Elastic MapReduce for Amazon,
BigQuery for Google and Azure HDInsights for Microsoft) that compete with the offerings of the
pure play Hadoop vendors. The three software giants possess a significant go-to-market
advantage over the pure play vendors as they are able to offer both the software and the
underlying hardware infrastructure in a single package, significantly simplifying the overall
acquisition process for a big data solution for customers. Pure play vendors have attempted to
nullify this advantage by integrating with some of these cloud vendors; Table 16 enumerates
some of the integrations that have been pursued. Of the three pure play vendors, Cloudera is
arguably the least integrated into the offerings of these leading cloud vendors, with its only
integration point being the availability of its solution via the solution marketplace offered by
Microsoft’s Azure. According to some industry observers, this is the reason that the company
has pursued and heavily marketed its partnerships with some of the smaller cloud operators
[106].
Substitutes and New Entrants - The Threat of Shifting Platform Boundaries
In an August 2014 article for the Association for Computing Machinery (ACM), noted MIT
adjunct professor of computing science and database luminary Michael Stonebraker observed
that what exactly makes a Hadoop solution Hadoop is fairly ephemeral [107]. Stonebraker
pointed out that the MapReduce-based distributed processing framework that had been
synonymous with Hadoop has been abandoned by newer projects like Cloudera Impala; Impala
uses its own optimized distributed processing engine, which accesses the Hadoop Distributed File
System (HDFS) directly. Apache Spark, originally developed independently of the Hadoop
ecosystem, is now embraced by the Hadoop community to such an extent that it joins the original
Hadoop MapReduce framework and its successor (Tez) as a standard processing framework for
Hadoop. As a consequence, Stonebraker observed that the only thing that seems to be a
condition for a platform to be labelled “Hadoop” is the use of HDFS as the storage and
persistence layer at the bottom of the technology stack.
Figure 25 - Hadoop in 2011 vs. 2014 – A “Hadoop” deployment in 2011 always contained the components that were considered
part of the ‘core’ Hadoop platform, and likely a component that was part of what was considered the extended platform. A
deployment of Hadoop in 2014 may not include any components of either sort.
Though the focus of his article was on another topic, Stonebraker was pointing to the
continuous and substantial shifts in the platform boundaries of Hadoop. At Hadoop’s origin back
in 2007, the MapReduce framework and HDFS were clearly defined as “core” to the Hadoop
platform. Subsequent efforts like Apache Hive built upon this core and were so useful and
ubiquitous that they were effectively considered a part of the extended platform. Since then, new
requirements and technologies have emerged to displace even components that were part of this
original platform. In fact, the shifts have been more substantial and the definition of “Hadoop”
murkier than even Stonebraker posited. The usage of HDFS cannot be relied upon as a
condition for defining what constitutes a “Hadoop” distribution; the MapR distributions of
Hadoop do not use HDFS at all but rather the MapR Distributed File System mentioned earlier.
Consequently, a MapR customer that uses only Impala in their “Hadoop” implementation in
2014 will not use any major components that would have been considered “core” to Hadoop only
a few years earlier (Figure 25). Of course, this raises the question: what exactly is the “Hadoop
Platform” if it is not defined by any specific technology or major component?
Figure 26 – Displacing Core Components – The fact that platform-internal APIs are well documented in open source systems
allows core components to be substituted (HDFS → MapR NFS). Dependent components (Hive → Shark) can also be forked and
adapted in the case where clean substitution is not possible (MapReduce → Spark).
Given that Hadoop is an industry platform mediating the Hadoop ecosystem, one answer
may be that the “Hadoop platform is a collection of technologies that binds together the Hadoop
ecosystem”. While this definition seems tautological on the surface, it actually addresses the
puzzle at hand. While the MapReduce engine and HDFS encapsulated the original value-generating
intellectual property that motivated the genesis of the ecosystem, they are not the
technologies that bind the ecosystem together. That responsibility lies with the relatively simpler
interfaces and interfacing subsystems that allow these components to be connected to one
another. By this logic, any technology stack that provides these interfaces ought to be
considered a “Hadoop Platform” provider, even if the technology does not share the same lineage
or development leaders as the others. For example, MapR’s product is considered a Hadoop
distribution because the company provided a proprietary alternative to HDFS that was both API
and wire protocol compatible with the HDFS components. It is worth noting that while this
would have been theoretically possible in a proprietary platform as well, this type of low-level
component substitution is extremely unlikely to happen in proprietary platforms, as the interfaces
between internal platform subsystems would not have been documented or easily substitutable.
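The substitution argument can be illustrated with a small sketch. The Python code below is a hypothetical illustration and does not reproduce Hadoop's actual APIs: `FileSystem`, `ReferenceFileSystem`, `VendorFileSystem` and `word_count` are names invented here. It shows only that a dependent component written against a documented interface keeps working when the implementation beneath it is replaced.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical stand-in for a documented storage interface."""

    @abstractmethod
    def write(self, path, data):
        ...

    @abstractmethod
    def read(self, path):
        ...


class ReferenceFileSystem(FileSystem):
    """Plays the role of the original open source implementation."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]


class VendorFileSystem(FileSystem):
    """Plays the role of a vendor replacement with different internals."""

    def __init__(self):
        self._log = []  # append-only list of (path, data) pairs

    def write(self, path, data):
        self._log.append((path, data))

    def read(self, path):
        # The most recent write for a path wins.
        for logged_path, data in reversed(self._log):
            if logged_path == path:
                return data
        raise KeyError(path)


def word_count(fs, path):
    """Dependent component: relies only on the FileSystem interface."""
    return len(fs.read(path).split())


for fs in (ReferenceFileSystem(), VendorFileSystem()):
    fs.write("/data/sample.txt", "one two three")
    print(type(fs).__name__, word_count(fs, "/data/sample.txt"))
```

The dependent component `word_count` never refers to a concrete class, so either implementation can sit beneath it. This is the property that a documented internal interface provides and that an undocumented one withholds.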
Interestingly, even in cases where a replacement component cannot cleanly fulfil the
interfaces of a core platform component, the open source nature of the platform means that
dependent components can be forked and adapted by motivated parties. For example, although
Apache Spark could not provide exactly the same Mapper and Reducer APIs that the original
MapReduce engine provided for its clients, the Spark team was able to modify popular
dependent components such as Hive to run on top of its engine (Figure 26).
The ability to fork the Hive source code allowed the Spark team to create a component (Shark)
that offered the same client interface as Hive (HQL) and maintained compatibility with dependent
products and applications. The ability to fork and adapt existing components also allows
inter-network competitors to hijack key platform components and complements for their competing
platforms. The MapReduce engine, Hive, Spark and Shark have all been forked and
ported to alternative data management platforms such as Apache Cassandra by inter-network
competitors such as DataStax [108].
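The fork-and-adapt pattern described above can also be sketched in code. The Python example below is a hypothetical illustration, not the actual Hive, Shark, MapReduce or Spark code: `MapReduceEngine`, `PipelineEngine`, `QueryLayer` and `ForkedQueryLayer` are names invented here. It assumes a new engine whose API is incompatible with the original, and shows a forked dependent component that keeps its client-facing method unchanged while calling the new engine internally.

```python
class MapReduceEngine:
    """Hypothetical original engine: one map phase, then one reduce phase."""

    def run(self, records, mapper, reducer):
        return reducer([mapper(record) for record in records])


class PipelineEngine:
    """Hypothetical newer engine with a different, incompatible API."""

    def execute(self, records, stages):
        result = records
        for stage in stages:
            result = stage(result)
        return result


class QueryLayer:
    """Hypothetical dependent component written for the original engine."""

    def __init__(self, engine):
        self.engine = engine

    def count_where(self, records, predicate):
        # Map each record to 1 or 0, then reduce by summing.
        return self.engine.run(records, lambda r: 1 if predicate(r) else 0, sum)


class ForkedQueryLayer:
    """Fork of QueryLayer: same client method, rewritten engine calls."""

    def __init__(self, engine):
        self.engine = engine

    def count_where(self, records, predicate):
        # Filter the records, then count what remains.
        keep = lambda rows: [r for r in rows if predicate(r)]
        return self.engine.execute(records, [keep, len])


rows = [3, 8, 15, 42]
original = QueryLayer(MapReduceEngine())
forked = ForkedQueryLayer(PipelineEngine())
print(original.count_where(rows, lambda r: r > 5))
print(forked.count_where(rows, lambda r: r > 5))
```

Client code that calls `count_where` cannot tell the two layers apart. Under the assumptions of this sketch, that is how a fork can preserve compatibility with dependent products even though the engine underneath has changed.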
The availability of internal interfaces and implementation source code has allowed
Hadoop to evolve rapidly. However, it also represents a challenge for commercial vendors
attempting to influence the platform’s trajectory. Neither of the technology substitutions
presented above was made with the approval of a clear platform sponsor. The Apache Software
Foundation, typically viewed as the “platform sponsor” for Hadoop, did not make a conscious
decision that Spark or Shark was worth developing and adopting; Spark became an Apache
project after its first release and initial adoption. Ultimately, what determined which
technologies would be considered part of the Hadoop platform was the market at large and the
ecosystem as a whole. Technologies that were sufficiently compatible with existing ecosystem
solutions and offered a sufficiently compelling value proposition were eventually adopted broadly
across the entire market, as intra-network competitive forces drove platform vendors to adopt the
best-in-class technology.
This meritocratic nature of platform governance in Hadoop also creates tremendous
opportunities for new entrants to enter the market. As mentioned in the Hadoop market
overview, the immense popularity of Apache Spark has allowed commercial vendor Databricks
(founded and led by Spark’s creators) to gain tremendous influence over the Hadoop ecosystem
despite its late entry and limited scale of operation. In a closed-source ecosystem, it would have
been exceedingly difficult for such a small firm to penetrate what had already become a very
large ecosystem with leaders operating at scale. This example also suggests that open source
platform leadership is generally less stable than conventional platform leadership. The fact that
technology can be adopted and displaced at the whims of the market means that the core
technical competencies that a platform leader has built up can quickly be invalidated if a better
mousetrap emerges. If Apache Spark continues on its current trajectory towards becoming the de
facto distributed processing framework of the Hadoop ecosystem, then the value of Hortonworks’
technical expertise in the MapReduce and Apache Tez technologies could diminish drastically.
As a result of all of this, an aspiring open source platform leader cannot rely solely on the
gravitational pull of platform complements to maintain its leadership position, as the platform
and the ecosystem can be ‘hijacked’ in the manners described above. It must stay on top of new
technologies and be ready to embrace them in order to maintain its understanding of the
ecosystem. Platform vendors vying to build their businesses upon open source technologies
must remain open to technological change, as the market will largely determine the platform
standards for the ecosystem independently of them.
Chapter 4 - Conclusion
In his book “Only the Paranoid Survive”, Andrew Grove wrote the following in
reference to technological changes:
“ARE SUCH DEVELOPMENTS A CONSTRUCTIVE OR A DESTRUCTIVE FORCE? IN MY VIEW, THEY
ARE BOTH.
AND THEY ARE INEVITABLE. IN TECHNOLOGY, WHATEVER CAN BE DONE WILL BE DONE.
WE CAN’T STOP THESE CHANGES. WE CAN’T HIDE FROM THEM. INSTEAD, WE MUST FOCUS ON
GETTING READY FOR THEM.” [4]
This quote certainly seems to apply to the world of open source software. While open
source platforms benefit from the same network effects that proprietary platforms enjoy, the
ability of a single firm to harness and direct that growth is greatly hindered by the increased pace
of technological change afforded by the open intellectual property. Aspiring platform contenders
in the open source world cannot rely on their exclusive possession of key platform technology to
direct the behavior of complementors. Instead, they must assess and increase their influence
over the forces that affect the market in order to shape the movement of the ecosystem as a
whole.
Given all the variables that are beyond the control of an open source platform contender,
perhaps the image of a “platform leader” as a powerful orchestrator that directs an ecosystem by
manipulating the levers of its platform empire is an inappropriate one. With fewer levers of
power at its disposal, an open source platform leader is perhaps more like a politician in a modern
democracy, leading through influence and relationship building rather than power and
authority. Moreover, continued possession of a leadership position greatly depends on a firm’s
ability to survey the sentiments of its constituents and adjust accordingly. Open source platform
contenders must clearly assess whether or not their primary rivals sit within the same ecosystem
(i.e. network) and then look up, down and across the value network to ensure unique or superior
access to the suppliers, buyers and partners that make up the market. Beyond this, such vendors
must stay abreast of the technological changes that can emerge to reshape the makeup of the
market, or risk having all their efforts quickly invalidated.
While this image of an open source platform leader as a politician is arguably less awe-inspiring than the previous image of a powerful empire leader, it is perhaps a more prevalent and
relevant depiction of platform leadership in the current software market landscape. With few
exceptions, the open source model is becoming the preferred approach for establishing new
platforms, in the same way that most modern modes of governance are democratic and not
authoritarian. Software vendors that seek to operate within platform markets must accept this
new reality and adapt appropriately.
Areas of Further Research
Much of this thesis drew upon the case studies of two open source platforms: Google’s
Android and Apache Hadoop. These two platforms were selected for their relevance to the
consumer and enterprise software markets, as well as the difference in structure and origins
between them. The Android case study illustrates how a firm may choose to establish an open
source model for its own technology under its own terms, and yet still be subject to the fluid
nature of open source platform dynamics. The Hadoop case study illustrates how firms can work
to establish leadership positions for an external platform technology made available through the
open source community and still extract enormous value. Despite the purposeful selection of the
two case studies, the fact that this thesis studied only two platforms is a limitation; further
research analyzing other platforms could identify additional factors, tactics and
strategies relevant to open source platform leadership.
While this thesis was written with an understanding of the different business models that
are available to open source ecosystems, the analysis did not systematically consider the impact
of these different business models on the firms’ behaviors with regard to platform strategy.
Moreover, the findings of this thesis were descriptive in nature and would be complemented by
further work to establish a prescriptive framework for managing open source platform
leadership. Systematic consideration of the business model would likely be required for such an
effort. Relatedly, both case studies of the thesis focused on platform providers even though
platform leadership is equally applicable to platform sponsors and users. Given that some of the
largest technology companies in the world are open source platform users (and not providers),
case studies focusing on strategies employed by companies acting in these other platform roles
would be beneficial.
Appendix
Table 17 - Committers to Apache Spark; Extracted from https://cwiki.apache.org/ on Oct 1st, 2014
Name                     Organization
Andrew Xia               Alibaba
Stephen Haberman         Bizo
Mark Hamstra             ClearStory Data
Aaron Davidson           Databricks
Andrew Or                Databricks
Andy Konwinski           Databricks
Josh Rosen               Databricks
Matei Zaharia            Databricks
Michael Armbrust         Databricks
Patrick Wendell          Databricks
Reynold Xin              Databricks
Tathagata Das            Databricks
Xiangrui Meng            Databricks
Thomas Dudziak           Groupon
Prashant Sharma          Imaginea, Pramati, Databricks
Jason Dai                Intel
Nick Pentreath           Mxit
Shane Huang              National University of Singapore
Imran Rashid             Quantifind
Ryan LeCompte            Quantifind
Ankur Dave               UC Berkeley
Charles Reiss            UC Berkeley
Haoyuan Li               UC Berkeley
Joseph Gonzalez          UC Berkeley
Kay Ousterhout           UC Berkeley
Mosharaf Chowdhury       UC Berkeley
Shivaram Venkataraman    UC Berkeley
Sean McNamara            Webtrends
Mridul Muralidharam      Yahoo!
Ram Sriharsha            Yahoo!
Robert Evans             Yahoo!
Thomas Graves            Yahoo!
Table 18 - Top 10 Big Data Vendors by Revenue according to Wikibon.org [77]
2013 Worldwide Big Data Revenue by Vendor ($US millions)
Vendor        Big Data Revenue   Total Revenue   Big Data Revenue as % of Total Revenue   % Big Data Hardware Revenue   % Big Data Software Revenue   % Big Data Services Revenue
IBM           $1,368             $99,751         1%                                       31%                           27%                           42%
SAP           $545               $22,900         2%                                       0%                            76%                           24%
HP            $869               $114,100        1%                                       42%                           14%                           44%
Oracle        $491               $37,552         1%                                       28%                           37%                           36%
Teradata      $518               $2,665          19%                                      36%                           30%                           34%
Microsoft     $280               $83,200         0%                                       0%                            63%                           37%
Pivotal       $300               $300            100%                                     15%                           50%                           35%
Cloudera      $73                $73             100%                                     0%                            53%                           47%
Hortonworks   $55                $55             100%                                     0%                            73%                           27%
MapR          $35                $35             100%                                     0%                            77%                           23%
Total         $18,607            n/a             n/a                                      38%                           22%                           40%
Table 19 – Hadoop-related Apache committers by project and organizations; extracted from www.apache.org on May 1st, 2014
Project
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
PMC / Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Name
Pushpinder Heer
Mike Fagan
Arshak Navruzyan
Ed Kohlwey
Andrew George Wells
Alex Moundalexis
Hung Pham
Jessica Seastrom
Jeff Field
Jonathan M. Hsieh
Ryan Fishel
Vikram Srivastava
Aaron Glahe
Christian Rohling
Ravi Mutyala
Steve Loughran
Ted Yu
Jared Winick
Laura Peaslee
Jim Klucar
Organization
Applied Physics Laboratory
Applied Technical Systems
Arcus Research
Argyle Data
Booz Allen Hamilton
ClearEdgeIT
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Data Tatics
Endgame
Hortonworks
Hortonworks
Hortonworks
Koverse
Objective Solutions, Inc.
Splyt
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Accumulo
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Chris McCubbin
Jonathan Park
Luke Brassard
Michael Allen
Michael Berman
Oren Falkowitz
Phil Eberhardt
Miguel Pereira
Damon Brown
Kevin Faro
Dennis Patrone
Al Krinker
Chris Bennight
David M. Lyle
Ed Coleman
Edward Yoon
Jason Then
Jay Shipper
Jesse Yates
Joe Skora
John Stoneham
Matthew Kirkley
Michael Wall
Morgan Haskel
Nguessan Kouame
Philip Young
Ryan Leary
Sapah Shah
Scott Kuehn
Sean Hickey
Supun Kamburugamuva
Tim Halloran
Tim Reardon
Travis Pinney
Ravi Prakash
Aaron T. Myers
Colin Patrick McCabe
Doug Cutting
Eli Collins
Harsh J
Karthik Kambatla
sqrrl
sqrrl
sqrrl
sqrrl
sqrrl
sqrrl
sqrrl
SRA International, Inc
Tetra Concepts LLC
Tetra Concepts LLC
The Johns Hopkins University
Altiscale, Inc.
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Sandy Ryza
Todd Lipcon
Tom White
Alejandro Abdelnur
Andrew Wang
Mayank Bansal
Dhruba Borthakur
Hairong Kuang
Dmytro Molkov
Scott Chun-Yang Chen
Zheng Shao
Andrzej Bialecki
Arun C Murthy
Arpit Agarwal
Arpit Gupta
Bikas Saha
Brandon Li
Chris Nauroth
Devaraj Das
Enis Soztutar
Giridharan Kesavan
Hitesh Shah
Jian He
Jing Zhao
Jitendra Nath Pandey
Mahadev Konar
Matthew Foley
Owen O'Malley
Ramya Sunil
Sanjay Radia
Siddharth Seth
Steve Loughran
Suresh Srinivas
Tsz Wo (Nicholas) Sze
Vinod Kumar Vavilapalli
Haohui Mai
Xuan Gong
Zhijie Shen
Vinayakumar B
Eric Yang
Kan Zhang
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
ebay
Facebook
Facebook
Facebook
Facebook
Facebook
Getopt
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Hortonworks
Huawei
IBM
IBM
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hadoop
Hbase
Hbase
Hbase
Hbase
Hbase
Hbase
Hbase
Hbase
Hbase
Hbase
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Committer
Nigel Daley
Amareshwari Sriramadasu
Sharad Agarwal
Sreekanth Ramakrishnan
Christophe Taton
Devaraj K
Uma Maheswara Rao G
Allen Wittenauer
Boris Shkolnik
Jakob Homan
Lohit Vijayarenu
Chris Douglas
Ivan Mitic
Roman Shaposhnik
Johan Oskarsson
Raghu Angadi
Matei Zaharia
Junping Du
Luke Lu
Konstantin Boudnik
Konstantin Shvachko
Amar Ramesh Kamat
Robert(Bobby) Evans
Daryn Sharp
Jonathan Eagles
Jason Lowe
Kihwal Lee
Koji Noguchi
Mukund Madhugiri
Tanping Wang
Thomas Graves
Gregory Chanan
Jean-Daniel Cryans
Jonathan Hsieh
Jimmy Xiang
Lars George
Michael Stack
Todd Lipcon
Elliott Clark
Matteo Bertozzi
Gary Helmling
Individual
InMobi
InMobi
InMobi
INRIA
Intel
Intel
LinkedIn
LinkedIn
LinkedIn
MapR
Microsoft
Microsoft
Pivotal
Twitter
Twitter
UC Berkeley
VMware
VMware
WANdisco
WANdisco
Yahoo!
Yahoo!
Yahoo!
Yahoo!
Yahoo!
Yahoo!
Yahoo!
Yahoo!
Yahoo!
Yahoo!
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Cloudera
Continuuity
Hbase | Committer | Jonathan Gray | Continuuity
Hbase | Committer | Ryan Rawson | DrawnToScale
Hbase | Committer | Doug Meil | Explorys
Hbase | Committer | Amitanand S. Aiyer | Facebook
Hbase | Committer | Kannan Muthukkaruppan | Facebook
Hbase | Committer | Karthik Ranganathan | Facebook
Hbase | Committer | Mikhail Bautin | Facebook
Hbase | Committer | Nicolas Spiegelberg | Facebook
Hbase | Committer | Liyin Tang | Facebook
Hbase | Committer | Devaraj Das | Hortonworks
Hbase | Committer | Enis Soztutar | Hortonworks
Hbase | Committer | Jeffrey Zhong | Hortonworks
Hbase | Committer | Nick Dimiduk | Hortonworks
Hbase | Committer | Sergey Shelukhin | Hortonworks
Hbase | Committer | Ted Yu | Hortonworks
Hbase | Committer | Rajeshbabu Chintaguntla | Huawei
Hbase | Committer | Andrew Purtell | Intel
Hbase | Committer | Anoop Sam John | Intel
Hbase | Committer | Ramkrishna S Vasudevan | Intel
Hbase | Committer | Jesse Yates | Salesforce.com
Hbase | Committer | Lars Hofhansl | Salesforce.com
Hbase | Committer | Nicolas Liochon | Scaled Risk
Hbase | Committer | Chunhui Shen | Taobao
Hbase | Committer | Honghua Feng | Xiaomi
Hbase | Committer | Liang Xie | Xiaomi
Hive | Committer | Prasad Mujumdar | Cloudera
Hive | Committer | Gang Tim Liu | Facebook
Hive | Committer | Kevin Wilfong | Facebook
Hive | Committer | Siying Dong | Facebook
Hive | Committer | Daniel Dai | Hortonworks
Hive | Committer | Alan Gates | Hortonworks
Hive | Committer | Jason Dere | Hortonworks
Hive | Committer | Jitendra Pandey | Hortonworks
Hive | Committer | Sushanth Sowmyan | Hortonworks
Hive | Committer | Owen O'Malley | Hortonworks
Hive | Committer | Prasanth Jayachandran | Hortonworks
Hive | Committer | Sergey Shelukhin | Hortonworks
Hive | Committer | Vaibhav Gumashta | Hortonworks
Hive | Committer | Vikram Dixit | Hortonworks
Hive | Committer | Amareshwari Sriramadasu | InMobi
Hive | Committer | Eric Hanson | Microsoft
Hive | Committer | Yin Huai | The Ohio State University
Pig | Committer | Xuefu Zhang | Inadco
Pig | Committer | Mark Wagner | LinkedIn
Pig | Committer | Prashant Kommireddi | Salesforce.com
Pig | Committer | Aniket Mokashi | Twitter
Pig | Committer | Koji Noguchi | Yahoo!
Pig | Committer | Gianmarco De Francisci Morales | Yahoo!
Spark | Committer | Stephen Haberman | Bizo
Spark | Committer | Mark Hamstra | ClearStory Data
Spark | Committer | Aaron Davidson | Databricks
Spark | Committer | Andy Konwinski | Databricks
Spark | Committer | Matei Zaharia | Databricks
Spark | Committer | Patrick Wendell | Databricks
Spark | Committer | Reynold Xin | Databricks
Spark | Committer | Tathagata Das | Databricks
Spark | Committer | Prashant Sharma | Databricks
Spark | Committer | Thomas Dudziak | Groupon
Spark | Committer | Andrew Xia | Intel
Spark | Committer | Jason Dai | Intel
Spark | Committer | Shane Huang | Intel
Spark | Committer | Nick Pentreath | Mxit
Spark | Committer | Imran Rashid | Quantifind
Spark | Committer | Ryan LeCompte | Quantifind
Spark | Committer | Ankur Dave | UC Berkeley
Spark | Committer | Charles Reiss | UC Berkeley
Spark | Committer | Haoyuan Li | UC Berkeley
Spark | Committer | Josh Rosen | UC Berkeley
Spark | Committer | Kay Ousterhout | UC Berkeley
Spark | Committer | Mosharaf Chowdhury | UC Berkeley
Spark | Committer | Shivaram Venkataraman | UC Berkeley
Spark | Committer | Sean McNamara | Webtrends
Spark | Committer | Mridul Muralidharam | Yahoo!
Spark | Committer | Ram Sriharsha | Yahoo!
Spark | Committer | Robert Evans | Yahoo!
Spark | Committer | Thomas Graves | Yahoo!
Tez | Committer | Arun C Murthy | Hortonworks
Tez | Committer | Bikas Saha | Hortonworks
Tez | Committer | Gunther Hagleitner | Hortonworks
Tez | Committer | Hitesh Shah | Hortonworks
Tez | Committer | Siddharth Seth | Hortonworks
Tez | Committer | Mike Liddell | Microsoft
Zookeeper | Committer | Patrick Hunt | Cloudera
Zookeeper | Committer | Henry Robinson | Cloudera
Zookeeper | Committer | Benjamin Reed | Facebook
Zookeeper | Committer | Thawan Kooburat | Facebook
Zookeeper | Committer | Alex Shraer | Google
Zookeeper | Committer | Mahadev Konar | Hortonworks
Zookeeper | Committer | Andrew Kornev | Individual
Zookeeper | Committer | Flavio Junqueira | Microsoft
Zookeeper | Committer | Michi Mutsuzaki | Nicira
Zookeeper | Committer | Camille Fournier | RentTheRunway
Accumulo | PMC member | Benson Margulies | Basis Technology Corp.
Accumulo | PMC member | Drew Farris | Booz Allen Hamilton
Accumulo | PMC member | Bill Havanki | Cloudera
Accumulo | PMC member | Mike Drob | Cloudera
Accumulo | PMC member | Sean Busbey | Cloudera
Accumulo | PMC member | Jason Trost | Endgame
Accumulo | PMC member | Billie Rinaldi | Hortonworks
Accumulo | PMC member | Josh Elser | Hortonworks
Accumulo | PMC member | Aaron Cordova | Koverse
Accumulo | PMC member | William Slacum | Koverse
Accumulo | PMC member | Christopher Tubbs | NSA
Accumulo | PMC member | Corey J. Nolet | Objective Solutions, Inc.
Accumulo | PMC member | Dave Marion | Objective Solutions, Inc.
Accumulo | PMC member | Keith Turner | Peterson Technologies
Accumulo | PMC member | Brian Loss | Praxis Engineering
Accumulo | PMC member | Adam Fuchs | sqrrl
Accumulo | PMC member | John Vines | sqrrl
Accumulo | PMC member | Eric Newton | SW Complete Inc.
Accumulo | PMC member | Chris Waring | (none listed)
Accumulo | PMC member | David Medinets | (none listed)
Hadoop | PMC member | Aaron T. Myers | Cloudera
Hadoop | PMC member | Doug Cutting | Cloudera
Hadoop | PMC member | Eli Collins | Cloudera
Hadoop | PMC member | Patrick Hunt | Cloudera
Hadoop | PMC member | Michael Stack | Cloudera
Hadoop | PMC member | Todd Lipcon | Cloudera
Hadoop | PMC member | Tom White | Cloudera
Hadoop | PMC member | Alejandro Abdelnur | Cloudera
Hadoop | PMC member | Dhruba Borthakur | Facebook
Hadoop | PMC member | Hairong Kuang | Facebook
Hadoop | PMC member | Zheng Shao | Facebook
Hadoop | PMC member | Arun C Murthy | Hortonworks
Hadoop | PMC member | Devaraj Das | Hortonworks
Hadoop | PMC member | Enis Soztutar | Hortonworks
Hadoop | PMC member | Giridharan Kesavan | Hortonworks
Hadoop | PMC member | Hitesh Shah | Hortonworks
Hadoop | PMC member | Jitendra Nath Pandey | Hortonworks
Hadoop | PMC member | Mahadev Konar | Hortonworks
Hadoop | PMC member | Matt Foley | Hortonworks
Hadoop | PMC member | Owen O'Malley | Hortonworks
Hadoop | PMC member | Sanjay Radia | Hortonworks
Hadoop | PMC member | Siddharth Seth | Hortonworks
Hadoop | PMC member | Steve Loughran | Hortonworks
Hadoop | PMC member | Suresh Srinivas | Hortonworks
Hadoop | PMC member | Tsz Wo (Nicholas) Sze | Hortonworks
Hadoop | PMC member | Vinod Kumar Vavilapalli | Hortonworks
Hadoop | PMC member | Hemanth Yamijala | Individual
Hadoop | PMC member | Amareshwari Sriramadasu | InMobi
Hadoop | PMC member | Sharad Agarwal | InMobi
Hadoop | PMC member | Uma Maheswara Rao G | Intel
Hadoop | PMC member | Nigel Daley | Jive
Hadoop | PMC member | Jakob Homan | LinkedIn
Hadoop | PMC member | Chris Douglas | Microsoft
Hadoop | PMC member | Raghu Angadi | Twitter
Hadoop | PMC member | Luke Lu | VMware
Hadoop | PMC member | Konstantin Shvachko | WANdisco
Hadoop | PMC member | Robert(Bobby) Evans | Yahoo!
Hadoop | PMC member | Daryn Sharp | Yahoo!
Hadoop | PMC member | Jonathan Eagles | Yahoo!
Hadoop | PMC member | Jason Lowe | Yahoo!
Hadoop | PMC member | Kihwal Lee | Yahoo!
Hadoop | PMC member | Thomas Graves | Yahoo!
Hive | PMC member | Brock Noland | Cloudera
Hive | PMC member | Xuefu Zhang | Cloudera
Hive | PMC member | Lefty Leverenz | Doc of the Bay
Hive | PMC member | Yongqiang He | Dropbox
Hive | PMC member | Ning Zhang | Facebook
Hive | PMC member | Raghotham Murthy | Facebook
Hive | PMC member | Gunther Hagleitner | Hortonworks
Hive | PMC member | Ashutosh Chauhan | Hortonworks
Hive | PMC member | Thejas Nair | Hortonworks
Hive | PMC member | Harish Butani | Hortonworks
Hive | PMC member | Carl Steinbach | LinkedIn
Hive | PMC member | Edward Capriolo | m6d
Hive | PMC member | Navis Ryu | NexR
Hive | PMC member | Namit Jain | Nutanix
Hive | PMC member | Ashish Thusoo | Qubole
Hive | PMC member | Joydeep Sensarma | Qubole
Pig | PMC member | Santhosh Srinivasan | Cloudera
Pig | PMC member | Daniel Dai | Hortonworks
Pig | PMC member | Alan Gates | Hortonworks
Pig | PMC member | Giridharan Kesavan | Hortonworks
Pig | PMC member | Ashutosh Chauhan | Hortonworks
Pig | PMC member | Thejas Nair | Hortonworks
Pig | PMC member | Richard Ding | IBM
Pig | PMC member | Cheolsoo Park | Netflix
Pig | PMC member | Bill Graham | Twitter
Pig | PMC member | Dmitriy Ryaboy | Twitter
Pig | PMC member | Jonathan Coveney | Twitter
Pig | PMC member | Julien Le Dem | Twitter
Pig | PMC member | Olga Natkovich | Yahoo!
Pig | PMC member | Rohini Palaniswamy | Yahoo!
Zookeeper | PMC member | Patrick Hunt | Cloudera
Zookeeper | PMC member | Henry Robinson | Cloudera
Zookeeper | PMC member | Benjamin Reed | Facebook
Zookeeper | PMC member | Mahadev Konar | Hortonworks
Zookeeper | PMC member | Ted Dunning | MapR
Zookeeper | PMC member | Flavio Junqueira | Microsoft
Zookeeper | PMC member | Michi Mutsuzaki | Nicira
Zookeeper | PMC member | Camille Fournier | RentTheRunway
Zookeeper | PMC member | Ivan Kelly | Yahoo!
Table 20 - ISVs and Technology Partners Matrix. Black cells (shown here as 1) indicate that a partnership arrangement exists; data extracted from
www.hortonworks.com, www.cloudera.com and www.mapr.com on November 27th, 2014
Complement Vendor | Cloudera | Hortonworks | MapR
0xdata | 1 | 0 | 1
Abitech Software | 0 | 1 | 0
Acentrix | 0 | 0 | 1
Actian | 1 | 1 | 0
Actuate | 1 | 1 | 0
Adatao | 1 | 0 | 0
Admatic | 0 | 1 | 0
Aeronomy | 0 | 0 | 1
Aeverie Inc. | 0 | 0 | 1
Affini-Tech | 0 | 0 | 1
Aha! Software | 1 | 0 | 0
AllianceONE | 0 | 0 | 1
AlphaSix Corporation | 0 | 0 | 1
Alpine Data Labs | 1 | 1 | 1
Alteryx | 1 | 1 | 1
Amazon | 0 | 0 | 1
Amdocs | 0 | 1 | 0
Anchormen | 0 | 0 | 1
Apara Solutions | 0 | 0 | 1
Apervi | 0 | 1 | 0
APEXCNS | 0 | 0 | 1
Apigee | 0 | 1 | 0
Appcara | 1 | 0 | 0
Appfluent | 1 | 1 | 1
AquaFold | 1 | 0 | 0
Argil Data | 0 | 1 | 0
Argyle Data | 0 | 1 | 0
Arieso | 0 | 1 | 0
Atigeo | 1 | 0 | 0
AtScale | 1 | 1 | 0
Attivio | 1 | 0 | 0
Attunity | 1 | 1 | 0
Ayasdi | 1 | 1 | 0
Aziksa | 0 | 0 | 1
Azul Systems | 1 | 0 | 0
Basement Supercomputing | 0 | 1 | 0
Basis Technology | 1 | 0 | 0
BC Cloud | 0 | 0 | 1
BDI Systems | 0 | 0 | 1
BeagleData | 0 | 0 | 1
Big Data Elephants, Inc. | 0 | 0 | 1
Big Data Partnership | 0 | 0 | 1
Big Switch Networks | 0 | 1 | 0
BioDatomics | 1 | 0 | 0
BIPD Ltd | 0 | 0 | 1
BIPortal GmbH | 0 | 0 | 1
Birst, Inc | 1 | 0 | 0
Bit Stew Systems | 0 | 1 | 0
Blue Canopy Group, LLC | 0 | 0 | 1
BlueData | 1 | 1 | 0
BMC Software | 0 | 1 | 0
Booz Allen Hamilton | 0 | 0 | 1
BPM-Conseil | 1 | 0 | 0
BrainPad Inc. | 0 | 0 | 1
Bright Computing | 1 | 0 | 0
Brillio | 0 | 0 | 1
Broadgate Inc | 0 | 0 | 1
Calpont Corporation | 1 | 0 | 0
Canonical | 1 | 0 | 1
CAS | 0 | 0 | 1
Caserta Concepts | 0 | 0 | 1
Celer Technologies | 1 | 0 | 0
Centerity Systems, Inc. | 0 | 0 | 1
Centrify | 0 | 1 | 0
Century Link | 0 | 0 | 1
Ciber | 0 | 0 | 1
cimt AG | 0 | 0 | 1
Cirro | 1 | 0 | 0
Cisco | 0 | 1 | 1
ClearDATA | 0 | 0 | 1
Cleo | 0 | 1 | 0
Cloud A | 0 | 1 | 0
Cloudian | 0 | 1 | 0
Cloudsoft | 1 | 0 | 0
Comma Soft | 0 | 1 | 0
Composite Software | 1 | 0 | 0
Compsesa | 0 | 1 | 0
Computertekk | 0 | 1 | 0
Compuware | 0 | 1 | 0
comSysto | 0 | 0 | 1
Concurrent | 1 | 1 | 1
Contexti | 0 | 0 | 1
Continuent | 1 | 1 | 1
Continuuity | 1 | 1 | 0
Couchbase | 1 | 1 | 0
CSC | 0 | 1 | 0
Cumulus Networks | 0 | 1 | 0
Data Center Warehouse | 0 | 1 | 1
Data Tactics Corporation | 0 | 0 | 1
Databox | 1 | 0 | 0
Databricks | 0 | 1 | 1
Datagres | 1 | 0 | 0
Dataguise | 1 | 1 | 1
DataHub | 0 | 0 | 1
Dataiku | 0 | 0 | 1
Datalakes | 1 | 1 | 0
Datameer | 1 | 1 | 1
DataRPM | 1 | 1 | 1
DataStax | 1 | 1 | 0
DataTorrent | 1 | 1 | 1
Datawatch | 1 | 1 | 0
DBSync | 1 | 0 | 0
Dell | 1 | 1 | 0
Denodo | 1 | 1 | 0
Digital Reasoning | 1 | 0 | 1
DigitalRoute | 0 | 1 | 0
Diyotta | 1 | 1 | 1
Dragonfly Data Factory | 0 | 0 | 1
eCapital Advisors | 0 | 0 | 1
Edis Consulting | 0 | 0 | 1
Elasticsearch | 1 | 1 | 1
EngineRoom.io | 1 | 0 | 0
Envision IT Group | 0 | 0 | 1
Eruces | 1 | 0 | 0
Esri | 1 | 0 | 0
eTouch Systems | 0 | 0 | 1
Eucalyptus Systems, Inc | 1 | 0 | 0
Exar | 0 | 1 | 0
Exasol | 1 | 0 | 0
Excedis | 0 | 1 | 0
Expert System | 1 | 0 | 0
Feedzai | 1 | 0 | 0
FICO | 1 | 0 | 0
First Light Technologies | 0 | 0 | 1
Formation Data Systems | 0 | 1 | 0
FORMCEPT Technologies | 1 | 0 | 0
Fortscale | 1 | 0 | 0
Fusionex | 0 | 1 | 0
Fusion-io | 0 | 0 | 1
Fuzzy Logix, LLC. | 1 | 0 | 0
Globalscape | 0 | 1 | 0
Globant | 0 | 0 | 1
GoGrid | 1 | 1 | 0
Google | 0 | 0 | 1
Grand Logic, Inc. | 1 | 0 | 0
GraphLab | 1 | 1 | 0
GrayMatter | 0 | 0 | 1
Gruter | 0 | 1 | 0
GTRI | 0 | 0 | 1
H2O | 0 | 1 | 0
Hadapt | 1 | 0 | 0
HP | 0 | 1 | 1
HStreaming | 1 | 0 | 0
IBM | 1 | 1 | 1
Ideation816 Corporation | 0 | 0 | 1
IKANOW | 0 | 1 | 0
Impetus Technologies | 0 | 0 | 1
Indigo New Zealand Limited | 0 | 0 | 1
Infobright | 0 | 0 | 1
Infochimps, Inc | 1 | 0 | 0
Informatica | 0 | 1 | 1
Information Builders | 1 | 1 | 1
InfoTrellis | 0 | 1 | 0
Ingenious Qube | 0 | 1 | 0
InsightsOne | 0 | 1 | 0
IntegriChain | 0 | 1 | 0
Interactive Algorithms Inc. | 0 | 0 | 1
is-land Systems Inc. | 0 | 1 | 1
iTalent Corporation | 0 | 0 | 1
Jaspersoft | 1 | 0 | 1
JethroData | 1 | 0 | 1
Jinfonet Software | 0 | 1 | 1
Jinfonet Software | 0 | 1 | 1
Joyent | 1 | 1 | 0
Kapow Software | 1 | 0 | 0
Karmasphere | 1 | 0 | 1
Keylink Technology | 0 | 0 | 1
Knime.com | 1 | 0 | 0
Knowledgent | 0 | 0 | 1
Kognitio | 0 | 1 | 0
Koverse | 1 | 1 | 1
KPI Partners Inc. | 0 | 0 | 1
LG CNS | 0 | 1 | 1
Likya Teknoloji | 0 | 1 | 0
Logi Analytics | 1 | 1 | 0
Looker | 1 | 0 | 0
LSI | 0 | 1 | 0
Lucidworks | 1 | 1 | 1
ManTech | 0 | 0 | 1
MarkLogic | 0 | 1 | 0
MBI Solutions | 1 | 0 | 0
Mellanox Technologies | 0 | 0 | 1
MetiStream | 0 | 0 | 1
Metric Insights | 1 | 0 | 0
Micromata | 0 | 0 | 1
Microsoft | 1 | 1 | 0
MicroStrategy | 1 | 1 | 0
Mikan Associates | 0 | 0 | 1
MisOne Solution | 0 | 0 | 1
MongoDB | 1 | 1 | 1
MSR Cosmos, LLC | 0 | 0 | 1
Narus | 1 | 1 | 1
Nautilus Technologies | 0 | 0 | 1
NetApp | 0 | 0 | 1
New Relic | 1 | 0 | 0
NFLabs | 1 | 0 | 0
NGData | 1 | 0 | 0
Nimbix | 1 | 0 | 1
Nimble Storage | 0 | 1 | 0
NorCom | 0 | 0 | 1
Novetta Solutions | 1 | 1 | 0
NS Solutions | 0 | 0 | 1
NTC Vulkan | 0 | 0 | 1
Nutanix | 0 | 1 | 0
NxtGen | 1 | 0 | 0
O2MC | 0 | 1 | 0
OCTO | 0 | 0 | 1
Onepoint IQ | 0 | 0 | 1
Onramp Corporation | 0 | 0 | 1
OnX Enterprise Solutions | 0 | 0 | 1
Open V | 0 | 1 | 0
OpenOsmium | 0 | 0 | 1
Options I/O | 0 | 0 | 1
Oracle | 1 | 1 | 0
Orzota | 0 | 0 | 1
OS Nexus | 1 | 0 | 0
ParAccel | 1 | 0 | 0
Paxata | 1 | 0 | 0
Pentaho | 1 | 1 | 1
Pepperdata | 1 | 1 | 0
Persistent Systems | 0 | 0 | 1
Pervasive | 1 | 0 | 0
PetaSecure, Inc. | 0 | 1 | 0
PHEMI | 0 | 1 | 0
Platfora | 1 | 1 | 1
Plivo | 1 | 0 | 0
Podium Data | 0 | 1 | 0
Polyform Labs | 1 | 0 | 0
Pragmatix Services | 0 | 1 | 0
Pragsis | 0 | 0 | 1
Predixion Software | 1 | 0 | 0
Prime Dimensions, LLC | 0 | 0 | 1
Protegrity | 1 | 1 | 1
PSSC Labs | 0 | 1 | 1
Puppet Labs | 1 | 0 | 0
Qlik | 1 | 1 | 0
Quaero | 1 | 0 | 0
QuantCell Research | 1 | 0 | 0
Qubole | 0 | 1 | 0
Quest | 1 | 0 | 0
QuickLogix LLC | 0 | 0 | 1
Rackspace | 0 | 1 | 0
Radoop | 1 | 0 | 0
RainStor | 1 | 1 | 1
Red Gate | 0 | 1 | 0
Red Hat | 0 | 1 | 1
RedOwl Analytics | 1 | 0 | 0
RedPoint Global | 1 | 1 | 1
Reltio | 1 | 0 | 0
Revelytix | 1 | 1 | 1
Revolution Analytics | 1 | 1 | 1
RTTS | 0 | 1 | 0
SAP | 1 | 1 | 1
SAS | 1 | 1 | 0
Scaled Risk | 0 | 1 | 0
ScaleOut Software | 1 | 1 | 0
Search Technologies | 1 | 0 | 0
Securonix | 1 | 0 | 0
Semantic Research | 1 | 0 | 0
Sematext | 1 | 1 | 0
SequenceIQ | 0 | 1 | 0
SequoiaDB | 1 | 0 | 0
Serendio | 0 | 1 | 0
Servient | 1 | 1 | 0
SGI | 0 | 1 | 0
SHS-Viveon | 0 | 0 | 1
Simba | 1 | 1 | 0
Sisense | 1 | 0 | 0
Skytree | 1 | 1 | 1
Smart Platform | 1 | 0 | 0
SMP Management AG | 0 | 0 | 1
SnapLogic | 1 | 1 | 0
Softlayer | 1 | 0 | 0
SoftNet Solutions | 0 | 0 | 1
Solarflare | 1 | 0 | 1
Solix Technologies | 1 | 1 | 0
Sophias | 0 | 0 | 1
Spectra Logic | 0 | 1 | 0
Splice Machine | 1 | 0 | 1
Splunk | 1 | 1 | 1
Spring | 0 | 1 | 0
SQLstream | 1 | 0 | 0
Sqrrl | 1 | 1 | 1
StackIQ | 1 | 1 | 1
SteamBase | 1 | 0 | 0
SUSE | 1 | 1 | 0
Syncsort | 1 | 1 | 1
SYNTASA | 0 | 1 | 0
Tableau | 1 | 1 | 1
Talend | 1 | 1 | 1
Tamr | 1 | 1 | 0
Targit | 1 | 0 | 0
Tata Consultancy Services | 0 | 0 | 1
Teradata | 0 | 1 | 1
Tervela | 1 | 0 | 0
Think Big Analytics | 0 | 0 | 1
TIBCO Software | 1 | 1 | 0
Tidemark | 1 | 0 | 0
Trace3 | 0 | 0 | 1
Transcend Business Intelligence | 0 | 1 | 0
TransLattice, Inc. | 1 | 0 | 0
Trendwise Analytics | 0 | 0 | 1
Tresata | 1 | 1 | 0
Trifacta | 1 | 1 | 0
Tri-IT Solutions | 0 | 0 | 1
Tugbiz | 1 | 0 | 0
Twingo | 0 | 0 | 1
Typesafe | 0 | 1 | 0
Ubeeko | 1 | 1 | 1
Ubuntu | 1 | 0 | 0
UL Environment | 0 | 0 | 1
Unbelievable Machine | 0 | 0 | 1
Univa | 1 | 0 | 1
Vanilla | 0 | 0 | 1
Veristorm | 1 | 1 | 0
Vintech Solutions, Inc | 0 | 0 | 1
Violin Memory | 0 | 0 | 1
VMware | 1 | 1 | 1
Voltage Security | 1 | 1 | 1
VoltDB | 1 | 0 | 1
Vormetric | 1 | 1 | 1
WANdisco | 1 | 1 | 0
Waterline Data Science | 0 | 1 | 0
WE-Ankor | 0 | 0 | 1
WHISHWORKS | 0 | 0 | 1
WhiteKlay | 0 | 0 | 1
Wibidata, Inc | 1 | 0 | 0
World Wide Technology | 0 | 0 | 1
X15 Software | 1 | 1 | 0
Xenolytics | 0 | 1 | 0
Xiilab | 0 | 1 | 0
XOR Security | 1 | 1 | 0
Xplenty | 1 | 1 | 0
Yeswici LLC | 0 | 1 | 0
Ysance | 0 | 0 | 1
Z Data Inc. | 0 | 0 | 1
Zaloni | 0 | 0 | 1
Zementis | 1 | 1 | 0
Zettaset | 0 | 1 | 0
Zettics | 0 | 1 | 0
Zoho WebNMS | 1 | 0 | 0
Zoomdata | 1 | 0 | 0
Zuhlke Engineering | 0 | 1 | 0
Total Number of Vendors | 164 | 156 | 159
List of Figures
Figure 1 – A system dynamics model of direct network effects
Figure 2 – A system dynamics model of a two-sided platform
Figure 3 – Roles and Relationships in a Platform-Mediated Network
Figure 4 – Linux market share in various computing segments
Figure 5 – Results from the "Java Use and Awareness Study" from BZ Research, 2005
Figure 6 – Eclipse Project Committer by Company
Figure 7 – Porter's Five Forces Model
Figure 8 – Grove’s Six Forces Diagram
Figure 9 – The Android Platform
Figure 10 – Inter-network and Intra-network Competition
Figure 11 – Hierarchy of influence within an Apache Software Foundation project
Figure 12 – The Purchase Process of Complements
Figure 13 – Example of Platform Fragmentation
Figure 14 – Ecosystem Hijacking
Figure 15 – High-level Architecture of Android and Blackberry OS 10
Figure 16 – Google search trends of “Hadoop” and “Big Data” vs. “Data Warehouse”
Figure 17 – Major Building Blocks of a Hadoop Application Stack
Figure 18 – Diagram of basic MapReduce execution
Figure 19 – Cumulative Investments in Pureplay Hadoop Vendors
Figure 20 – Big Data-related Software and Services Revenue
Figure 21 – Official Committers to Apache Spark by Organization
Figure 22 – Hadoop Contributors by Organization
Figure 23 – Decision-tree for Selecting Development Model
Figure 24 – Analysis of Exclusivity of Partnership Arrangements for Hadoop ISVs
Figure 25 – Hadoop in 2011 vs. 2014
Figure 26 – Displacing Core Components
List of Tables
Table 1 – Open source platforms by commercial firms
Table 2 – Comparison of Openness by Role in Platform-mediated Networks
Table 3 – Taxonomy of Envelopment Attacks
Table 4 – Ten criteria of open source software
Table 5 – Apple, IBM and Sun Microsystems' Involvement in Open source
Table 6 – Summary of Strategic Considerations for Open Source Platform Vendors
Table 7 – AOSP-derived Products by Google Competitors
Table 8 – Google's Shift of Investment into Proprietary Capabilities
Table 9 – Decision Making Authorities in Different Open Source Communities
Table 10 – The Three V's of Big Data
Table 11 – A Selection of SQL on Hadoop offerings
Table 12 – Breakdown of Hadoop-market according to Forrester Research
Table 13 – Cumulative Investments in Pureplay Hadoop Vendors
Table 14 – Sample Hadoop Positioning Statements by Enterprise Software vendors
Table 15 – Partnership Matrix Between Pure play Vendors and Enterprise Software Vendors
Table 16 – Partnerships between Pure play Hadoop vendors and cloud IaaS vendors
Table 17 – Committers to Apache Spark
Table 18 – Top 10 Big Data Vendors by Revenue
Table 19 – Hadoop-related Apache committers by project and organizations
Table 20 – ISVs and Technology Partners Matrix
References
[1] R. Stallman, “The GNU operating system and the free software movement,” Open sources
Voices from open source Revolut., 1999.
[2] R. Gilbert and M. Katz, “An economist’s guide to US v. Microsoft,” J. Econ. Perspect.,
2001.
[3] M. Olson, “The Cloudera Model,” LinkedIn, 2013. [Online]. Available:
http://www.linkedin.com/today/post/article/20131003190011-29380071-the-clouderamodel. [Accessed: 31-Mar-2014].
[4] A. S. Grove, Only the Paranoid Survive. Doubleday, 1996.
[5] R. Schmalensee, “Jeffrey Rohlfs’ 1974 Model of Facebook,” vol. 7, no. 1, 2011.
[6] J. Rohlfs, “A theory of interdependent demand for a communications service,” Bell J.
Econ. Manag. …, 1974.
[7] M. Katz and C. Shapiro, “Network externalities, competition, and compatibility,” Am.
Econ. Rev., vol. 75, no. 3, pp. 424–440, 1985.
[8] A. Gawer and R. Henderson, “Platform owner entry and innovation in complementary
markets: Evidence from Intel,” J. Econ. Manag. …, vol. 16, no. 1, pp. 1–34, 2007.
[9] A. Gawer and M. Cusumano, “Industry platforms and ecosystem innovation,” J. Prod.
Innov. …, 2013.
[10] M. Cusumano and A. Gawer, Platform Leadership: How Intel, Microsoft, and Cisco Drive
Industry Innovation [Hardcover]. Harvard Business Press; 1 edition, 2002, p. 305.
[11] O. de Weck, E. Suh, and D. Chang, “Product family strategy and platform design
optimization,” pp. 1–38, 2004.
[12] M. Cusumano, “Technology strategy and management: The evolution of platform
thinking,” Commun. ACM, vol. 53, no. 1, p. 32, Jan. 2010.
[13] K. Boudreau, “Let a thousand flowers bloom? An early look at large numbers of software
app developers and patterns of innovation,” Organ. Sci., 2012.
[14] G. Parker and M. Van Alstyne, “Two-sided network effects: A theory of information
product design,” Manage. Sci., 2005.
[15] T. Eisenmann, “Opening platforms: how, when and why?,” This Pap. has been …, 2008.
[16] A. Gawer, “The organization of platform leadership: an empirical investigation of intel’s
management processes aimed at fostering complementary innovation by third,” 2000.
[17] R. Henderson and K. Clark, “Architectural innovation: the reconfiguration of existing
product technologies and the failure of established firms,” Adm. Sci. Q., 1990.
[18] T. Eisenmann, G. Parker, and M. W. Van Alstyne, “Platform Envelopment,” 2007.
[19] D. Teece, “Profiting from technological innovation: Implications for integration,
collaboration, licensing and public policy,” Res. Policy, vol. 15, no. February, pp. 285–
305, 1986.
[20] A. Gillen, “Worldwide Client and Server Operating Environments 2013–2017 Forecast
and 2012 Vendor Shares,” IDC Research, 2013. [Online]. Available:
http://www.idc.com.libproxy.mit.edu/getdoc.jsp?containerId=243003. [Accessed: 17-Jul-2014].
[21] “Operating system Family / Linux | TOP500 Supercomputer Sites,” 2014. [Online].
Available: http://www.top500.org/statistics/details/osfam/1. [Accessed: 17-Jul-2014].
[22] C. DiBona and S. Ockman, Open sources: Voices from the open source revolution. 1999.
[23] O. S. Initiative, “History of the OSI,” About the OSI. [Online]. Available:
http://opensource.org/about. [Accessed: 15-Sep-2014].
[24] S. Krishnamurthy, “An analysis of open source business models,” Perspect. Free open
source Softw., 2005.
[25] J. West, “How open is open enough?: Melding proprietary and open source platform
strategies,” Res. Policy, 2003.
[26] P. G. Capek, S. P. Frank, S. Gerdt, and D. Shields, “A history of IBM’s open source
involvement and strategy,” IBM Syst. J., vol. 44, no. 2, pp. 249–257, 2005.
[27] N. Economides and E. Katsamakas, “Two-sided competition of proprietary vs. open
source technology platforms and the implications for the software industry,” Manage. Sci.,
2006.
[28] “Microsoft Uses Open source Code Despite Denying Use of Such Software - WSJ.”
[Online]. Available: http://online.wsj.com/news/articles/SB992819157437237260.
[Accessed: 04-Aug-2014].
[29] S. O’Mahony, F. Diaz, and E. Mamas, “IBM and Eclipse (A),” Harvard Bus. Sch. Case, pp.
1–20, 2005.
[30] M. Cusumano and A. Gawer, “The elements of platform leadership,” IEEE Eng. Manag.
Rev., vol. 43, no. 3, 2003.
[31] M. Porter, How competitive forces shape strategy. 1979.
[32] O. Alliance, “Open handset alliance,” Retrieved August, 2011.
[33] “Nokia X products - Nokia.” [Online]. Available:
http://www.microsoft.com/en/mobile/phones/nokia-x/. [Accessed: 23-Dec-2014].
[34] J. Osawa, Chinese Software to Challenge Android - WSJ.com. Online.wsj.com, 2012.
[35] Baidu prepares mobile operating system. Financial Times, 2011.
[36] R. Brandom, This is Nokia X: Android and Windows Phone collide. The Verge, 2013.
[37] I. Research, “Worldwide Smartphone Shipments Edge Past 300 Million Units in the
Second Quarter,” IDC Research, 2014. [Online]. Available:
http://www.idc.com/getdoc.jsp?containerId=prUS25037214. [Accessed: 18-Aug-2014].
[38] R. Amadeo, “Google’s iron grip on Android: Controlling open source by any means
necessary | Ars Technica,” Arstechnica, 2013. [Online]. Available:
http://arstechnica.com/gadgets/2013/10/googles-iron-grip-on-android-controlling-open-source-by-any-means-necessary/. [Accessed: 12-Aug-2014].
[39] J. Brodkin, “Google blocked Acer’s rival phone to prevent Android ‘fragmentation’ | Ars
Technica,” Arstechnica, 2012. [Online]. Available:
http://arstechnica.com/gadgets/2012/09/google-blocked-acers-rival-phone-to-prevent-android-fragmentation/. [Accessed: 19-Aug-2014].
[40] “How the ASF works.” [Online]. Available: http://www.apache.org/foundation/how-it-works.html. [Accessed: 01-Sep-2014].
[41] A. S. Foundation, “Project Management Committee Guide,” 2012. [Online]. Available:
http://www.apache.org/dev/pmc.html#what-is-a-pmc. [Accessed: 30-Aug-2014].
[42] E. S. Foundation, “Eclipse Development Process 2011,” Eclipse Development Process,
2011. [Online]. Available:
https://www.eclipse.org/projects/dev_process/development_process.php#4_6_1_PMC.
[Accessed: 30-Aug-2014].
[43] “Understanding the Open Source Development Model.” [Online]. Available:
file:///D:/Downloads/lf_os_devel_model.pdf. [Accessed: 30-Aug-2014].
[44] “Roles and Leadership – Mozilla.” [Online]. Available: https://www.mozilla.org/en-US/about/governance/roles/. [Accessed: 30-Aug-2014].
[45] “List of Projects | projects.eclipse.org.” [Online]. Available: https://projects.eclipse.org/.
[Accessed: 23-Dec-2014].
[46] N. Daidj and T. Isckia, “Entering the economic models of game console manufacturers,”
Commun. Strateg., 2009.
[47] J. Prieger and W. Hu, “Applications barrier to entry and exclusive vertical contracts in
platform markets,” Econ. Inq., 2012.
[48] “Frequently Asked Questions | Android Open Source.” [Online]. Available:
https://source.android.com/faqs.html#what-is-the-role-of-google-play-in-compatibility.
[Accessed: 30-Oct-2014].
[49] “BlackBerry, Amazon Licensing Agreement to Bring Thousands of New Apps | Inside
BlackBerry.” [Online]. Available: http://blogs.blackberry.com/2014/06/amazonappstore/?utm_medium=social&utm_source=TWITTER:BlackBerry&utm_campaign=Apps&linkId=8550417. [Accessed: 30-Oct-2014].
[50] “Ahead Of Smartphone Launch, Amazon Announces Its Appstore Has Tripled Year-Over-Year To 240,000 Apps | TechCrunch.” [Online]. Available:
http://techcrunch.com/2014/06/16/ahead-of-smartphone-launch-amazon-announces-its-appstore-has-tripled-year-over-year-to-240000-apps/. [Accessed: 30-Oct-2014].
[51] “Portland Project hits 1.0 milestone | Ars Technica.” [Online]. Available:
http://arstechnica.com/uncategorized/2006/10/7977/. [Accessed: 10-Sep-2014].
[52] “Meet the BlackBerry wizardry that created its ‘better Android than Android’ • The
Register.” [Online]. Available:
http://www.theregister.co.uk/2013/11/25/revealed_how_blackberry_made_its_better_android_than_android/. [Accessed: 03-Sep-2014].
[53] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,”
Communications of the ACM, 2005. [Online]. Available:
http://research.google.com/archive/mapreduce-osdi04-slides/index.html. [Accessed: 02-Apr-2014].
[54] S. Ghemawat, H. Gobioff, and S. Leung, “The Google file system,” ACM SIGOPS Oper.
Syst. …, 2003.
[55] Hadoop: The Definitive Guide [Paperback]. O’Reilly Media; Third Edition edition, 2012,
p. 688.
[56] S. Lohr, “The Origins of ‘Big Data’: An Etymological Detective Story,” New York Times,
2013. [Online]. Available: http://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/?_php=true&_type=blogs&_r=0. [Accessed: 23-Sep-2014].
[57] “A Personal Perspective on the Origin(s) and Development of ‘Big Data’: The
Phenomenon, the Term, and the Discipline,” 2012.
[58] D. Laney, “3D data management: Controlling data volume, velocity and variety,” META
Gr. Res. Note, 2001.
[59] M. Beyer and D. Laney, “The Importance of ‘Big Data’: A Definition,” Stamford, CT Gart.,
2012.
[60] E. F. Codd, “A relational model of data for large shared data banks,” Commun. ACM, vol.
13, no. 6, pp. 377–387, Jun. 1970.
[61] N. Shamgunov, “Scaling Up And Out,” Dr. Dobb’s, 2012. [Online]. Available:
http://www.drdobbs.com/database/scaling-up-and-out/240142249. [Accessed: 11-Oct-2014].
[62] M. Gualtieri and N. Yuhanna, “The Forrester Wave™: Big Data Hadoop,” 2014.
[63] “Gartner Says Business Intelligence and Analytics Need to Scale Up to Support Explosive
Growth in Data Sources.” [Online]. Available:
http://www.gartner.com/newsroom/id/2313915. [Accessed: 02-Nov-2014].
[64] J. Kelly, “Data Warehouse Vendors Moving To Contain The Hadoop Threat,” Wikibon,
2014. [Online]. Available:
http://wikibon.org/wiki/v/Data_Warehouse_Vendors_Moving_to_Contain_the_Hadoop_Threat. [Accessed: 03-Nov-2014].
[65] “Google Trends - Web Search interest: hadoop, big data, data warehouse - Worldwide,
2004 - present.” [Online]. Available:
http://www.google.com/trends/explore#q=Hadoop%2C Big Data%2C Data warehouse&cmpt=q. [Accessed: 03-Nov-2014].
[66] “HDFS Alternatives - Hadoop Ecosystem.” [Online]. Available:
http://hadoopecosystem.whatazoo.com/home/services/core-layers/persist/hdfs/hdfsalternatives. [Accessed: 06-Dec-2014].
[67] “Apache Hadoop YARN – Background and an Overview - Hortonworks.” [Online].
Available: http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/. [Accessed: 16-Oct-2014].
[68] M. Zaharia and M. Chowdhury, “Spark: cluster computing with working sets,” … cloud
Comput., 2010.
[69] “Spark Incubation Status - Apache Incubator.” [Online]. Available:
http://incubator.apache.org/projects/spark.html. [Accessed: 10-Nov-2014].
[70] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel
programs from sequential building blocks,” ACM SIGOPS Oper. …, 2007.
[71] “Apache Hadoop 2 is now GA! - Hortonworks.” [Online]. Available:
http://hortonworks.com/blog/apache-hadoop-2-is-ga/. [Accessed: 25-Oct-2014].
[72] “Community Effort Driving Standardization of Apache Spark Through Expanded Role in
Hadoop Projects.” [Online]. Available:
http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/2014/07/01/community-effort-driving-standardization-of-apache-spark-through.html. [Accessed: 10-Nov-2014].
[73] C. Olston, B. Reed, and U. Srivastava, “Pig latin: a not-so-foreign language for data
processing,” Proc. 2008 …, 2008.
[74] “Ambari Incubation Status - Apache Incubator.” [Online]. Available:
http://incubator.apache.org/projects/ambari.html. [Accessed: 01-Nov-2014].
[75] CrunchBase, “CrunchBase Data Exports,” 2014. [Online]. Available:
http://info.crunchbase.com/about/crunchbase-data-exports/. [Accessed: 04-Nov-2014].
[76] Hortonworks, “Form S-1 Registration Statement under the securities act of 1933,” 2014.
[Online]. Available:
http://www.sec.gov/Archives/edgar/data/1610532/000119312514405390/d748349ds1.htm
. [Accessed: 17-Nov-2014].
[77] “Big Data Vendor Revenue And Market Forecast 2013-2017 - Wikibon.” [Online].
Available:
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017.
[Accessed: 03-Nov-2014].
[78] P. Zikopoulos, D. deRoos, K. Parasuraman, T. Deutsch, D. Corrigan, and J. Giles, Harness the Power of Big Data - The IBM Big Data
Platform.
[79] H. Blog, “SAP + Hortonworks = Instant Access + Infinite Scale with HANA + Hadoop.”
[Online]. Available: http://hortonworks.com/partner/sap/. [Accessed: 08-Nov-2014].
[80] Oracle Corporation, “Introduction to Oracle Database,” 2013. [Online]. Available:
http://docs.oracle.com/cd/E11882_01/server.112/e25789/intro.htm#CNCPT88781.
[Accessed: 01-Nov-2014].
[81] “Title: UDA Data Sheet: Exploit All Your Data with Teradata Unified Data
Architecture™.” [Online]. Available:
http://www.teradata.com/Resources/Brochures/UDA-Data-Sheet-Exploit-All-Your-Data-with-Teradata-Unified-Data-Architecture/?LangType=1033&LangSelect=true. [Accessed:
08-Nov-2014].
[82] “Microsoft Analytics Platform System Solution Brief.” [Online]. Available:
file:///C:/Users/kencw_000/Downloads/Analytics_Platform_System_Solution_Brief.pdf.
[Accessed: 08-Nov-2014].
[83] “Find Partners | MapR.” [Online]. Available: https://www.mapr.com/partners/find-partner.
[Accessed: 09-Nov-2014].
[84] “Partners.” [Online]. Available:
http://www.cloudera.com/content/cloudera/en/partners.html. [Accessed: 09-Nov-2014].
[85] “We do Hadoop. Together.” [Online]. Available: http://hortonworks.com/partners/.
[Accessed: 09-Nov-2014].
[86] “Magic Quadrant for Cloud Infrastructure as a Service,” Gartner Group. [Online].
Available: http://www.gartner.com/technology/reprints.do?id=11UKQQA6&ct=140528&st=sb. [Accessed: 06-Dec-2014].
[87] A. T. Labs., “Hadoop Deployment Comparison Study.”
[88] “Committers - Spark - Apache Software Foundation.” [Online]. Available:
https://cwiki.apache.org/confluence/display/SPARK/Committers. [Accessed: 10-Nov-2014].
[89] “Cloudera Enterprise 5 Announced - insideBIGDATA.” [Online]. Available:
http://insidebigdata.com/2013/10/29/cloudera-enterprise-5-announced/. [Accessed: 12-Nov-2014].
[90] “Cloudera Plans Data Hub Role For Hadoop - InformationWeek.” [Online]. Available:
http://www.informationweek.com/big-data/software-platforms/cloudera-plans-data-hub-role-for-hadoop/d/d-id/1112099. [Accessed: 12-Nov-2014].
[91] J. Twentyman, “Cloudera vs. Hortonworks: Hadoop to complement or replace data
warehouse.” [Online]. Available: http://www.computerweekly.com/feature/Cloudera-v-Hortonworks-Hadoop-to-complement-replace-data-warehouse.
[92] “Cloudera Trash Talks With Enterprise Data Hub Release - InformationWeek.” [Online].
Available: http://www.informationweek.com/big-data/software-platforms/cloudera-trash-talks-with-enterprise-data-hub-release/d/d-id/1113677. [Accessed: 14-Nov-2014].
[93] “Rob ‘Flipper’ Bearden plans to FLOAT his Hadoop heffalump • The Register.” [Online].
Available:
http://www.theregister.co.uk/2013/11/21/rob_bearden_hortonworks_playbook/?page=2.
[Accessed: 20-Nov-2014].
[94] “Teradata Portfolio for Hadoop.” [Online]. Available: http://www.teradata.com/Teradata-Portfolio-for-Hadoop/?LangType=1033&LangSelect=true. [Accessed: 31-Dec-2014].
[95] “Here’s why HP invested $50M in the Hortonworks approach to Hadoop — Tech News
and Analysis.” [Online]. Available: https://gigaom.com/2014/08/02/heres-why-hp-invested-50m-in-the-hortonworks-approach-to-hadoop/. [Accessed: 31-Dec-2014].
[96] “MapR, Teradata Ink Deal, Bad Timing for Hortonworks?” [Online]. Available:
http://www.cmswire.com/cms/big-data/mapr-teradata-ink-deal-bad-timing-for-hortonworks-027253.php. [Accessed: 31-Dec-2014].
[97] K. W. Mike Olson, “Private Interview.”
[98] “Intel and Cloudera: Why we’re better together for Hadoop - TechRepublic.” [Online].
Available: http://www.techrepublic.com/blog/data-center/intelcloudera/. [Accessed: 10-Apr-2014].
[99] “Intel Validates Hadoop Market - Wikibon.” [Online]. Available:
http://wikibon.org/wiki/v/Intel_Validates_Hadoop_Market. [Accessed: 20-Nov-2014].
[100] “The Community Effect | Cloudera Engineering Blog.” [Online]. Available:
http://blog.cloudera.com/blog/2011/10/the-community-effect/. [Accessed: 24-Nov-2014].
[101] “Reality Check: Contributions to Apache Hadoop - Hortonworks.” [Online]. Available:
http://hortonworks.com/blog/reality-check-contributions-to-apache-hadoop/. [Accessed:
24-Nov-2014].
[102] “The Yahoo! Effect - Hortonworks.” [Online]. Available: http://hortonworks.com/blog/the-yahoo-effect/. [Accessed: 25-Nov-2014].
[103] “Enterprise Hadoop from Hortonworks.” [Online]. Available:
http://hortonworks.com/why-hortonworks/. [Accessed: 27-Nov-2014].
[104] “Hortonworks CEO Rob Bearden: Beware the Hadoop fragmentation | ZDNet.” [Online].
Available: http://www.zdnet.com/hortonworks-ceo-rob-bearden-beware-the-hadoop-fragmentation-7000013961/. [Accessed: 14-Nov-2014].
[105] “Built to Last: How MapR’s Business Model Supports That Goal | MapR.” [Online].
Available: https://www.mapr.com/blog/built-to-last-how-maprs-business-model-supports-that-goal#.VHavzovF_D8. [Accessed: 27-Nov-2014].
[106] “Cloudera whoops as its Hadoop loop-the-loops for cloud troupe • The Register.”
[Online]. Available:
http://www.theregister.co.uk/2013/10/28/cloudera_hadoop_cloud_partnerships/.
[Accessed: 01-Dec-2014].
[107] M. Stonebraker, “Hadoop at a Crossroads?,” Communications of the ACM, 2014. [Online].
Available: http://cacm.acm.org/blogs/blog-cacm/177467-hadoop-at-a-crossroads/fulltext#.U_-F6RqsWmc.twitter. [Accessed: 07-Oct-2014].
[108] “Analytics with Cassandra : DataStax.” [Online]. Available:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-hadoop. [Accessed: 05-Dec-2014].