Platform Leadership in Open Source Software

By Ken Chi Ho Wong
Bachelor of Science, Computing Science, Simon Fraser University, 2005

SUBMITTED TO THE SYSTEM DESIGN AND MANAGEMENT PROGRAM IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ENGINEERING AND MANAGEMENT at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2015

©2015 Ken Wong. All rights reserved. The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: Ken Wong, System Design and Management Program, February 2015
Advised by: Michael Cusumano, SMR Distinguished Professor of Management & Engineering Systems, MIT Sloan School of Management
Certified by: Patrick Hale, Director, System Design and Management Program, Massachusetts Institute of Technology

Abstract

Industry platforms in the software sector are increasingly being developed in open source. Firms seeking to position themselves as platform leaders with such technologies must find ways of operating within the unique constraints of open source development. This thesis aims to understand those challenges by analyzing the Android and Hadoop ecosystems through an augmented version of Porter’s Five Forces framework proposed by Intel’s Andrew Grove.
The analysis finds that platform contenders in open source behave differently depending on whether they regard alternative platforms or alternative providers of the same platform as their primary rivals. This focus informs key decisions that the firm makes, including how it interacts with complementors and its approach to innovation. Because open source vendors tend to lack unilateral authority over technology decisions, they can only seek to influence the behavior of the ecosystem by securing key relationships in the value network. In particular, they must secure the right engineering talent, access to key complements and superior paths to the customer. The research highlights some of the factors and tactics platform contenders in Hadoop and Android considered in acquiring these relationships. The open nature of FOSS (Free and Open Source Software) also allows new technologies to emerge and change the definition of the platform’s boundaries. This creates a further strategic challenge for open source platform contenders.

Keywords: platform strategy, platform leadership, open source software, Hadoop, Android

Thesis Supervisor: Michael Cusumano
Title: SMR Distinguished Professor of Management & Engineering Systems, MIT Sloan School of Management

Acknowledgement

This thesis was made possible by a number of individuals who generously shared their time and expertise with me. There are only a few names on the cover of this document, but the content within contains the wisdom and contributions of so many more. I would especially like to thank Professor Michael Cusumano for his guidance and advice throughout the entire journey. The breadth of his knowledge and depth of his insights on all things related to platform strategy are simultaneously humbling and inspiring.
My understanding of the Hadoop ecosystem was greatly informed by a number of enlightening conversations I’ve had with the thought leaders of that market. I am tremendously grateful to Rob Bearden (CEO of Hortonworks), Ron Kasabian (GM of Big Data at Intel) and Mike Olson (Founder and Chief Strategy Officer of Cloudera) for taking time to indulge the curiosity of a student. The case study that sits at the heart of this thesis would not have been possible without their assistance.

The time I spent at MIT was also enabled by the fantastic support I received from my colleagues in SAP’s Analytics Division. In particular, I would like to thank Jesse Calderon, Don Wakefield and Michael Reh for their sponsorship and encouragement during the past two years. Though I am no longer a part of SAP, I will take the many things I’ve learned from these leaders forward with me. The same goes for Pat Hale and the fantastic staff of the SDM program.

Finally, I would like to thank my family for their unwavering support and many sacrifices that made it possible for me to complete my studies. To my amazing wife Sharon and the active bundle of joy that she is currently carrying in her tummy: Completion of this program is made even sweeter by the knowledge that I now have more time to spend with you. Your love is a true blessing from God.

Table of Contents

Abstract
Acknowledgement
Table of Contents
Introduction
Approach and Structure
Chapter 1 – Literature Review
    Network Effects
    Product vs. Industry Platforms
    Two-Sided Markets
    Topology of Platform Roles and Openness in a Platform-Mediated Network
    Platform Leadership and the “Four Levers” Framework
        Lever 1: The Scope of the Firm
        Lever 2: Product Technology
        Lever 3: External Relationships
        Lever 4: Internal Organization
    Platform Establishment and Displacement
    Open Source Software
    Commercial Interest in Community-driven Development
    Related works on Commercial Open Source
Chapter 2 – Strategic Considerations for Open Source Leadership
    IBM and Eclipse
    The Definition of Open Source Leadership
    Google and Android
    Rivalry – Inter-network vs. Intra-network Competition
    Suppliers – Securing the Upstream Value Chain
    Complementors – Identifying and Securing Critical Complements
    Buyers – Controlling the Path to the Customer
    Substitutes and New Entrants – The Threat of Shifting Platform Boundaries
Chapter 3 – A Case Study on Hadoop
    History and Origins
    Hadoop and the Big Data Phenomenon
        The Relational Database
        Hadoop to the Rescue
    Architectural Overview
    Distributed Storage
        Job Managers and Coordinators
        Distributed Processing Frameworks
        Scripting Engines, Libraries and SQL on Hadoop
        Administration and Management
    Market Overview
    Strategic Factors affecting Platform Leadership within the Hadoop Ecosystem
        Rivalry - Inter-network vs. Intra-network Competition
        Suppliers - Securing the Upstream Value Chain
        Complementors - Identifying and Securing Critical Complements
        Buyers - Controlling the Path to the Customer
        Substitutes and New Entrants - The Threat of Shifting Platform Boundaries
Chapter 4 - Conclusion
    Areas of Further Research
Appendix
List of Figures
List of Tables
References

Introduction

For the first decade of its existence, the idea of publicly sharing source code appeared to be fundamentally incompatible with the idea of building software for profit.
Richard Stallman, the founder of the GNU Project and a pioneer of the “free and open source software” (“FOSS”) movement, framed the decision of developing proprietary versus open source software as a “stark moral choice” between individual profit and the greater good [1]. Despite subsequent attempts by a multitude of individuals (including Stallman himself) to clarify that the term ‘free software’ refers to the ability to use or modify a product freely and not to its price, profit-seeking software firms in the late 1980s and the early 1990s largely opted for proprietary development models in order to maximize appropriability.

Firms such as Microsoft, Oracle and SAP provided real-world evidence for the profitability of the proprietary model by becoming some of the most valuable companies in the world. These vendors’ extraordinary successes can be partially attributed to their ownership of proprietary industry platforms in operating systems, database management systems and applications respectively. Auto-catalyzed by powerful network effects, tremendously valuable business networks formed around the technologies provided by these vendors, and these firms leveraged their exclusive ownership of the core intellectual property to capture a disproportionately large amount of the value generated in these ecosystems. In fact, some of these firms leveraged their dominant platform positions so effectively that they were investigated for antitrust violations [2].

The success of these firms has captured the attention of academics and corporations alike, and a substantial amount of effort has been put into understanding how aspiring platform providers can replicate their successes.
As a result, concepts such as ‘enveloping’, ‘coring’ and ‘tipping’ entered the business lexicon and the strategic management of ‘platform competition’ became a core concern of vendors competing in diverse technology markets ranging from mature areas like application middleware to the nascent battlegrounds of mobile operating systems.

In many of these markets, open source technologies compete with the offerings of commercial vendors, with a prominent example being Linux in the operating system space. The success of these open source platforms has varied greatly, but as end customers and complement creators become aware of the powerful bargaining position held by proprietary platform vendors, they are seeking to increase the substitutability of the platform by embracing open source technology. This behavior is especially common in markets where the cost of multihoming (concurrently adopting more than one platform) is high. In enterprise software, some industry observers have pointed out that nearly all dominant ‘infrastructure’ technologies that emerged in the last ten years have been open source [3]. Consequently, commercial platform vendors are recognizing that the open source model is a powerful and occasionally necessary means of substantially increasing the probability that a given platform gains widespread adoption. Table 1 enumerates some recent examples of leading open source platforms in significant markets created by corporate entities with the intent of commercial value extraction.
Market                              Platform Technology (Commercial Founder)
Mobile Operating Systems            Android (Google), Sailfish OS (Jolla), Tizen (Samsung, Intel)
Cloud Platforms                     CloudStack (VMOps), OpenStack (Rackspace), Eucalyptus (Eucalyptus), SmartOS (Joyent)
Content Management                  WordPress (Automattic), Drupal (Acquia), Alfresco (Alfresco)
Data Management                     MySQL (MySQL AB), MongoDB (MongoDB), BigCouch (Cloudant), Riak (Basho), Redis (VMware, Pivotal), Impala (Cloudera), Talend (Talend)
Application Middleware / Framework  JBoss (JBoss), SpringSource (SpringSource), Zend Framework (Zend)

Table 1 - Open source platforms by commercial firms

Given that platform technologies are increasingly being developed with the open source model, a firm seeking to establish itself as a platform leader in a given space must find a way to operate within the unique constraints and operating context of open source development. Pre-existing frameworks for platform leadership management, such as Cusumano and Gawer’s “Four Levers”, were predicated on the assumption that key decisions such as the degree of architectural openness were within the platform provider’s locus of control. These assumptions are invalidated in the FOSS world, and firms seeking to orchestrate the trajectory of a given platform-based ecosystem need to find other means of exerting their influence; this research paper is motivated by that need.

Approach and Structure

This thesis is divided into four chapters. In the first chapter, existing research on platform strategy and open source business models is reviewed in order to establish the vocabulary and concepts required for analyzing the topic. Those familiar with concepts around network effects and existing literature on platform strategies are encouraged to skim through this section.
The second chapter presents a short description of “open source platform leadership” to frame the discussion that follows, along with a composite framework for understanding the strategies of open source platform vendors, inspired by Andrew Grove’s Six Forces Model [4]. This framework is illustrated by a case study of Google’s Android platform. This framework is also used to structure the case study on Apache Hadoop in the third chapter. An introduction to the history of Hadoop, its relevance to the modern technology marketplace and an overview of its architecture are then presented along with the profiles of key ecosystem players. The ecosystem is then analyzed using the framework established in Chapter 2.

The selection of Hadoop was motivated by the significant technological and economic impact of this platform technology as well as the author’s personal interest in the subject matter. The inputs into the case study analysis include secondary research data from existing works, as well as original data drawn through direct discussions with key influencers within the industry. The intent of the Hadoop case study is not to project the future of the marketplace, but rather to understand the strategies of individual vendors in order to appreciate the logic behind their behavior.

Chapter 1 – Literature Review

Network Effects

The concept of network effects originated in the telecommunication industry and was formalized in economic models in the early 1970s by Bell Labs researcher Roland Artle, Christian Averous and Jeffrey Rohlfs [5][6]. These economists identified a unique type of consumption externality that occurs in the telecommunication industry known as network effects or network externalities.
They noted that when a customer chooses to ‘consume’ a networked product by connecting to a network, that decision brings value not only to that customer but also to all the other members of the network, who were external to that consumption decision. In a paper written approximately a decade later, Michael Katz and Carl Shapiro extended the concept to industries beyond telecommunication. The essence of the concept is that the value of any given product is not always strictly a function of the product’s intrinsic quality but that there are many markets beyond telecommunication where “the utility that a given user derives from the good depends upon the number of other users who are in the same ‘network’ as is he or she.” [7].

The network referenced in the aforementioned quote does not refer only to connections between end users, but also to the connections between interdependent firms offering compatible, complementary products and services for those end consumers. For many such networks, existing consumers participating in the network do not directly benefit when new consumers join the network, but they benefit indirectly as more complementary firms are attracted to the network by these additional consumers. These new complementary firms offer additional services or capabilities that increase the value of the network, benefiting the original consumers. The illustrative example that the researchers used was the personal computer market; the more users adopt a given computer platform, the more likely software producers will develop for that platform, bringing more value to the existing users. This phenomenon is also known as increasing returns to adoption.

It is worth noting that the effect of network externality is bidirectional in that it can catalyze both adoption as well as abandonment of a platform. Users fleeing a platform reduce its value, increasing the relative attractiveness of alternative platforms and thereby accelerating defection.
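The self-reinforcing dynamic described above can be sketched numerically. The following is a minimal illustration and not a model taken from the cited literature: the function name, the linear value function and every parameter value (intrinsic value, network weight, outside option, adjustment rate) are assumptions chosen only to make the tipping behavior visible.

```python
def simulate_participation(initial_share, steps=200, intrinsic=0.2,
                           network_weight=1.0, outside_option=0.6, rate=0.3):
    """Iterate a toy adoption/abandonment loop; all parameters are illustrative assumptions."""
    share = initial_share
    history = [share]
    for _ in range(steps):
        # The value of participating grows with the share of the market already on the platform.
        attractiveness = intrinsic + network_weight * share - outside_option
        if attractiveness >= 0:
            # Adoption: some of the remaining non-participants join.
            share += rate * attractiveness * (1.0 - share)
        else:
            # Abandonment: some of the current participants leave.
            share += rate * attractiveness * share
        history.append(share)
    return history


# Two platforms that differ only in their starting share of participants.
growing = simulate_participation(0.5)
shrinking = simulate_participation(0.3)
print(round(growing[-1], 3), round(shrinking[-1], 3))
```

Under these assumed parameters the tipping point sits at a share of 0.4: a platform starting above it is pulled toward full adoption, while one starting below it keeps losing participants until it is effectively abandoned.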
Figure 1 provides a simple system dynamics model illustrating this phenomenon.

[Figure 1 diagram: Platform Participants, with Adoption and Abandonment flows, linked by two reinforcing (R) loops labeled Network Effect (+) and Network Effect (-).]

Figure 1 – A simple system dynamics model illustrating the self-reinforcing behaviors of platform adoption and abandonment due to network externalities (original creation)

Shapiro and Katz also produced a formal economic model of “network competition”, which provided a basis for understanding the competitive dynamics of markets where multiple alternative networks compete for the same customer. In the aforementioned personal computer market, the Windows and Apple ecosystems compete for essentially the same market of personal computer users. One major insight captured in Shapiro and Katz’s model was the fact that consumers base their adoption or purchase decision on the expected size of a given network and not just the current size of the network [7].

The above phenomena combine to create demand-side economies of scale, resulting in natural market equilibria where the dominant winner takes most, if not all, of industry market share [8]–[10]. This autocatalytic nature of network effects is the reason that firms compete for the position of being the provider of industry platforms.

Product vs. Industry Platforms

The term “platform” is a heavily overloaded word in the context of product development. The term is often used to reference common componentry that is used to build a portfolio of related products (or “product family”). The motivation behind the creation of such platforms is varied, but generally revolves around the idea that efficiencies can be gained by sharing the common costs, risks and benefits of development and manufacturing across multiple products.
Examples of such platforms can be found abundantly in the automobile industry, where the vast majority of vendors offer a large number of product variants based on a much smaller number of base platforms. For the purpose of disambiguation, researchers refer to this concept as a “product platform”. According to de Weck, Suh and Chang, the design of a product platform is a firm-internal optimization problem; a firm must search through the space of platform design possibilities in order to identify a design that maximizes the cost savings of component reuse, while simultaneously minimizing the compromises associated with component sharing [11].

In contrast, the search space for the design of an industry platform is far larger by definition, and the analysis of such platforms is not bounded to a single firm. Industry platforms are the technological infrastructure that allows independently evolving goods and services from different firms to be connected together into an interdependent system that creates value [12]. This thesis is directed at studying the strategies of firms attempting to develop such industry platforms and consequently all subsequent references to “platforms” are made in this vein.

Two-Sided Markets

Platform-mediated technology ecosystems are often modeled as two-sided markets. On one side of the platform sit customers (e.g. Personal Computer users), who are trying to consume the combined solution that consists of the platform (e.g. Windows) and complementary products (e.g. Application Software) offered by the suppliers residing on the other side. This model of an ecosystem allows for more precise characterization of the different types of network externalities that occur within a platform-based ecosystem, allowing scholars to differentiate between “same-side” and “cross-side” network effects. In general, “cross-side” effects are reinforcing.
The value of the platform increases for a consumer when additional complementors join the network, and vice versa. In contrast, “same-side” effects are typically reinforcing on the consumer side and balancing on the complementors’ side. Additional consumers increase the viability of the platform and therefore its value to other consumers. However, additional complement suppliers increase the level of competition within the platform and diminish its value for other suppliers. Figure 2 provides a simple system dynamics model illustrating these different forces.

[Figure 2 diagram: Potential Customers flow into Platform Customers through Adoption, and Potential Complementors flow into Platform Complementors through Complementor Platform Adoption. Reinforcing (R) loops: Same-Side Network Effect and Cross-Side Network Effect. Balancing (B) loop: Complementor Competition.]

Figure 2 - A simple system dynamics model illustrating the two different types of network effect at work in a two-sided platform (original creation)

Due to cross-side network effects, vendors on a given platform may welcome the entrance of additional competition in the form of other complementary vendors. This can occur if the entrance of additional vendors increases the viability and attractiveness of the platform and these gains sufficiently offset the effects of additional competition. This is especially true in cases where there are barriers preventing complementors from “multi-homing” on multiple platforms and the intensity of network-level competition between platforms exceeds that of individual complementors. As an illustrative example, software vendors invested in developing natively on Blackberry’s OS10 mobile operating system are likely to welcome additional vendors to develop apps for that platform, as a more vibrant app ecosystem is expected to offset additional competition that they would face within the Blackberry App Store.
Consistent with this observation, Kevin Boudreau found that an increase in the variety of software application producers within a mobile application ecosystem “increases innovation incentives” due to network effects [13]. This phenomenon echoes Shapiro and Katz’s earlier research, which showed that under network competition, a monopolist complement supplier within a given network would counter-intuitively benefit from the entry of additional complement suppliers. However, Boudreau noted that an increase in similar types of application producers actually diminishes the motivation of developers as they become “crowded out” of the market.

The two-sided model also illustrates that platform providers must find ways of attracting participants to both sides of the platform in order for the ecosystem to become viable. The study of how platform vendors manage this has been a subject of great interest. One strategy is cross-side subsidies. A two-sided platform provider may focus its monetization strategy on a single side and opt to offer “free” or heavily subsidized goods and services on the other. Van Alstyne and Parker showed that by lowering prices on one side of the network, platform providers can change the shape of the demand curve on the other side, resulting in a net increase in overall firm profits. As each “side” of the platform represents a market in its own right, this results in an interesting phenomenon where an effective monopolist in a given market may volunteer to lower its price below its marginal cost in order to maximize profits. For example, video game platform vendors like Sony or Microsoft often choose to offer their software development toolkits to video game producers for free (or close to free), despite the fact that they are effectively the only supplier for that essential ingredient to video game production [14].
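The cross-side subsidy logic can be made concrete with a toy calculation. This sketch is not Parker and Van Alstyne’s model: the linear demand curves, the unit cost on the developer side and the strength of the cross-side effect are all assumed values, and `platform_profit` is a hypothetical helper introduced only for illustration.

```python
def platform_profit(dev_price, dev_base=10.0, dev_cost=1.0,
                    con_base=10.0, cross_effect=2.0):
    """Total profit of a toy two-sided platform; every number here is an illustrative assumption."""
    # Developer side: linear demand, with an assumed unit cost for each developer served.
    developers = max(0.0, dev_base - dev_price)
    dev_profit = (dev_price - dev_cost) * developers

    # Consumer side: every developer on the platform raises consumer willingness to pay.
    con_intercept = con_base + cross_effect * developers
    con_price = con_intercept / 2.0  # profit-maximizing price for linear demand at zero cost
    consumers = con_intercept - con_price
    con_profit = con_price * consumers

    return dev_profit + con_profit


# Price the developer side as a stand-alone monopoly, ignoring the cross-side effect ...
standalone = platform_profit(dev_price=5.5)
# ... versus giving the toolkit away, which is below the assumed unit cost of 1.
subsidized = platform_profit(dev_price=0.0)
print(standalone, subsidized)
```

With these assumed numbers, giving the developer toolkit away yields a total profit of 215, against 110.5 when the developer side is priced as a stand-alone monopoly. The loss taken on developers is more than recovered through the outward shift in consumer demand.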
Conversely, price increases on one side of the network, even in a price-inelastic market, may have the counter-intuitive effect of lowering organizational profit due to their negative impact on demand on the other side; the cross-side implications of price changes in a two-sided network make pricing a complicated matter.

Topology of Platform Roles and Openness in a Platform-Mediated Network

Eisenmann, Parker and Van Alstyne identified four distinct roles that participants can play in a platform-mediated network [15]. Beyond identifying “demand-side” and “supply-side” platform users, which refer to consumers and complement providers respectively, the trio further differentiated between “platform providers” and “platform sponsors” (Figure 3). The platform provider acts as the “primary point of contact” for users on both sides of the platform while the platform sponsor is responsible for determining which parties may participate in the network. For example, banks such as Citi or Barclays act as platform providers for credit payment networks, whereas Visa itself acts as the platform sponsor. The trio asserted that platforms differ in their degree of openness to these different roles. Based on the categorization provided by this group, the “sponsor” role of Linux is occupied by the open source community and is therefore highly open. Table 2 enumerates a few select computing platforms and the openness of their various platform roles as identified by Eisenmann, Parker and Van Alstyne.

                             Linux   Windows   Macintosh   iPhone
Demand-side Platform User    Open    Open      Open        Open
Supply-side Platform User    Open    Open      Open        Closed
Platform Provider            Open    Open      Closed      Closed
Platform Sponsor             Open    Closed    Closed      Closed

Table 2 - Comparison of openness by role in platform-mediated networks. Reproduced from [15].

[Figure 3 diagram not recoverable from the source text.]

Figure 3 – Roles and Relationships in a Platform-Mediated Network according to Parker, Eisenmann and Van Alstyne (Reproduction of Figure 2) [15]. An open source platform market can be viewed as a market where the role of the platform sponsor is played by an open source community.

Platform Leadership and the “Four Levers” Framework

During the explosive growth phase of the personal computer (PC) market in the nineties, Microsoft and Intel jointly led the platform powering that market. Annabelle Gawer hypothesized in her doctoral thesis that Intel’s continued success in a highly fragmented and vertically disintegrated market stemmed from a highly evolved practice of fostering and managing the creation of complementary products in the personal computer ecosystem. This sophisticated platform management practice enabled Intel to establish itself as one of the primary beneficiaries of the growth in the PC ecosystem. She observed that in a rapidly changing technological landscape like that of the personal computer market, platform providers cannot simply seek to leverage cross-side network effects by maximizing the supply of complementary products, but must rather ensure that complementors are “innovating in ways that are favorable” to their platform. For this reason, Gawer defined platform leadership as “a firm’s ability to influence the development of a large number of complementary products by almost all other firms in their industry” [16]. This definition of platform leadership is used throughout this paper. It is worth noting that this definition is not restricted to a specific platform “role”; a platform leader can play any of the four roles identified by Eisenmann, Van Alstyne and Parker.

In 2002, Gawer further validated and elaborated on this work with her thesis supervisor Michael Cusumano, by categorizing Intel’s activities into four aspects of platform leadership management.
The pair called this framework “the Four Levers of Platform Leadership” [10]. An overview of the four levers identified is presented in the sections below.

Lever 1: The Scope of the Firm

Platform firms must continuously decide which portions of the overall system to deliver themselves and which to leave for complementary vendors in the ecosystem. This is a continuous process, as the platform vendor must keep assessing its own capabilities and the dynamics of the marketplace (including the behavior of platform competitors) and adjust its approach accordingly. As an example, Microsoft has always been willing to directly compete with its software application partners due to its immense software development capability, but had left hardware largely to its partners. However, the company made a drastic change to this approach in 2013 by acquiring Nokia’s mobile phone business for $7.2 billion USD. Understanding the motivations and decision-making process of this strategic change is beyond the scope of this paper, but it illustrates the dynamic nature of firm-scope management.

Lever 2: Product Technology

A platform leader’s decisions regarding the design of its technology’s architecture, interfaces and intellectual property management significantly affect the nature of innovation that participants of its ecosystem are able to contribute. From the perspective of innovation enablement, a modular architecture enables contributions from complementors and is generally preferred over “integrated” architectures with low substitutability of components. However, platform firms must determine the openness of their platform interfaces, balancing the competitive advantages offered by exclusive proprietary access to ‘core’ platform functionality with the need to encourage complementors. A similar balancing act occurs with regard to the management of intellectual property.
Generally speaking, the more open a platform leader is with its intellectual property, the more vibrant its ecosystem. As with the previous lever, the management of product technology is a continuous process, though it is worth noting that it is typically more difficult for a firm to restrict a previously open policy than to open up a restrictive one.

Lever 3: External Relationships

Beyond making internal decisions regarding the scope and nature of its technologies, platform leaders must also orchestrate the actions of complementary vendors in a manner that is favorable to the platform. Gawer and Cusumano found that Intel was especially mature in this aspect of platform leadership, acquiring organizational capabilities and making substantial investments to build consensus around, and control over, platform decision making, as well as encouraging the right balance of collaboration and competition between complementors. As an example, Intel shared with Gawer that it employs a distinctive strategy when attempting to drive the definition of new interfaces for the PC platform. It attempts to create “momentum” behind new interface standards by initiating the design process with a small interest group of the most influential players within the ecosystem before involving the larger ecosystem. This approach helps to relieve the “design by committee” challenges of completely open and democratic processes while maintaining the benefits of external contribution and validation.

Lever 4: Internal Organization

The final lever Cusumano and Gawer identified pertains to the internal organizational structure and processes that a platform leader puts in place to manage the inherent tension between collaboration and competition. At Intel, this began with the identification and differentiation of the competing objectives of the company.
At Intel, “Job 1” refers to the core organizational objective of selling more microprocessors, “Job 2” refers to the desire to compete directly in complementary businesses, and “Job 3” refers to the task of growing new lines of business that may not be directly related to the core microprocessor business. By acknowledging the inherent conflicts that these objectives create and by providing a vocabulary for discussing them, Intel enables its management team to manage this tension proactively and consciously. Beyond this, Intel also created organizational groups dedicated to these different objectives. This not only focused the internal groups but also created a level of organizational separation that alleviates the conflicts of interest that external partners perceive.

Platform Establishment and Displacement

While Gawer and Cusumano focused their studies on the activities required to manage and sustain the position of a platform leader, others focused their attention on the process of establishing and displacing industry platforms. The early works of Rohlfs had established that even potentially viable networks will naturally be attracted to a stable equilibrium of zero participants unless a critical mass of participants is reached [5]. Evans and Schmalensee extended the model to two-sided platform businesses and illustrated the challenge of reaching the critical threshold on both sides of the platform. Depending on whether it is easier for potential participants on a given side to join the platform or for existing participants to drop off, firms aiming to launch two-sided platforms may find that they need to subsidize participation on both sides of their network in order to reach critical mass. The pair also found that design decisions that reduce the resistance to participation on both sides of the network not only lower the critical mass required, but also increase the equilibrium adoption of the platform once established.
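The critical-mass dynamic described above can be illustrated with a toy simulation. The sketch below is not the model used by Rohlfs or by Evans and Schmalensee; the response curve `willing_share`, the function names and the starting values are all assumptions, chosen only so that the unstable threshold sits at 50% participation and the two stable equilibria sit at 0% and 100%.

```python
def willing_share(expected: float) -> float:
    """Hypothetical S-shaped response curve (an assumption, not from the
    cited papers): the share of potential users willing to participate
    when they expect `expected` (0..1) of the market to participate.
    Its fixed points are 0, 0.5 (the critical mass) and 1."""
    return 3 * expected**2 - 2 * expected**3


def simulate(initial_share: float, periods: int = 50) -> float:
    """Repeatedly update participation using last period's share as the
    expectation, and return the share reached after `periods` steps."""
    share = initial_share
    for _ in range(periods):
        share = willing_share(share)
    return share


if __name__ == "__main__":
    # Start just below, exactly at, and just above the critical mass.
    for start in (0.40, 0.50, 0.60):
        print(f"start={start:.2f} -> long-run share={simulate(start):.3f}")
```

Under these assumptions, a network that starts below the threshold unravels toward zero participants, while one that starts above it grows toward full adoption. A subsidy can be read as a way of lifting the starting share past the threshold.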
The topic of platform displacement closely relates to the research on architectural innovation and its impact on incumbent leaders completed by Henderson and Clark [17]. Prior to their research, the prevailing understanding was that while all innovations create opportunities for new entrants to enter the market and displace the incumbent, only ‘radical’ innovations that eradicated the technological competencies of incumbents create conditions favoring the new entrant. In studying the semiconductor photolithographic alignment equipment industry, Henderson and Clark observed that evaluating the disruptive nature of an innovation based on the degree of technological change was inadequate, as there were numerous occasions where seemingly incremental technological changes resulted in the displacement of industry leaders. Instead, Henderson and Clark found that architectural innovations (innovations that changed the manner in which the components of a product system are connected together) tended to be significantly more disruptive to incumbent firms than technological innovations at the component level, regardless of the degree of ‘radicalness’. The pair hypothesized that architectural innovations tend to be more disruptive to the established firm because architectural knowledge tends to be captured in a firm’s “structure and information-processing processes”, which makes it difficult for a successful firm to recognize and address. Similarly, platform displacement is also more likely when an innovation allows for a reconfiguration of how the participants interact with one another.

An alternative path to platform displacement identified by Eisenmann, Parker and Van Alstyne is the strategy of platform envelopment. Envelopment is interesting as it “provides a mechanism for platform leadership change that does not require breakthrough innovation or Schumpeterian creative destruction” [18].
The authors found that while network effects make it difficult for a new platform entrant to displace an established platform, the incumbent may be displaced through envelopment when the capabilities of its platform become bundled as a part of an enveloping platform serving an adjacent market. In other words, the new entrant or attacker can expand the scope of the competition and exploit economies of scope and scale on the demand or supply side in order to create leverage for displacing an incumbent vendor. Eisenmann, Parker and Van Alstyne cite the example of Microsoft’s successful “attack” on RealNetworks’ dominant media streaming platform in the late 1990s by bundling streaming services into its Windows NT server offering. The authors established a taxonomy for envelopment attacks consisting of three major categories: conglomeration, intermodal, and foreclosure (Table 3). In reaction to such attacks, platform leaders can seek to match the new bundle, pursue legal protection, or exit the market if they cannot match the new entrant on the new basis of competition.

Attack Type: Description and Example

Conglomeration: The attacker joins functionally unrelated platforms together to create a new bundle in order to leverage economies of scope and scale to its competitive advantage, e.g. cable television and telephone firms reciprocally bundling TV, phone and internet services to attack each other.

Intermodal: The attacker bundles two weak substitutes that deliver the same functionality using different modalities, offering a single composite platform that negates the user’s need to choose between different modes, e.g. Netflix bundling DVD-by-mail and streaming delivery as a single product.

Foreclosure: The attacker bundles two complementary capabilities together to create a synergistic offering that is superior to the individual products, e.g. LinkedIn’s bundling of social networking and job matching into a single platform.
Table 3 - Taxonomy of Envelopment Attacks

Open Source Software

Most scholars trace the origins of open source to the Free Software Movement started by Richard Stallman when he launched the GNU Project [1]. Responding to a perceived increase in the limitations imposed by proprietary software vendors, the movement sought to restore the four “essential freedoms” of computer users through the creation of “free software”. According to Stallman, these freedoms are:

0. The freedom to run the program as you wish, for any purpose.
1. The freedom to study how the program works, and change it so it does your computing as you wish. Access to the source code is a precondition for this.
2. The freedom to redistribute copies so you can help your neighbor.
3. The freedom to distribute copies of your modified versions to others. By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

Early in the movement’s history, the development of such software was primarily of interest to academics and sophisticated individuals who felt that their freedoms had been infringed. Despite Stallman’s repeated reminders to think of “free” as in “free speech,” not as in “free beer,” few commercial enterprises engaged in the creation of “free” software. It was widely perceived that the act of sharing intellectual property was counterproductive to the objective of profit extraction. In his widely cited paper “Profiting from Innovation”, David Teece argues that innovators are best positioned to capture the value of their inventions if their intellectual property is legally protected or “the nature of the product is such that trade secrets effectively deny imitators access to the relevant knowledge” [19]; this condition is known as a tight appropriability regime.
The act of sharing source code increases access to potentially differentiating knowledge, and therefore reduces appropriability. Given that the majority of profit-seeking software firms view themselves as innovators trying to capitalize on their creations, such firms tended to view the “free software” world with skepticism.

Commercial Interest in Community-driven Development

Commercial interest in community-driven development began to shift in the mid-nineties with the emergence of Linux. University of Helsinki student Linus Torvalds began developing Linux in 1991 as a personal project to maximize the capabilities of the specific hardware he owned at the time. The popularity of the project exploded shortly after Torvalds shared his work and adopted the GNU General Public License in 1992 (the original license that shipped with Linux explicitly forbade commercialization). Today, Linux not only powers the personal computers of dedicated hobbyists like Torvalds, but also competes successfully against proprietary commercial systems in devices ranging from mobile handhelds to the most powerful supercomputers in the world (Figure 4).

[Figure 4 panels: Worldwide Server Operating Environments, 2012; Mobile Operating Environment Shipments, 2012; OS of the World's Top 500 Supercomputers, 2014]

Figure 4 - Linux (in orange) market share in various computing segments [20], [21]

Commercial interest in Linux and open source was initially motivated by the surprising quality and effectiveness of the community-driven development model. As DiBona, Ockman and Stone observed, the Linux volunteer community “produced a piece of software that would otherwise require the might and resources of someone like Microsoft to create” [22]. Aided by the rapid advance in internet collaboration technologies, the open source model managed to reconfigure the requirements of production so substantially that it eliminated what had previously been regarded as a natural monopoly.
In his popular 1999 essay “The Cathedral and the Bazaar”, Eric Raymond observed that the development style employed in the development of Linux was unique even within the open source community and hypothesized that this development style was a major reason for the project’s success. Chief amongst Raymond’s findings was that Linus Torvalds focused his attention on keeping his contributors engaged and worked to ensure constant activity within the community (i.e. operating a bazaar) rather than enforcing consistency and architectural elegance (i.e. building a cathedral). Moreover, Raymond asserted that the project’s frequent release cycle and large, active user base assured the project’s quality more effectively than traditional software engineering practices. Raymond attributed much of Linux’s success to the rapid progress enabled by this “Bazaar” model and attempted to validate the effectiveness of this approach by consciously developing his own project in the same manner.

Raymond’s findings influenced Netscape to release the code for its then-popular web browser Communicator under an open source license, with the hope of leveraging the development capabilities of the open source community as a competitive advantage over Microsoft. Unfortunately, Netscape’s effort failed to garner sufficient attention from the community, and the company was acquired by AOL and later disbanded. Raymond attributed Netscape’s failure to engage the community to its lackluster efforts in removing the barriers to entry for contributors. As an example, contributors needed a license for a third-party UI library (Motif) just to work on the product during its first year, which created a significant barrier to participation.
While Netscape failed to leverage the open source community as an engine for commercial success, the challenges of its initial open source project led to the creation of the Mozilla project, which did manage to attract significant community attention and resulted in the popular browser Mozilla Firefox.

Despite the critical role of the community development model in advancing a number of foundational technologies in the modern technology landscape, commercial interest in free and open source development remained modest until the emergence of the Open Source Initiative (OSI) and the highly publicized growth of Red Hat, Inc. in the late 1990s. Up until the formation of the OSI, the commercial software development world did not distinguish between the moral and philosophical ideals of “Free Software” and the more pragmatic motivations behind community-driven development. Founded by Eric Raymond and Bruce Perens in late February 1998, the Open Source Initiative was created shortly after Netscape announced, in January of that year, its plan to release its proprietary source code. The pair wanted to use the highly publicized event to advocate for “the superiority of an open development model” [23]. The term “open source” was coined at that time in order to avoid “the philosophically- and politically-focused label of ‘free software’” [23]. The OSI created a pragmatic definition of “open source” software free of the constraints and judgmental ideology of the Free Software Foundation, instead focusing on practically capturing the requirements of licenses that should be considered “open” (Table 4). In particular, the definition admits licenses that permit proprietary, “paid” derivatives of open source software, which copyleft “Free Software” licenses such as the GPL do not allow.

Criteria: Description

Free Redistribution: The license must allow for free or paid redistribution of the software without royalties or other fees.

Source Code: Source code for the software must be reasonably available and the license must allow for its redistribution.

Derived Works: “The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.”

Integrity of the Author’s Source Code: “The license must explicitly permit distribution of software built from modified source code.”

No Discrimination Against Persons or Groups: “The license must not discriminate against any person or group of persons.”

No Discrimination Against Fields of Endeavor: “The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.”

Distribution of License: “The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.”

License Must Not Be Specific to a Product: “The rights attached to the program must not depend on the program's being part of a particular software distribution.”

License Must Not Restrict Other Software: “The license must not place restrictions on other software that is distributed along with the licensed software.”

License Must Be Technology-Neutral: “No provision of the license may be predicated on any individual technology or style of interface.”

Table 4 – A description of the ten criteria of open source software as defined by the Open Source Initiative. Modified and adapted from http://opensource.org/osd

This “rebranding” effort and the pragmatic, business-case driven approach of the OSI can partially be credited for the increased interest in commercial open source development over the past decade, though the success of Linux (and Red Hat in particular) likely also played a pivotal role.
The study of open source business models within the research community has correspondingly increased.

Related works on Commercial Open Source

In his 2005 paper, Sandeep Krishnamurthy attempted to categorize the business models of open source firms into four distinct categories: (1) software distributors, (2) software producers following the GPL model (firms leveraging open source components to create derivative products that were also open source), (3) software producers not following the GPL model (firms leveraging open source components to create derivative products that were proprietary), and (4) third-party service providers [24]. Krishnamurthy further summarized that the primary appeal of open source products to corporations stems from the perception of superior performance, lowered adoption risk and lower total cost of ownership. Finally, he identified community support, the presence of proprietary or open source competition, relative competitiveness and marketing as the key factors affecting the profitability of open source firms.

In a Research Policy article on the topic of “melding proprietary and open source platform strategies”, Joel West analyzed three major software platform vendors’ explorations of community-driven development in order to understand the strategies and motivations for participation in open source [25]. West observed that the modern computer industry evolved from a vertically integrated market dominated by vendors who held end-to-end control of the “stack” to a market composed of horizontally dominant platform firms, exemplified by Microsoft and Intel. Focusing on the operating system as the platform of study, he observed that industry interest in open source was motivated by a desire of the leading contenders in the industry to challenge Microsoft’s dominant Windows platform.
West chronicled the differing efforts of IBM, Apple and Sun Microsystems as they engaged the open source movement and identified that their participation in open source was motivated by different intentions. Table 5 summarizes West’s discussion of the motivations of these different vendors.

Vendor (Open Source Projects): Intentions

Apple (FreeBSD, OpenDarwin): Apple’s primary intention for participating in open source was to leverage and reuse some of the market-leading components that were being built in the open source community (in particular the FreeBSD project).

IBM (Apache, Eclipse, Linux): As a part of shifting its business focus towards applications and application platforms, IBM tried to reduce the control that Microsoft held as the platform leader by pushing the computing industry towards open standards while positioning itself as the leading integrator of technology. Its involvement in projects like Apache and Eclipse was a means of accelerating the development of its proprietary products.

Sun Microsystems (Java, OpenOffice, Linux): Sun’s motivation for engaging in open source was primarily to leverage the “horsepower” of the open source community to accelerate the development of alternate platforms to challenge Microsoft’s leadership in the markets for application frameworks (i.e. .NET vs. Java) and office productivity (i.e. Office vs. OpenOffice).

Table 5 - Apple, IBM and Sun Microsystems' involvement in open source and their motivations for participating. Summarized from the contents of [25].

West summarizes that there are two broad approaches to blending proprietary and open source strategies. Firstly, a firm can choose to concede the more “commoditized” layers of the platform to the open source community in order to focus its investments on differentiating layers.
Apple’s decision to adopt a variant of the open source FreeBSD kernel for its operating
system and its continued development of a unique user interface shell exemplify this approach. The second approach is to disclose technologies and intellectual property to promote indirect network effects. Both IBM’s and Sun’s experiments with open source reflect this latter approach.

West’s observations regarding IBM’s motivations behind its open source strategy are supported by Capek, Frank et al. in their 2005 article in the firm’s own IBM Systems Journal entitled “A history of IBM’s open source involvement and strategy” [26]. The paper, written by IBM employees who were involved in the various efforts, highlighted IBM’s recognition that the open source movement was a business reality within its industry and recalled its efforts to harness the movement for its own strategic intentions. The authors summarize IBM’s strategic intentions for involvement in open source as: (1) encouraging the use of “open source implementations of open standard”, (2) fostering greater variety and choice and (3) enhancing IBM’s mindshare.

Nicholas Economides and Evangelos Katsamakas created economic models of the competition between proprietary and open source platforms in their 2006 article in Management Science [27]. The pair observed that a proprietary platform vendor in a multi-sided market could find it profitable to set a price below its marginal cost on one side in order to maximize profits across all sides. The models found that markets supported by proprietary platforms are more profitable than those supported by open source platforms, though the variety of available complements is generally higher on open source platforms. However, this finding rests on the assumption that open source platform profits (the profits derived from selling the platform itself) are always zero.
Chapter 2 – Strategic Considerations for Open Source Leadership

What does it mean to be an “open source platform leader”? A naïve definition would be a literal reading of the phrase’s two halves: a “platform leader” that leverages “open source” technologies. However, the use of open source technology is so prevalent today that there are scarcely any commercial software firms that do not leverage open source in some form, and consequently this definition is too broad to be useful. Even Microsoft, the canonical proprietary platform vendor, utilized open source libraries in the delivery of its Windows NT operating system [28]. Similarly, Cisco Systems, which is generally considered a “platform leader” by scholars like Cusumano and Gawer, delivers a variety of hardware devices that run embedded versions of Linux. Clearly, the moniker of “open source platform leader” is ill-fitting for firms that epitomize proprietary platform leadership.

A more restrictive definition of the term “open source platform leader” would be “a leading provider of open source platforms”. In other words, candidacy for “open source platform leadership” is restricted to platform providers who specifically create open source products. This definition roughly aligns with Krishnamurthy’s second category of commercial software vendors, which he called “software producers following the GPL model”, broadened to include other types of open source licenses, but with the additional constraint that the software be a platform product. At first glance, this appears to be a more appropriate definition, as it would remove counterexamples such as Microsoft and Cisco from candidacy without disqualifying obvious candidates like Red Hat. However, this definition is actually too narrow, as it excludes many firms that opt to utilize open source strategically in order to drive the adoption of their platform products, even proprietary ones.
One excellent example of such a firm is IBM and its creation of the Eclipse open source project in the software application platform space.

IBM and Eclipse

In 2001, IBM open sourced its Eclipse technology and founded the Eclipse consortium in order to drive the adoption of the Java-based application platform and integrated development environment (IDE) as an alternative to Microsoft’s .NET and Visual Studio stack [29]. IBM had used the resources obtained through its acquisition of Object Technology International (OTI) to develop an internal product platform that improved the efficiency and consistency of its own application development. It then decided to leverage OTI’s technology to drive the adoption of its WebSphere suite of application platform technologies. Recognizing that it was a latecomer to the application platform market, competing against a powerful incumbent with an established ecosystem in Microsoft, IBM theorized that “in order to build momentum around [Eclipse] and to get more vendors to build their products on top of it, [it] had to make it open source”.

IBM’s hypothesis appeared to be correct, as Eclipse became one of the most popular open source projects in the world and established itself as the dominant player within the market for Java integrated development environments. Only four years after IBM made Eclipse open source, a survey conducted by the publishers of SD Times (a popular software development trade magazine at the time) found that approximately two-thirds of its readers utilized Eclipse in their workplace (Figure 5), followed by IBM’s proprietary WebSphere Studio Application Developer (WSAD) at 21%. Eclipse’s substantial user base and vibrant ecosystem of complementary vendors provided a significant competitive advantage to WSAD.

[Figure 5 chart: vertical axis 0% to 70%; horizontal axis Jan-02 to Jan-05; series: Eclipse, IBM WebSphere Studio App. Developer*, Borland JBuilder, Sun NetBeans, Oracle JDeveloper, JetBrains IntelliJ IDEA*, BEA WebLogic Workshop, Microsoft Visual J++]

Figure 5 - Results from the "Java Use and Awareness Study" from BZ Research, 2005

The specific license that IBM put in place when establishing Eclipse was a derivative of IBM’s Common Public License. The license ensured that IBM was able to commercialize Eclipse technology without having to release the derivative product as open source. As a consequence, IBM was able to leverage Eclipse’s success to bootstrap the ecosystem of its own proprietary WSAD platform product, which provided additional functionality while being fully compatible with complements produced by the Eclipse ecosystem.

Although IBM spun off the responsibility of managing Eclipse to an independent nonprofit organization, the Eclipse Foundation, it continues to be a major force driving the evolution of the Eclipse platform. This is evident in the fact that IBM continues to be the single largest contributor to the project (Figure 6). Given IBM’s ability to influence the activities of others to enhance its own offering, it clearly exhibits the qualities of a “platform leader” in this context. Moreover, IBM’s strategic and intentional use of the open source model as a means of bootstrapping a proxy ecosystem for its own proprietary offering suggests that it should be considered an open source platform leader.

[Figure 6 chart: Eclipse Project Committers by Company; IBM 32%; other labeled companies: Oracle, itemis AG]

Figure 6 - Eclipse Project Committers by Company (excluding committers without corporate affiliations or with unknown affiliations).
Taken from http://dash.eclipse.org on August 4th, 2014

The Definition of Open Source Leadership

IBM’s utilization of the open source community to accelerate the adoption of its platform technology against an incumbent platform rival is not a particularly unique or novel tactic. As mentioned in the introduction to this paper, Google’s Android, Samsung’s Tizen and Nokia’s Maemo are all open source mobile operating system efforts that sought to displace Apple’s dominant iOS platform. It is unlikely that these commercial firms opted to open source their platform technology out of altruism. Rather, they adopted the open source model as a strategy intended to accelerate the development of a critical mass of users and complementors in markets marked by network-level competition. However, IBM’s ability to use open source to improve the competitiveness of its proprietary software product demonstrates that open source platform leadership is not limited to a specific type of software license or even business model. It is entirely possible for a proprietary software vendor or even a service provider to be an “open source platform leader”.

As mentioned in the literature review, a “platform leader” as defined by Cusumano and Gawer is a firm that is able to influence the activities of other industry participants in order to create complementary products and solutions that enhance its offering. This definition has broad applicability, as the leader can play any number of roles within the ecosystem so long as it “drive[s] industry wide innovation for an evolving system of separately developed pieces of technology” [30]. Building upon this definition, “open source platform leadership” is therefore best described as “a firm’s ability to influence the development of a large number of complementary products” through engagement in open source.
The unique characteristic of open source platform leaders is that they participate in open source development with the specific purpose and intention of gaining a platform advantage.

While the use of an open source model may help a platform vendor accelerate the widespread adoption of its platform, the model also comes with its own unique set of challenges. These challenges are systematically analyzed in the sections to follow in order to provide a holistic framework for understanding the strategic considerations for open source leadership.

One could argue that it is more important for open source platform leaders to be forward-looking in the management of their platforms than it is for their proprietary counterparts. The additional involvement of community contributors and the irreversible nature of “open sourcing” intellectual property mean that decisions that are relatively easy for proprietary firms to make require more planning and lead time for an open source platform leader to effect. For example, despite IBM’s significant investments and participation in Eclipse, changes to the core platform of Eclipse are governed by the independent Eclipse Foundation (which IBM helped establish), which consists of members of the Eclipse ecosystem, many of whom compete with IBM in different markets. This governance structure significantly increases the friction and latency of manipulating Lever 2 (product technology), and consequently IBM must be more proactive and forward-looking if it wishes to deploy that lever effectively.

In order for a platform leader to proactively manage its platform strategy, it must have a holistic understanding of the forces that shape the dynamics of competition in its given market and establish a strategy to manage these forces. The formulation of such a strategy requires a hypothesis on how these forces will shift as the platform evolves over time.
A traditional analysis framework used for understanding the forces that affect a given market is Porter’s Five Forces model of industry analysis (Figure 7) [31].

Figure 7 - A reproduction of Porter's Five Forces Model. The horizontal forces represent the critical factors arising from the value chain of the market, while the vertical forces and the center circle represent competitive forces.

While Porter’s model provides a useful outline for analysis, it was created with the intention of analyzing the overall attractiveness of any given industry and does not specifically consider the unique dynamics of platform-driven markets. In particular, Porter did not consider the critical role that platform complementors play in affecting the competitive balance within a platform market. To address this, industry practitioners such as Intel’s Andrew Grove have augmented Porter’s framework with an additional factor capturing the influence of partners (Figure 8).

Figure 8 - Six Forces Diagram, taken from Only the Paranoid Survive [4]

Grove’s variant serves as useful scaffolding for discussing the different factors that open source platform leaders need to understand and manage. In the sections to follow, candidate factors are identified and reasoned through, structured by the categorization provided by Grove’s Six Forces model. For the purpose of this discussion, the considerations brought on by the emergence of new entrants and substitutes are evaluated together. Table 6 presents a summary of these different considerations. An overview of Google’s Android project precedes this discussion to serve as a lighthouse reference for describing these factors. The relevance of the identified factors to the actual behavior of aspiring open source platform leaders is further validated in the case study on Hadoop in the chapter to follow.
Chapter 2 – Strategic Considerations for Open Source Leadership

Consideration | Description
Rivalry | The relative intensity of inter-network and intra-network competition shapes and governs the behavior of the open source platform vendor. Vendors must continually adjust their behavior as this will change over time.
Suppliers | Qualified engineering talent is the primary constraining resource that an open source software vendor requires. A platform contender must understand the specific organization structure of the open source community in order to access the right talent.
Complementors | Complementors in the open source world can come in the form of commercial allies as well as community contributors. Vendors must form a hypothesis on which complements are key in order to secure superior or exclusive access.
Buyers | By establishing a clear understanding of the purchasing process of the platform, platform contenders can establish superior or exclusive access to key intermediaries to secure a competitive advantage in intra-platform competition. They can also develop a platform advantage by injecting themselves into the purchasing process of complements in order to exert greater influence over the operations of the ecosystem.
New Entrants and Substitutes | The fact that open source platform vendors do not possess exclusive authority to define how the technology is packaged and reused means that alternative modes of platform consumption can emerge from unexpected sources. Platform boundaries can shift without the vendor’s involvement. As a consequence, emergent threats in the form of new direct competition or substitutes are arguably more prevalent in open source businesses.

Table 6 - Summary of Strategic Considerations for Open Source Platform Vendors

Google and Android

Android is a Linux-based open source mobile operating system created by Google using the assets it acquired when the search giant purchased Android Inc.
in 2005. Although the ‘core’ aspects of Android are open sourced to the community under the Android Open Source Project (AOSP), Google has been and continues to be the primary engineering force behind the continued evolution of the Android software platform. The firm deploys its considerable engineering resources to work on the ‘next version’ in private before releasing the source to the community. The software typically makes its way into the hands of customers through devices created by hardware device partners such as HTC, LG and Samsung. Google collaborates with these partners through an industry alliance known as the Open Handset Alliance (OHA). Participation in the OHA provides partners with unique access to Google’s resources and is generally perceived as a requirement for gaining the license to deliver Google Mobile Services (GMS). Google Mobile Services are complementary (but proprietary) services and components that greatly enhance the value of the system, including applications like Gmail, Google Now, Google Calendar and the Google Play Store.

Although many hardware device partners opt to heavily alter or ‘enhance’ the versions of Android that ship with their devices to create a unique experience that differentiates their offerings to end users, the vast majority of these changes are cosmetic in nature and do not fundamentally change the definition of the platform. Moreover, hardware partners who are members of the OHA are contractually prevented from creating “forked” or “derived” versions of the platform, and instead collaborate with Google and others on the continued evolution of Android [32]. In other words, while Google cedes some control of Android’s interface from the perspective of end users to its hardware partners, this arrangement allows Google to remain the definitive authority over the platform’s evolution from the perspective of software complementors (Figure 9).
At a conceptual level, this structure is not vastly different from how Microsoft operated its Windows franchise over the years. However, the fact that Android is open source means that Google’s role in defining and providing the platform is displaceable. A firm with sufficient engineering resources and the ability to deliver complementary services can theoretically displace Google entirely and propose a different design trajectory for the platform.

Figure 9 - The Android Platform and the roles of Google, hardware partners and other complementors

This theoretical scenario unfolded when online retailer Amazon released the Kindle Fire in 2011. Amazon elected not to collaborate with Google as a participant in the Open Handset Alliance, but rather to create its own variant of the system based on what was available through the Android Open Source Project. The Kindle Fire ran a derived or “forked” version of Google’s Android that was rebranded “Fire OS” in subsequent releases. Fire OS was largely compatible with applications built for the version of Android from which it was derived, but Amazon replaced all of Google’s cloud and content services with alternatives from Amazon and its partners. Amazon even provided an alternative “App Market” to connect users to applications, offering its own Digital Rights Management (DRM) and payment infrastructure for software vendors in the Android ecosystem. By choosing not to participate in the Open Handset Alliance and “forking” its own version of the platform, Amazon put itself in a position where it can theoretically choose to evolve Fire OS independently of Google’s influence.

Since the release of the Kindle Fire, a number of companies have followed Amazon’s path of creating Android-derived platforms without participating in the OHA.
While the majority of these firms do not have the engineering resources that would allow them to realistically challenge Google’s dominion over the architectural trajectory of the Android platform, a number have a sizable presence in specific geographic markets such as China and are more than capable of displacing Google as the de facto provider of complementary services within their markets. It is also interesting to note that even Microsoft has gotten into the game, forking Android as a solution for developing markets by way of its Nokia acquisition [33]. While the actions of these vendors may have actually contributed to the Android platform’s dominant market share in the mobile platform space, they have clearly been detrimental to Google’s ability to benefit from that dominance.

Google’s management of Android is a useful reference for discussing the different factors affecting open source platform strategy. Although the structure of the ecosystem bound together by mobile platforms closely resembles that of the personal computing industry with which many are already familiar, the outcome has been quite different. In particular, the battle for the mobile industry is interesting in that a previously dominant incumbent platform leader (Apple) has been successfully challenged by a new entrant (Google) that has opted to release its platform as open source. Google’s changing behavior as this unfolds is a useful illustration of the dynamic nature of open source platform strategy.

Company | Platform | Description
AliCloud (China) | Yun OS | According to Alibaba (AliCloud’s parent company), the Yun OS is a Linux-based operating system that utilizes components and tools from the Android Open Source Project to deliver Android app compatibility.
Amazon | Fire OS | Fire OS features an optimized UI for consuming Amazon’s content and services. Application Programming Interfaces (APIs) have also been extended to promote the unique capabilities of Amazon’s hardware.
Baidu (China) | Baidu Yi | Yi OS displaces Google’s GMS services with Baidu’s implementations.
Microsoft | Nokia X | Nokia X re-skins Android with a look and feel approximating Microsoft’s Windows platform and replaces Google’s services with Microsoft’s own. It was originally conceived as a low-cost solution for developing markets.

Table 7 – AOSP-derived products by Google competitors [34]–[36]

Rivalry – Inter-network vs. Intra-network Competition

Figure 10 - The competitive threat to a proprietary software platform vendor comes in the form of alternative platforms; open source platform vendors must additionally contend with alternate providers, including the community, for their specific platform.

Platform leaders typically possess unique knowledge of the technologies that serve as the technical foundation for their ecosystems; the extent to which a leader shares this knowledge is one of the decisions that the firm can take (“Lever 3 – Relationship with Complementors”). For a proprietary software platform vendor, this unique knowledge is often encapsulated in the proprietary intellectual property that source code represents. Leveraging this unique asset, the firm is able to act as an effective monopoly within the sub-markets that its network participants represent, as no competitor is capable of displacing its dual role as platform provider and sponsor. Competition comes exclusively in the form of alternative platforms, or “inter-network” competition. For example, as the exclusive provider of iOS, Apple Inc. does not need to worry about another firm supplanting its role as the dominant distributor of iOS applications to customers. It is also secure in its position as the dominant provider of development tools to iOS application developers (complementors in this ecosystem).
Apple’s competitive concerns stem purely from alternative ecosystems and the possibility of customers or application vendors abandoning the iOS platform for alternatives such as Microsoft’s Windows or Google’s Android platforms. In other words, the dominant strategic concern for proprietary platform vendors is the establishment and sustenance of the platform itself as an industry standard.

Compared to their proprietary counterparts, open source platform vendors face an additional dimension of complexity affecting their competitive strategy. Beyond the challenge of establishing their platform as the dominant industry standard, open source vendors must additionally work to establish themselves as the primary providers of that standard. This challenge is clearly evident in Google’s Android ecosystem. Like Apple, Google strives to establish Android as the dominant platform against alternatives like iOS and Windows Phone in the mobile computing space. However, due to its decision to open source the development of Android, Google additionally faces competition in its role as the provider of platform technologies to users and complementors within the Android ecosystem, as exemplified by its struggles with “platform wannabes” such as Amazon and Alibaba.

This struggle illustrates a fundamental tension that an open source platform leader faces: balancing the occasionally conflicting needs of inter-network competition with those of intra-network competition (Figure 10). The relative intensity of these two types of competition waxes and wanes over the course of platform evolution, and the open source platform vendor is likely to find itself adjusting its position on the “Four Levers of Platform Leadership” as a consequence.
In order to ensure that the Android ecosystem would attract the maximum number of software and hardware complementors away from the incumbent leader, Google took pains at the inception of the Android platform to ensure that it was architected in an open and modular manner (Lever 2). It collaborated openly with hardware partners, software vendors and the open source community (Lever 3). As Android has established itself as the de facto leader within the mobile space (nearly 85% of all smartphones shipped in Q2 2014 were Android-based [37]), Google’s primary strategic concern has arguably shifted from winning against competitive platforms to sustaining its position as the primary beneficiary of Android’s success. While Google’s decisions to share its technology have clearly contributed to Android’s dramatic growth in the marketplace, they have also eroded the unique competitive advantages of Google as the Android platform provider.

As Google’s focus shifts away from alternative platforms to alternative providers of the Android platform, its behavior changes correspondingly. One shift that Google has been slowly making pertains to its decisions around the functionality that is delivered as a part of the open source “core” as opposed to the proprietary services and extensions that it exclusively offers (Lever 2). In October of 2013, Ron Amadeo of Ars Technica outlined the various pieces of functionality that Google now delivers through proprietary channels but had previously included as part of core Android [38]. Much of this functionality was aided by Google’s unique proprietary cloud capabilities (such as Gmail or enhanced search), and therefore Google’s decision to encapsulate it as proprietary extensions can be justified on a technical basis.
However, the decision to deliver self-contained enhancements, such as those to the basic keyboard, through proprietary channels appears to be a deliberate move intended to further differentiate the capabilities of Google’s Android from alternatives offered by the community or competing providers.

Capability | Open Source Version | Proprietary (Date Introduced)
Search | AOSP Search | Google Search (August 2010)
Music Player | AOSP Music | Google Play Music (May 2010)
Calendar | Calendar | Google Calendar (October 2012)
Keyboard | AOSP Keyboard | Google Keyboard (June 2013)
Camera | AOSP Camera | Google Camera (April 2014)
Messaging | AOSP Messaging | Google Hangouts (May 2013)

Table 8 - Google's shift of investment into proprietary capabilities. Content adapted from Ars Technica [38].

Google’s approach to interacting with external partners has also shifted as it seeks to leverage its relationships in order to lock out other Android platform contenders (Lever 3). Recognizing that its contributions to AOSP do not provide it with any legal means to minimize the ‘forking’ of its core, Google created the Open Handset Alliance at the inception of Android precisely to provide this means. As mentioned earlier, while ‘forking’ is a fairly normal and desired phenomenon in open source projects, the OHA’s anti-forking restriction explicitly prevents members from doing so. There is a broad understanding in the industry that participation in the OHA is a prerequisite for meaningful collaboration with Google on Android, and consequently the majority of hardware device manufacturers are members of the Open Handset Alliance. By putting this agreement in place, Google significantly limits the channels through which alternative software platform vendors can create Android-derived products, as leading hardware vendors participating in the OHA are restricted from collaborating with them.
Amazon experienced this when it searched for hardware partners to help build its Fire line of devices, ultimately settling on an original equipment manufacturer with minimal prior exposure to the mobile industry as a result. In the recent past, Google has been aggressive in the enforcement of this agreement, going as far as threatening OHA member Acer Computers with the termination of its Google Mobile Services license to prevent the hardware manufacturer from shipping a device with Alibaba’s Yun OS, despite some controversy about whether the Yun OS should technically be considered a “fork” of Android [39].

Google’s changes in behavior illustrate the dynamic nature of managing the tension between inter-network and intra-network competition. As mentioned earlier, open source platform leaders must proactively hypothesize how the dynamics of competition will play out, in order to put in place mechanisms that offer competitive leverage later. In the Android example above, had Google failed to anticipate the emergence of “forks” such as those created by Amazon, it would not have put in place the anti-forking clause that provided it with one of its few means to deter a well-resourced and capable competitor from challenging it as Android’s platform leader.

Suppliers – Securing the Upstream Value Chain

The primary constraining ‘supply’ of the software industry is engineering talent. Depending on the specific domain of software, the talent required may be highly specialized. For example, IBM’s inception of the Eclipse project was made possible by the unique and highly specialized competencies that it received through its acquisition of Object Technology International.
If the required engineering talent is scarce, the possession of such human resources is a significant protective barrier for open source vendors even if their software is highly open from a licensing perspective. Moreover, even in fields where capable talent is not scarce, the structure of many open source projects imposes limits on the supply of engineers who can materially affect the design of a given project. In many open source projects of scale, access to the main code-line is governed by a relatively small group of individuals who have demonstrated competence with that project. Depending on the project structure, this group may be known as “committers”, “reviewers” or “maintainers”. More importantly, there are often official or de facto technical leaders in most FOSS projects of scale who are responsible for making the major design decisions; the authority to grant “committer” status to individual contributors is sometimes also held by this group. Typically, this leadership group is kept fairly small. Securing access to this group is therefore a critical determinant of an open source platform contender’s ability to influence its upstream value chain.

• PMC Chair: The Chair of a Project Management Committee (PMC) is appointed by the Board from the PMC Members. The PMC as a whole is the entity that controls and leads the project. The Chair is the interface between the Board and the Project.
• PMC Member: A PMC member is a developer or a committer that was elected due to merit for the evolution of the project and demonstration of commitment. They have write access to the code repository, an apache.org mail address, the right to vote for the community-related decisions and the right to propose an active user for committership. The PMC as a whole is the entity that controls the project, nobody else.
• Committer: A committer is a developer that was given write access to the code repository and has a signed Contributor License Agreement (CLA) on file. They have an apache.org mail address. Not needing to depend on other people for the patches, they are actually making short-term decisions for the project. The PMC can (even tacitly) agree and approve it into permanency, or they can reject it. Remember that the PMC makes the decisions, not the individual people.
• Developer: A developer is a user who contributes to a project in the form of code or documentation. They take extra steps to participate in a project, are active on the developer mailing list, participate in discussions, provide patches, documentation, suggestions, and criticism. Developers are also known as contributors.

Figure 11 - Hierarchy of influence within an Apache Software Foundation project. Adapted from the ASF [40]

There is tremendous variety in the contribution model and distribution of decision-making authority amongst open source projects (Table 9). Aspiring open source platform leaders must understand the decision-making structure of their prospective community in order to secure the resources required to affect the technological trajectory of their platform (Lever 2). For example, for projects governed by the Apache Software Foundation, “committer” status is relatively scarce and is granted by the Project Management Committee (PMC), which is also responsible for resolving the major design decisions affecting the project (Figure 11). As a consequence, aspiring platform firms that wish to affect the technological trajectory of the project must secure some critical mass of individual committers as well as adequate representation within the PMC. Given that PMC members “are participating as individuals… affiliations do not cloud the personal contributions”, Apache-based platform firms must retain the services of the specific individuals who already reside on the PMC if they wish to influence the design of the technology.
In contrast, the Linux development process operates on a much more open “bazaar” style basis, with the majority of design decisions being made via publicly accessible mailing lists and the only ‘governance’ process being the actual mechanics by which a “maintainer” reviews and integrates individual submitted patches into the mainstream code-line. It is entirely possible for a firm to hugely affect the design trajectory of Linux without employing anyone who is an official “maintainer” of a Linux module. This open decision-making process results in a significantly larger supply of engineering resources who can make substantial contributions when compared to the more constrained pool of committers in an Apache-governed project.

Regardless of their specific organizational structure, most open source communities define themselves as transparent meritocracies. This means that authority and influence within the community arise as a consequence of demonstrated contributions within the community rather than from a role or rank assigned by some “higher” authority. This leads to the emergence of de facto technology leaders in most open source communities. Paradoxically, the “openness” of this meritocratic philosophy actually greatly restricts the supply of talent that an aspiring platform leader can acquire in order to lead the design trajectory of an open source platform at any given time. While a proprietary platform leader can bestow on any capable candidate the authority to lead the technical direction of a proprietary platform, an aspiring open source platform leader must look to employ an established leader of the community if it wishes to secure influence over the trajectory of the platform. Therefore, the attraction and retention of highly visible community leaders is a critical aspect of establishing and maintaining platform leadership in the open source world.
Community | Authority | Official Description
Apache Software Foundation | Project Management Committee (PMC) | “The PMC is the vehicle through which decision making power and responsibility for oversight is devolved to developers.” [41]
Apache Software Foundation | PMC Chair | “[The] chair is a facilitator and their role within the PMC is to ensure that everyone has a chance to be heard and to enable meetings to flow smoothly.” [41]
Eclipse Foundation | Project Management Committee | “ensure that their Project is operating effectively by guiding the overall direction and by removing obstacles, solving problems, and resolving conflicts;”
Eclipse Foundation | Project Lead | “ultimately responsible for ensuring that the Eclipse Development Process is understood and followed by their project” [42]
Eclipse Foundation | Architectural Council | “responsible for… monitoring, guiding, and influencing the software architectures used by Projects” [42]
Eclipse Foundation | Planning Council | “The Planning Council is further responsible for cross-project planning, architectural issues, user interface conflicts, and all other coordination and integration issues.”
Linux Foundation | Maintainers | “determines whether the code should be accepted into the development tree, or returned for revision” [43]
Mozilla Foundation | Module Owners | “responsible for leading the development of a module of code or a community activity” [44]
Mozilla Foundation | Release Drivers | “provide guidance to developers as to which bug fixes are important for a given release and also make a range of tree management decisions.” [44]
Mozilla Foundation | Super-Reviewers | “approval of a super-reviewer is generally required to check in code” [44]
Mozilla Foundation | Ultimate Decision-Makers | “The ultimate decision-maker(s) are trusted members of the community who have the final say in the case of disputes. This is a model followed by many successful open source projects, although most of those communities only have one person in this role, and they are sometimes called the ‘benevolent dictator’” [44].
Table 9 - Decision Making Authorities in different Open Source communities

It is worth noting that Google sidestepped many of these issues in its establishment of the Android Open Source Project. Although the source code of Android is publicly published and contributions from the community are welcome, the fact that each new version of Android is designed behind closed doors at Google means that some of the communal and meritocratic nature of open source development is absent from Android’s development. As a consequence, although Google does not benefit from the power of community development that motivates most open source projects, it is also not hindered by the constraints that community development imposes.

As most open source platform projects are complex efforts composed of a hierarchy of dependent sub-projects, platform vendors must thoroughly understand the architecture of the platform and formulate their position on which sub-projects are strategic in order to secure the right talent for affecting the platform. For example, at the time of writing, the Eclipse Platform consists of twelve top-level projects, which are in turn composed of 243 sub-projects [45]. A firm wishing to become a platform leader based on Eclipse must decide which of these 243 projects materially affect the platform from the perspective that matters to it and invest its engineering resources appropriately. Each platform provider within the same ecosystem might hold a different perspective on which modules are most critical depending on its hypothesis of which sides of the network it wishes to focus on. In other words, aspiring platform leaders are most likely interested in the projects that represent external interfaces for the platform complements that they see as most strategic, or in modules that represent user interfaces if they see driving user adoption as most critical.
The extent to which a firm may find it necessary to invest in the internal ‘core’ of a platform depends significantly on the maturity of the platform and the level of inter-network competition. If the platform is relatively immature and unstable, and the level of inter-network competition is intense, a platform vendor would find it necessary to focus its efforts on acquiring the talent needed to stabilize the core and make the platform more viable. As a platform reaches maturity and the focus of competition shifts from inter-network competition to intra-network competition, platform vendors may find it less important to invest in the ‘core’ technologies and may instead focus their energies on affecting the peripheries that act as interfaces into the platform or on delivering capabilities that differentiate their specific versions of the platform.

Generalizing from all of this, it becomes clear that how a firm chooses to collaborate with the open source community significantly impacts whether (and how) it can secure the critical resources necessary for success.

Complementors – Identifying and Securing Critical Complements

Beyond engineering talent, platform builders require the supply of key complements in order to make their platform viable. While aspiring platform leaders seek to engage a large number of industry participants in providing complementary products or services for their platforms, it is sometimes the case that providers of specific types of key complements are few or even nonexistent. In such cases, the platform leader may choose to intervene either by providing extraordinary support for those complementors or by directly participating in the complement ecosystem itself to boost the supply of the required complements. For example, Cusumano and Gawer documented Intel’s creation of the Content Group in order to help spur the creation of multimedia software, which was scarce at the time.
The pair also documented Intel’s venture into the chipset and motherboard business after finding that the existing vendors in the business were not keeping up with the needs of the platform [10]. However, beyond reinforcing the ecosystem to enable network effects, controlling key complements through intervention and involvement in complement creation can also arm a platform leader with competitive leverage against alternative platform vendors.

The management of key complements is a tactic that is well recognized and overtly managed in certain markets such as video game consoles. In their survey of the various strategies of video game console makers from 2005 through to 2007, Daidj and Isckia found that Microsoft relied heavily on the advantage gained by exclusives such as the Halo franchise to drive the adoption of its platform, the Xbox 360 [46]. Perhaps more interestingly, James Prieger and Wei-Min Hu pointed out in a separate paper on the same industry that possessing exclusive complements is only effective in driving platform adoption if the majority of complements available are non-exclusive. Moreover, the pair found that a “small amount of exclusivity… would be enough to foreclose competitors from all the important sources of supply of the complementary good” [47].

The leverage afforded by exclusive access to key complements is particularly relevant in the intra-network competition for open source platforms, as there is generally a lower level of differentiation amongst vendors of the same platform and no technical barriers creating differences in ecosystems. In other words, open source platform vendors often compete within the conditions for effective complement-exclusivity advantages that Prieger and Hu identified. This fact has also manifested itself in the Android case, where Google chose to withhold its complementary mobile services (e.g. Gmail, Maps, Google Now etc.)
from Android variants. Given that Google makes the complementary services within its Google Mobile Services portfolio available to even competing platforms such as iOS, it may initially appear odd that Google would refuse to provide these capabilities to other Android systems. However, Google’s decision reflects the different competitive dynamics of inter-network and intra-network competition. Since iOS and Android are fundamentally different platforms with significant differences in capabilities and ecosystems, the availability of Google Mobile Services is less likely to materially affect a customer’s relative preference for Google’s platform in comparison to Apple’s. In contrast, Amazon’s Fire OS and Google’s Android are very similar technical platforms with a much smaller difference in capabilities and with significantly overlapping application ecosystems (applications that are available on Android can run on Amazon’s Fire devices if Google’s proprietary extensions are not used, or if the developers substitute Google’s services with Amazon’s offering). Therefore, the availability or absence of Google’s class-leading services may materially affect consumer preferences for one vendor’s Android variant over another. In light of this understanding, Google’s decision to withhold its Google Mobile Services (e.g. Gmail, Google Maps, Search) from users of alternative Android platforms appears to be a sensible and strategic means of creating greater differentiation between its platform and those of its intra-network rivals.

Unlike Google, most firms do not have the luxury of exclusive ownership of key platform complements and often have to invest in cultivating relationships with partners just to secure access for their platforms.
In order to do so in a cost-effective and timely manner, aspiring open source platform leaders must form clear hypotheses on which types of complements are critical, so that they can secure superior access either through internal development or by developing partner relationships.

Buyers – Controlling the Path to the Customer

As the right-hand side of Figure 10 illustrated, an open source platform facilitates the technical connection between customers and complement creators without consideration of who is providing the underlying platform. For example, Android application developers can largely be assured that their product can technically be sold to users of Google’s Android as well as Amazon’s Fire OS with only a modest amount of additional investment. In other words, the technical platform is often undifferentiated from the perspective of the complement creator, even if the provider manages to differentiate its platform variant to end consumers. In fact, the complement creator prefers a greater level of commonality across different platform variants in order to minimize the amount of customization for its products. As a result, a platform provider needs to find other means of differentiating itself from the other providers of the same platform.

One way that a platform provider can differentiate itself from its rivals is by controlling the complementor’s path to platform customers. Depending on the market that the platform serves, the relationship between the platform provider and the end customer may vary greatly in nature and intensity. For example, in enterprise software, vendor-customer relationships tend to be highly intense, as vendors tend to have relatively few customers, each representing a non-trivial fraction of a vendor’s revenues. As a result, each customer holds substantive bargaining power.
While such a structure may appear to weaken the bargaining position of platform vendors, it actually provides important leverage for platform vendors to affect the behaviors of complementors. If an open source platform provider is able to forge strong and exclusive relationships with key customers, and there are relatively few customers on the market, it can act as an effective monopoly on the ecosystem from the perspective of the complementor even if it is not the exclusive provider of the platform technology.

Figure 12 – The Purchase Process of Complements – While an open source platform enables a complement provider to provide its products to a given customer, the onus is still on the complement provider and customer to discover each other. If a platform provider can facilitate this connection in a superior manner, it can differentiate itself from alternative platform providers.

Beyond establishing exclusive relationships with key customers, a platform provider can also exert influence over complementors by positioning itself as a facilitator in the purchase process of complements (Figure 12). Although an open source platform provides a unified technical infrastructure that binds an ecosystem together, the existence of a unified platform does not imply the existence of a unified purchase process. In other words, while a complementor's product can be delivered to the customer thanks to the technical infrastructure provided by the platform, the business process of discovering, evaluating and purchasing the solution is not a problem resolved by the existence of an open source platform. Consequently, if a platform provider can facilitate this process better than other platform providers and better than network participants can accomplish on their own, that platform provider is able to create a preference for its platform variant over others.
As it turns out, this has also been an important tactic in the intra-network competition between platform providers in the Android ecosystem. Each of the different Android platform providers has invested in its own proprietary application marketplace in order to facilitate the purchase of applications by customers. While application vendors are able to create and distribute applications on their own, these application marketplaces provide customers with a simpler and faster means of discovering and purchasing new complementary “apps”. Although alternative third-party marketplaces exist for the same purpose, the marketplace offered by a platform provider has the distinct advantage of being pre-installed on the devices that ship with its platform. In order to secure superior access to these channels, application vendors are compelled to establish relationships, sometimes exclusive ones, with a specific platform provider, even if there is no technical reason for doing so.

While some may argue that the primary motivation for creating an application store is to monetize the activities within a platform ecosystem, Google has made decisions that appear to contradict such an objective. For example, the company limits access to its electronic application and content marketplace (“Google Play”) to devices produced by manufacturers that it approves (effectively manufacturers that participate in the OHA) [48]. If the company’s objective were to capitalize on the transactions that occur within the Android ecosystem, it would likely have taken a similar approach to Amazon; Amazon makes its App Market available not only on its Fire devices, but also on Google-approved Android devices as well as Blackberry 10 devices [49]. Despite Amazon’s efforts, Google’s “Play” store is the largest marketplace for Android-compatible applications with an estimated 1.3 million applications compared to Amazon’s 240,000 as of June 2014 [50].
Having exclusive access to such a vibrant marketplace affords Google’s Android offering a substantial competitive advantage over its intra-network competitors.

Amazon’s response, its own App Market, was not purely a means of matching Google’s channel for delivering complements to its platform consumers, but also a means of giving the company leverage to influence the architecture and technical interfaces of a platform it otherwise has relatively little influence over. Developers who choose to sell their products on Amazon’s App Market need to ensure that their apps work with Amazon’s devices, which in turn requires the substitution of Google’s proprietary services and APIs (e.g. Google Maps) with Amazon’s version.

The aforementioned mechanisms for platform leverage rely upon the platform provider’s involvement in the value chain between a complement producer and the customer beyond supplying the pure technical infrastructure offered by the platform. However, a platform provider can also deter competitors by studying the value chain between the platform itself and the customer. It is often the case that the path between the platform provider and the customer is actually an indirect route controlled by a few intermediaries. Consequently, an open source platform provider can attempt to recreate the effective monopoly of a proprietary platform vendor by establishing exclusive relationships with those intermediaries. This was clearly a tactic utilized by Google in an attempt to limit the fragmentation of the platform through the Open Handset Alliance. As mentioned earlier, customers of a mobile operating system typically adopt a given platform by purchasing a device from one of several major device manufacturers.
By securing effectively exclusive relationships with those hardware device manufacturers through its Open Handset Alliance program, Google greatly restricts the extent to which alternative platform providers can displace it as Android’s platform leader.

Google’s ability to enforce its platform leadership through the OHA program hinged upon the company’s identification of hardware vendors as the critical nodes on the value chain to customers in the mobile industry. In the highly intertwined marketplaces that many open source platform vendors compete in, the nodes and relationships connecting the platform vendors to the customers can be complicated, often resembling a network rather than a chain. In the case of Android, Google additionally identified network providers and implementation partners (such as Accenture or Wipro) as nodes on the path to the customer and has consequently included such firms in its Open Handset Alliance program. The preemptive identification of these critical nodes allowed Google to establish superior relationships with them, and to put in place legal agreements that ensure exclusivity (i.e. the “anti-forking” clause of the OHA).

Much like how different firms may come to different perspectives on what modules within the platform are most strategic, different firms may also hold different hypotheses on which relationships and complements are most critical to establishing control over the ecosystem. The hypothesis held by the firm on which types of relationships (or even which specific relationship) are most critical to manage may even significantly impact its “scope of the firm” (Lever 1) decisions, as firms seek to avoid conflict with key members of the business network. The relationships that affect the purchasing process of the platform and of complements are amongst those most critical to establishing ecosystem influence.
While the strategic management of these external relationships is also important to proprietary platform vendors, this dimension of platform management is especially critical to the open source platform vendor in light of intra-platform competition with alternative vendors. Given that open source platform vendors are technically replaceable, the correct identification and possession of those key relationships serve as a critical means of asserting platform leadership for open source firms.

Substitutes and New Entrants – The Threat of Shifting Platform Boundaries

Beyond competing with alternative platforms and rival providers of the same platform, platform vendors must consider the threat of substitute technologies that can be considered alternatives to adopting a platform altogether. While all product firms (platform or otherwise, proprietary or not) face the same threat of substitution, platform firms in general and open source vendors in particular need to be specifically aware of alternatives that can emerge from changes in the definition of platform boundaries. As mentioned earlier in the literature review, scholars have found that “platform envelopment” is one of the most effective strategies for displacing an entrenched platform. In particular, the “foreclosure attack” can be viewed as a redefinition of the platform boundaries to a vastly larger scope, substituting the need for a specific platform with capabilities integrated into a broader platform with known demand. One can reason that open source platforms appear to be more susceptible to this type of substitution, as there are lower technical barriers for an attacker to integrate the capabilities of an open source platform into the context of a broader one.

While all platform vendors face the threat of envelopment, open source platforms are perhaps uniquely susceptible to the threat posed by shifting platform boundaries.
An open source platform is typically a complex system of subsystems loosely connected through a network of related projects. As a consequence, enterprising vendors or members of the community can choose to re-interpret platform boundaries and create new offerings bundling different subprojects together. Although proprietary platforms are also complex compositions of smaller subsystems, proprietary intellectual property holders possess the unique ability to determine how these internal subsystems are bundled together. For example, a customer cannot choose to adopt just one aspect of the Apple iOS operating system without purchasing the entire platform. The entire definition of what it means to consume the platform is at the discretion of Apple, motivated at least partially by the business objectives it faces at any time. No other participant within the ecosystem holds the power to define platform boundaries.

Figure 13 - The ability for vendors to create distributions can have the undesirable effect of fragmenting the ecosystem if there is variation between platform products that impacts platform users. In the above example, distributions 1 and 2 are variants of the platform produced by different open source vendors.

In the open source world, members of the community not only have the ability to substitute one implementation of a subsystem with another, but can also define the boundaries of the platform differently. If a member of the community believes a specific subsystem is useful and “core”, it can choose to bundle it as its own “distribution” of the platform. As mentioned in the literature review of this paper, a license that enables distribution creation is a fundamental criterion for software to be considered “open source” and was the original business model upon which the largest open source business in history (Red Hat, Inc.) was based [24].
Distribution creation creates variations in the definition of the platform, which can blur platform boundaries and can fragment the ecosystem if the components being varied touch interface points with platform consumers. Figure 13 illustrates this with a hypothetical open source platform with two distributions. Distribution 2 varies from distribution 1 in that subsystems 2 and 4 have been substituted with subsystems A and B respectively. While the replacement of subsystem 4 with subsystem B is simply a technical implementation decision that does not impact platform users, substituting subsystem 2 with A can mean that the Type A complements created for distribution 1 do not work with distribution 2, fragmenting the ecosystem and compromising the strength of indirect network effects. Similar fragmentation can occur if the distributions introduce or omit interfaces differently, creating a fuzzy platform boundary. As an example, the two major variants of the desktop operating environment (GNOME and KDE) distributed by major distribution creators fragmented the interfaces that desktop application developers used to create desktop applications for Linux until a common interface was established through Project Portland by the Desktop Linux Working Group [51]. This type of platform fragmentation reduces network effects and harms the platform’s ability to compete with alternative platforms.

Beyond the challenge of fragmentation, open source platform leaders also face the possibility that key subsystems of the platform will be reused in other contexts by other firms or individuals as a means of “hijacking” the platform’s ecosystem. Amongst the large number of subsystems and modules within a software platform, it is often the case that only a fraction of those subsystems are materially involved in enabling interactions with a given type of complement.
If a firm is able to isolate these core subsystems, it can reuse these modules to allow those same complements to interact with another product or even another platform. While such an approach is theoretically possible with proprietary platforms with open interfaces, the technical barriers to executing such a tactic are extraordinarily high. Competing vendors who want to leverage complements built for a specific proprietary platform must reverse engineer the implementation underlying the platform based on the interface definition and replicate the behaviors of the base platform. Depending on the implementation technology and the degree of coupling between the complement and platform, this is a task that ranges from difficult to effectively impossible. However, within the open source world, a firm seeking to ‘hijack’ a platform’s complements does not need to perfectly replicate the behavior of some unknown black box, but can rather directly modify and integrate the core components required.

As an example, Figure 14 shows a hypothetical platform comprising subsystems one through five. Suppose that Type B complements are desirable to an alternative platform product competing in an adjacent platform market or to an inter-network competitor. Given that subsystem 2 is the only interface that Type B complements interact with, a competitor can simply integrate that subsystem into its own platform and offer compatibility with, and support for, Type B complements. Since subsystem 2 requires supporting capabilities from subsystems 3 and 4, the competing vendor can choose to also integrate those components into its product, or replace those subsystems with its own.

Figure 14 – Ecosystem hijacking - competing vendors can isolate the interfacing and enabling subsystems for a given complement type and choose to redeploy them in another context to expedite their own objectives.

Google experienced this hijacking phenomenon with the Android platform.
In the case of the Android platform, the application framework represented by the Android Software Development Kit (SDK) and the execution environment known as the Dalvik Runtime are the primary components involved in supporting the consumption of Android applications on the platform (left-hand side of Figure 15). While the application framework does depend on capabilities provided by other core libraries within the Android platform, the Dalvik module represented the bulk of the required complexity. Blackberry was able to integrate Dalvik into its own proprietary Blackberry 10 OS based on the QNX kernel it had acquired. In doing so, Blackberry was able to bootstrap its own ecosystem by supporting Android applications and vastly reducing the cost of multi-homing for application vendors to support its platform [52].

Figure 15 – High-level System architecture of Android and Blackberry OS 10. Adapted from http://developer.android.com/images/system-architecture.jpg

While numerous other factors have prevented Blackberry from meaningfully contending for a platform leadership position, the fact that Blackberry was able to leverage Android’s success to substantially enhance the competitiveness of its own proprietary platform illustrates the viability of this hijacking tactic for the aggressor, and the threat it poses to the platform incumbent. It should be noted that Blackberry’s tactic would have been nearly impossible or illegal with a proprietary platform such as iOS.

The lack of control over an open source platform’s design and architecture (Lever 2) creates significant risks for a platform leader. These risks require constant monitoring and management. The common open source business model of distribution creation has the potential of fragmenting the platform and reducing network effects.
Distribution creation can also shift platform boundaries by including or excluding modules, rendering key assets and assumptions held by the firm invalid. The availability of source code and the ease with which a platform can be decomposed into individual parts also mean that open source platforms can be “hijacked”, as competitors can repurpose key subsystems for competing purposes. The open source platform leader needs to remain vigilant in identifying such threats and ensuring that it has countermeasures to defend against them.

Chapter Summary

The mobile operating system space is a highly competitive market involving some of the technology industry’s most powerful players. The fact that the leading platform (by market share) is an open source late entrant is a testament to the increasing relevance of the open source model in the modern computing industry. However, Google’s Android project is an unconventional open source project in that the contributions of the open source community are largely limited; Google explicitly chose not to take advantage of the talent within the open source community in order to maintain greater control over the trajectory of the platform. This decision clearly indicates that Google does not perceive the primary reason for participating in open source to be the ability to leverage the resources of the open source community, but rather some other attribute. With the reasonable assumption that the profit-seeking corporate entity known as Google is not releasing the intellectual property behind Android for purely altruistic reasons, one can reasonably infer that the decision to adopt the open source model stems from a desire to accelerate adoption on both sides of the platform and to catalyze network effects.
While the use of an open source model can remove adoption barriers for platform users, particularly on the side of complement producers, the forfeiture of intellectual property rights and design authority significantly limits the means that a platform leader can use to direct the trajectory of the ecosystem for its own benefit. As a result, platform contenders must find alternative means of exerting their influence. In order to do so, contenders must first determine whether inter-network competition (winning against alternative platforms) or intra-network competition (winning against alternative providers) is the more immediate need and then form a perspective on how this may shift over time. In addition, the firm must stay abreast of shifts in the perceived platform boundary, which can expand or contract without its approval, and ensure that it has the means to remain the primary beneficiary of the platform’s continued growth. An understanding of the above two factors will shape the behavior of the vendor with regard to how it interacts with the key suppliers (engineering talent and complement providers) and the extent to which it intervenes in the purchasing process of complements and the platform itself.

Google’s experience with Android and its challenges in maintaining control of the platform that it sponsored illustrate the many challenges of managing an open source platform. Despite the fact that Google has chosen to adopt a relatively closed development model for advancing Android, competing vendors are fracturing the platform in a manner that is incompatible with Google’s business objectives. Given that these competing efforts are completely legal from a licensing perspective, one of Google’s few means of influencing the ecosystem comes from its control over the key complements within its Google Mobile Services portfolio.
By controlling that key asset and leveraging its initial exclusivity of Android development expertise, Google is able to strike critical agreements with members of the value chain in an effort to block out alternative platform providers. These agreements help Google remain the de facto platform leader for Android, despite the fact that powerful rivals have emerged.

Perhaps the most surprising lesson from the Google case study is that open source platform leadership may require access to substantial complementary assets and capabilities. Given that Google is largely staffing the development of Android with its own employees with little contribution from the community, it appears fair to assert that the decision to open source Android has not reduced the amount of effort required for Google to launch its own mobile platform offering. However, it is unlikely that Android would have experienced its level of success had it been launched as a proprietary offering; therein lies the double-edged sword of an open source platform strategy.

Chapter 3 – A Case Study on Hadoop

History and Origins

Doug Cutting and Mike Cafarella were struggling to solve major scalability problems with their open source web search engine project, Apache Nutch, when Google engineers Jeffrey Dean and Sanjay Ghemawat published their paper on MapReduce in December 2004 [53]. Their implementation of the MapReduce idea ultimately led to the creation of the ‘big data’ platform now known as Apache Hadoop. Search engines such as Nutch need to traverse billions of pages in order to generate a lookup data structure known as a search index, and this is a computationally expensive endeavor that requires the storage and processing of an enormous amount of data.
Given modern hardware, such a challenge could only be reasonably tackled if the work were massively parallelized between hundreds or even thousands of computers (also known as nodes) working in a coordinated manner. The complexity of managing this type of large-scale distributed computing was beyond what Cutting and Cafarella were able to tackle as part-time open source software developers. The MapReduce paper represented an elegant solution to this problem by offering a simple programming model for describing parallelizable processing algorithms and a framework for executing them. This paper, in combination with a previous paper on the Google File System [54], described a robust and general-purpose distributed data processing platform for exactly the type of batch processing that Nutch was doing. Recognizing this, Cutting and Cafarella implemented the ideas described in the papers using the Java programming language and ported the major algorithms in Nutch to this framework. This effort allowed Nutch to scale significantly beyond what the pair had been able to achieve with their previous homegrown efforts.

Around the same time, internet search provider Yahoo! was prototyping a redesign of its own distributed processing infrastructure called “Dreadnaught” based on the same MapReduce and GFS papers under the leadership of Eric Baldeschwieler. After discovering Cutting and Cafarella’s effort with Nutch, the firm decided to abandon its internal development and adopt the pair’s work. According to Owen O’Malley, a founding member of Hadoop-vendor Hortonworks and a member of the original Yahoo! team, there were two main reasons that the team abandoned its efforts in favor of what was being done in Nutch. Firstly, Cutting and Cafarella’s implementation was already proven to scale out to dozens of machines in Nutch, while Yahoo!’s efforts were less mature and unproven. Adopting Hadoop would allow the Yahoo!
team to roll out a cluster of machines for its research staff to experiment with immediately. Secondly, the individual developers on the Yahoo! team had a preference for working in open source and they had an easier time convincing the firm’s legal department to do that with Nutch than with Dreadnaught, since Nutch was already available to the open source community [55].

Yahoo!’s decision to embrace the Nutch framework was aided by a trio of supportive executive sponsors in Qi Lu, Jan Pederson and Raymie Stata, who were leading the search division at Yahoo! in different capacities at the time. In particular, Stata was a director on the board of the Nutch Foundation and was familiar with both Hadoop and the team behind it. Yahoo! hired Doug Cutting in January 2006 and spun Nutch’s distributed processing framework into its own Apache open source project a month later. Cutting arbitrarily named the project after his young son’s toy elephant and Hadoop was born.

Hadoop and the Big Data Phenomenon

Today, Hadoop is associated with the phenomenon known as “Big Data”. The term “Big Data” is attributed to John Mashey of Silicon Graphics and is used to refer to both the opportunities of, and the challenges with, the rapid growth of available data [56][57]. Doug Laney of the META Group published a research note in 2001 entitled “3D Data Management: Controlling Data Volume, Variety and Velocity”, which identifies three primary dimensions that drive the complexities of managing “big data” [58]. In the note, Laney points out that conventional approaches to data management have limits along each of these dimensions. Data that exceeds these limits requires novel techniques to be employed.
In 2012, Laney (then at the Gartner Group) published an often-cited definition of Big Data: “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” [59]

Characteristic | Description
Volume | The amount of data that needs to be stored, processed and analyzed. When this quantity is increased dramatically, conventional storage and processing techniques either fail disastrously or perform unacceptably.
Variety | The different types of data that need to be stored and processed. Conventional data management has been focused on the management of structured, tabular data generated by transactional systems.
Velocity | The speed at which data needs to be stored, processed or retrieved.

Table 10 - The Three V's of Big Data

With the advent of the internet and various connected, sensor-enabled machines, new data sources have emerged that significantly increase the requirements on all three dimensions. Moreover, the value captured within these new sources of data is unknown until they are analyzed. While there is often general acceptance that there is value locked within these ‘big data’ sources, the specific means through which that value is unlocked is often unknown at the time of data collection. With conventional data management technologies, the cost of collecting such data is often too high to justify the upfront investment. In its 2012 survey, Forrester Research found that 88% of data collected by enterprises was discarded because the organizations could not justify the costs of collecting it. Hadoop addresses this problem by providing a cost-effective, flexible and scalable platform for collecting and analyzing data with minimal upfront costs.
Understanding how Hadoop resolves this problem requires a rudimentary understanding of how a Hadoop-based data management approach fundamentally differs from conventional data management technologies such as the relational database.

The Relational Database

Since the late 1970s, the dominant design for conventional database management systems has been the relational model. Invented by IBM computer scientist Edgar Codd in the early 1970s, the relational model offers a flexible means of storing information by handling all data as tuples (i.e. rows in a table) and relations [60]. This model succeeded earlier designs such as the hierarchical model and the network model, and offered a more flexible and efficient means of representing a wide variety of data structures.

The industry standard for interfacing with relational database management systems is the Structured Query Language (SQL). SQL is a declarative programming language, which means that it allows applications or users connecting to the database to describe what they would like to store in or retrieve from the database without having to tell the system exactly how to complete that operation.

Systems built on the relational model are known as Relational Database Management Systems or RDBMSs. In order to store data within an RDBMS, users must first define the structure of the tables, specifying the types of data that will be stored and describing their relationships. This data model is also known as a “schema”. The existence of a defined schema helps a relational system enforce data integrity and enables the efficient storage and processing of data. This requirement to have a defined schema prior to the storage of data is known as “Schema-on-Write”. In general, defining a robust data model that meets the needs of the application and the business is a time-consuming task for a sophisticated database designer.
This process creates an upfront cost for businesses that must be paid before the first bit of data is collected and processed by the system.

Beyond designing the schema, businesses implementing RDBMSs must also estimate the quantity of data to be collected, along with the pace at which data will be read or written, in order to determine the combination of hardware and software that is needed to handle that load. This is known as “system sizing”. While it is possible to “scale out” RDBMSs retroactively after a system has been implemented and rolled out, this is generally a costly and difficult proposition for conventional relational systems because they are optimized for vertical rather than horizontal scalability. According to Nikita Shamgunov, CTO of MemSQL, “enterprise-class database systems run well on powerful hardware, and there are many forces within the industry aligned to make this happen – not just software vendors, but also hardware manufacturers who want to show how many more transactions per second they can push on new hardware.” [61] As a result, significant engineering effort has been invested into ensuring that RDBMSs can leverage increased capacity on a single machine (“vertical scalability”). However, increasing the storage or processing power of an individual machine is not always feasible due to the limits of hardware.

Expanding the capacity of the overall system by introducing additional machines (i.e. “horizontal scale out”) is challenging for conventionally designed database systems. Typically, such efforts require the repartitioning and redistribution of data in order to account for the new machines. This can introduce lengthy and costly interruptions to the operations of the system. As a consequence, conservative sizing practices are often adopted for relational databases, which further increase their upfront costs.
Moreover, even conservative sizing is a difficult proposition for new and emergent “big data” sources whose eventual volume and velocity are unknown.

Hadoop to the Rescue

Hadoop offers a fundamentally different approach to the data management problem than conventional RDBMSs. While the Hadoop platform itself is a generic distributed storage and computing framework, various database management systems have been built on top of this framework (e.g. Apache HBase, Accumulo). These systems generally fall in the class of “NoSQL” (“Not Only SQL”) designs and are not relational in nature. Hadoop usage for “Big Data” also involves the persistence and processing of data as raw files, which computer scientists would not classically consider database systems. Neither of these usages of Hadoop requires any significant pre-emptive modelling. Instead, the data can be persisted “raw” in the native output format of the data producer, and the “schema” is provided at the time that the data is retrieved. This approach is known as “schema-on-read” or “late-binding”. As the structure of the data is not provided at the time of persistence, a schema-on-read system is unable to enforce the consistency or integrity of the data, nor can it optimize the storage for retrieval in the manner that “schema-on-write” systems do. However, this approach allows for the deferral of the significant upfront modelling costs associated with onboarding a new source of data with relational systems.

Because the workload required by Google grew at a rapid and unpredictable pace, MapReduce and the Google File System were designed to allow the company to add capacity in a cost-effective and flexible manner. The systems were built to run on hundreds or thousands of inexpensive ‘commodity’ machines, rather than a few expensive but powerful ‘server’ computers, to enable cost-effective and incremental capacity increases.
Moreover, the frameworks were designed so that additional machines could be introduced to the system with little to no interruption to system operations and minimal human intervention. This property of “easy scalability” was inherited by Hadoop.

Hadoop’s easy scalability, in combination with the low upfront onboarding costs enabled by its “schema on read” approach, makes Hadoop an attractive option for the persistence of new sources of “big data” whose value and magnitude are yet to be understood. Firms can cost-effectively persist data inside Hadoop without being immediately concerned about how they would use it and be reasonably assured that Hadoop would scale with their needs. Hadoop also equips them with the flexible processing framework needed to extract the value from the data when the time comes. This approach to the management of “Big Data” has become so popular that some industry observers have coined the slang “hadump” to mock the fact that many Hadoop systems have become “dumping ground(s)” of unused data. Despite these criticisms, the fact that Hadoop offers a cost-effective solution to the problem of new emergent sources of data is genuinely valuable in light of the explosion of available data.

Though other distributed and NoSQL technologies exist, Hadoop has become the leading platform in the big data management space, especially for analytical use cases. In their 2014 research report, Forrester Research characterized Hadoop as “a must-have data platform for large enterprises, forming the cornerstone of any flexible future data management platform” [62]. The Gartner Group estimates that by 2015 roughly two-thirds of analytical applications will have integrated Hadoop capabilities [63]. Incumbent enterprise software vendors such as Oracle, IBM and Teradata have taken notice as Hadoop is increasingly challenging their products as the “center of data gravity” in enterprise datacenters.
According to a 2014 survey by Wikibon, Hadoop had displaced traditional data warehouses for some workloads at 61% of the organizations surveyed; another 34% expected to shift some workloads over to Hadoop within the next six months [64]. While it is unclear if Wikibon’s sample is representative of the industry at large, the response does seem to echo the increasing interest by corporations in utilizing technologies beyond conventional relational data warehouses to manage the rising tide of new “big data” sources they face. Google Search Trends data back up this sentiment, as the popularity of the term “Data Warehouse” declined while “Hadoop” and “Big Data” rose over the past decade (Figure 16).

Figure 16 – Google search popularity of “Hadoop” and “Big Data” vs. “Data Warehouse”, 2004 - 2014 [65]

Architectural Overview

While Hadoop originally referred to the open source implementation of the Google File System and MapReduce framework, the term “Hadoop” is now used to refer to the collection of technologies that has coalesced around those two original technologies. Figure 17 presents a diagram of the components commonly found in prototypical Hadoop application stacks at the time of writing, along with some common open source or proprietary implementations.

Figure 17 – Major building blocks within a Hadoop application stack (popular proprietary / open source projects fulfilling a given role in parentheses).

In the sections below, some of these key components are introduced to provide an understanding of how they helped Hadoop become the de facto platform for Big Data. This understanding is a necessary foundation for discussing the strategies of different platform competitors and their complementors.
Distributed Storage

The distributed storage layer within the Hadoop stack is responsible for managing the reliable and efficient persistence of data managed by the system. As Hadoop was designed to run on low-cost “commodity” hardware that can be prone to failure, the distributed storage layer is responsible for providing resilience in the face of hardware failures. It does so by transparently managing redundant copies of the data across different machines. Because the data managed by Hadoop tends to be extremely large (i.e. measured in terabytes or petabytes), Hadoop assumes that it is more efficient to move “computation to the data” rather than the reverse and provides mechanisms to do this.

The Hadoop Distributed File System (HDFS) was the component originally built to meet the needs of this layer, and it remains the most popular storage component within the extended Hadoop stack today. However, other options exist, including MapR’s proprietary Distributed File System, IBM’s General Parallel File System (GPFS), Amazon’s Simple Storage Service (S3) and UC Berkeley’s Tachyon [66]. Moreover, many non-Hadoop distributed NoSQL systems with their own storage subsystems, such as Apache Cassandra and MongoDB, have also been adapted to interoperate with the rest of the Hadoop stack.

Job Managers and Coordinators

The role of the Job Manager or Coordinator is to orchestrate the execution of computation across the many computing nodes within a Hadoop cluster. Originally, Hadoop was designed to handle only batch-based MapReduce jobs used for “embarrassingly parallel” tasks (computing problems that are trivial to break apart and parallelize) such as the page-indexing operation required by Apache Nutch. As a result, the original component for managing that computation was integrated directly into the MapReduce processing framework itself. This component was known as the Job Tracker.
As more data is deposited within the Hadoop file system, the desire to run different types of non-MapReduce programming models and interactive workloads has correspondingly increased. Meeting this demand required the framework to manage computing resources across these different use cases. As a consequence, a more sophisticated coordinator, Apache YARN (“Yet Another Resource Negotiator”), was created to handle different types of computing workloads [67].

While the job manager is arguably one of the most central and critical components within the Hadoop stack, there is not much competition in this space. YARN’s only notable alternatives at the time of writing are the open source Mesos framework created by Berkeley’s AMPLab and the framework-specific job manager originally integrated into MapReduce.

Distributed Processing Frameworks

The MapReduce distributed processing framework that Cutting replicated in Hadoop was designed to offer a simple programming model for software developers writing highly parallelizable programs. Structuring a computational problem so that it can be reliably processed by a large number of computers in parallel had previously been a complex task. MapReduce solved this problem by requiring developers to break their algorithms into the framework’s two eponymous steps: “Map” and “Reduce”. The “Map” step partitions the required data into groups, and the “Reduce” step processes the data within each group and summarizes it. For example, in order to find how many books any given author wrote in a large unsorted library, a map function can partition the library based on the author’s name and the reduce function can count how many books are in each partition. MapReduce’s unique value is that it can execute this process for a library that is spread over thousands of computers and efficiently deliver the result. Figure 18 illustrates this process conceptually.
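The book-counting example above can be sketched in a few lines of Python. This is only a single-process illustration of the programming model, not Hadoop’s actual API: the sample library and the names map_book, reduce_counts and run_mapreduce are hypothetical, and a real framework would run the map and reduce functions on many machines and handle the grouping, distribution and fault tolerance itself.

```python
from collections import defaultdict

# Hypothetical sample data: (author, title) records in no particular order.
LIBRARY = [
    ("Austen", "Emma"),
    ("Dickens", "Bleak House"),
    ("Austen", "Persuasion"),
    ("Orwell", "1984"),
    ("Dickens", "Hard Times"),
    ("Austen", "Pride and Prejudice"),
]


def map_book(record):
    """Map step: emit a (key, value) pair keyed by the author's name."""
    author, _title = record
    return (author, 1)


def reduce_counts(author, values):
    """Reduce step: summarize every value that shares one key."""
    return (author, sum(values))


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        key, value = mapper(record)
        groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


books_per_author = run_mapreduce(LIBRARY, map_book, reduce_counts)
print(books_per_author)  # {'Austen': 3, 'Dickens': 2, 'Orwell': 1}
```

The developer supplies only the two small functions; everything inside run_mapreduce is the part that a distributed framework takes over.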
Figure 18 – Diagram of basic MapReduce execution, taken directly from Jeff Dean and Sanjay Ghemawat’s original article on MapReduce [53].

As long as a given computing algorithm can be structured this way, MapReduce can ensure that it is reliably distributed across massive clusters of computers. As it turns out, many complex algorithms can be decomposed into a series of MapReduce steps, making Hadoop a versatile tool for tackling all sorts of Big Data problems.

However, the single-framework approach in Hadoop is suboptimal for a number of reasons. Firstly, as MapReduce was implemented as a batch-processing framework, it had significant overhead and inefficiencies that make it unusable for interactive end user computing. Secondly, MapReduce stores all intermediate results back into the Distributed File System (partially as a means of ensuring failure resilience), which makes the framework inefficient for algorithms that iterate over the same dataset repeatedly. Many popular machine learning algorithms useful for extracting insight out of “Big Data” fall into this category of “iterative” algorithms. Finally, MapReduce was an inefficient programming model for most software developers. The framework forced developers to formulate their problems in an unintuitive way, which significantly diminished developer productivity [68].

The Hadoop community resolved this third issue of developer efficiency by developing new abstractions that sat on top of MapReduce. These included engines such as Pig and Hive, which enabled programmers to develop in PigLatin (a procedural programming language for data transformation) and SQL respectively. The community also developed libraries such as Mahout, which offered a repository of ready-made Machine Learning algorithms so that individual developers did not have to wrestle with MapReduce directly.
However, these efforts did not address the fundamental deficiencies that MapReduce had as a framework for interactive computing or iterative processing.

Spark, a framework originating from Berkeley’s Algorithms, Machines and People Laboratory (AMPLab), attempts to address most of these problems. Originally created as a part of the AMPLab’s Berkeley Data Analytics Stack (BDAS), Spark was primarily developed by Berkeley researchers independent of the Hadoop community. However, they worked to integrate their technology into Apache Hadoop and it has since been embraced by the Hadoop community. The AMPLab submitted Spark as an Apache Incubator project in June of 2013 and it was accepted as a top level Apache project in February 2014 [69].

In addition to Spark, Apache Tez was also created to address the limitations in computational complexity of the original MapReduce framework. Both Tez and Spark were influenced by a Microsoft Research paper on a system called Dryad [70]. Although the original MapReduce was able to process complex algorithms by connecting one job with another, the framework was designed to handle a relatively simple two-stage processing pipeline with a single input and output at each stage. Dryad (and therefore Tez and Spark) offers an arbitrary number of inputs and outputs at each stage, enabling the expression of a complex processing graph. This fundamental improvement of the processing framework, along with YARN’s management framework improvements discussed in the previous section, is considered a core part of the Apache community’s “Hadoop 2” efforts, which seek to make Hadoop a general purpose distributed processing framework rather than one used only for batch processing [71].

Beyond offering the ability to execute more complex processing graphs, Spark also introduced “In-Memory” computing concepts to distributed processing through a novel abstraction known as the “Resilient Distributed Dataset” (RDD).
The abstraction allows Spark to reliably avoid persisting intermediate results back to disk, enabling the framework to execute iterative workloads orders of magnitude faster than MapReduce or Tez.

As a result of this advantage, Spark has attracted significant attention in both academia and industry. All major distributions of Hadoop now include Spark. Additionally, a growing set of applications, scripting engines and libraries have ported their MapReduce algorithms over to Spark. In July of 2014, Cloudera, Databricks, IBM, Intel and MapR announced a partnership to help the Hadoop community standardize on Spark as the “framework of choice” by porting popular components such as Hive and Pig to Spark [72]. At the time of writing, Spark appears to be positioned to succeed MapReduce as the de facto processing framework for Hadoop systems.

Scripting Engines, Libraries and SQL on Hadoop

As mentioned in the previous section, one of MapReduce’s major drawbacks is that it forces software developers to rethink their algorithms using a framework structured for efficient processing by computers. In response to this, the Hadoop community created abstractions in the form of Domain-Specific Languages and libraries to enable developers to be more effective. For example, the Pig Scripting Engine offers an imperative programming language (“Pig Latin”) that provides developers with operations and data structures useful for manipulating structured datasets. Developers are able to write their data transformation programs using this developer-friendly language, and the Pig Scripting Engine internally translates these operations into machine-friendly MapReduce jobs, thereby offering the massive parallelism and efficiency of Hadoop without imposing the burden of understanding the associated complexity on developers [73].
Similarly, the Apache Mahout project seeks to accelerate the development of machine learning programs on Hadoop by offering a library of common machine learning algorithms that developers can leverage.

Unlike the other layers in the stack mentioned in the previous sections, this layer is optional: it is entirely possible for a fully functional Hadoop system to exist without it. Therefore, these components should not be considered part of the core technical platform. However, from the perspective of evaluating Hadoop as an industry platform, these components are absolutely critical. Many of these libraries are so commonly used within their respective domains that they have become the de facto interface into the Hadoop platform. A large number of key complements that create value for the Hadoop ecosystem depend on the interfaces presented by these components. Because some complementary applications depend exclusively on the interfaces presented by these components, these layers actually make the underlying core platform-level components substitutable.

One notable type of such components are those enabling SQL (Structured Query Language) connectivity to Hadoop data. As discussed in the previous section on the relation between Hadoop and the Big Data phenomenon, SQL is the industry standard used to interface with conventional relational databases and has been in heavy use since it was first commercially implemented in Oracle V2 in 1979 [50]. SQL and the emergence of middleware standards such as ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity) have made it possible for a large ecosystem of analytical software to emerge. Analytical software vendors are able to create tools that support a vast variety of databases from different vendors by simply obeying the SQL standard.
The vibrancy of the analytics software market has created a large number of sophisticated tools for business users, analysts and data scientists to extract insights out of the data deposited within relational data sources. SQL-on-Hadoop allows these same tools to be connected to data stored within Hadoop.

The ability to use SQL to connect to Hadoop also makes it substantially easier to combine and integrate data between a traditional relational data store and data stored in Hadoop. This appeals both to users who seek to gain insight from data split across these two different types of systems and to vendors of traditional data warehouses (e.g. IBM, Microsoft, Oracle, Teradata). Traditional vendors can maintain their positions as the “centers of data gravity” by allowing Hadoop data to be federated or “managed” through their relational data platforms. Table 11 enumerates some of the SQL on Hadoop offerings that GigaOm Research evaluated in 2013.

Vendor / Community: Product / Project Name
Cloudera: Impala
Hadapt: Adaptive Analytical Platform
Teradata: SQL-H
EMC Greenplum: HAWQ
Citus Data: Citus DB
Splice Machine: Splice Machine
Apache: Drill, Hive, Stinger
JethroData: JethroData
Concurrent: Lingual

Table 11 - A selection of SQL on Hadoop offerings as identified by GigaOm Research in 2013 [63]

Administration and Management

Hadoop’s approach to scalability and resilience is fundamentally different from the strategy typically employed in traditional enterprise data centers. In order to offer cost-effective scalability, Hadoop was built to run on “commodity” hardware that is more prone to failure than the types of dedicated servers traditionally found within data centers. Moreover, as the number of computing nodes within a given cluster increases, the probability that there is a failure within the system at some point also increases.
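The link between cluster size and failure likelihood can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every node fails independently with the same probability p during some period; under that simplification the chance that at least one of n nodes fails is 1 - (1 - p)^n. The function name and the per-node probability of 0.001 are hypothetical and are not measurements from any real cluster.

```python
def prob_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails.

    Assumes identical, independent per-node failure probabilities,
    which is a simplification of real hardware behaviour.
    """
    return 1.0 - (1.0 - p_node) ** n_nodes


# Illustrative per-node failure probability for some fixed period.
P_NODE = 0.001

for n in (1, 10, 100, 1000):
    print(f"{n:>5} nodes: {prob_any_failure(P_NODE, n):.3f}")
```

Even with a per-node probability this small, the chance of seeing some failure climbs steadily as nodes are added, which is why failure handling has to be built into the platform rather than treated as an exceptional event.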
Given that they were built to operate clusters containing thousands of computing nodes, Hadoop and its Google predecessor were designed to treat “failures as the norm rather than the exception” [54]. This approach fundamentally affects the work of the datacenter operator, who must become continuously involved by proactively maintaining the health of the cluster. While such a mode of operation is familiar to operators in internet / cloud services companies such as Yahoo! and Google, it represents a novel challenge for enterprise IT departments.

This challenge is exacerbated by the fact that the community of developers who contribute to Hadoop tend to be employed by internet / cloud service companies. As a result, the administrative consoles in the community-delivered versions of Hadoop were originally designed to their preferences. For example, Hadoop originally offered only a rudimentary graphical user interface for managing the basic operations of the cluster, leaving the majority of configuration and management tasks to scripts, configuration files and API calls. This minimalistic approach was sufficient for the more technical developer-operators employed by internet / cloud service companies, but enterprise IT administrators tend to rely on graphical management consoles to simplify their work and have come to expect this functionality in most software that resides in their datacenters. Commercial vendors such as MapR and Cloudera have filled this gap with their own proprietary solutions and use them as a means of differentiating their offerings from the free community version. A free and open source management console was not created until commercial vendor Hortonworks started the Apache Ambari project as a part of delivering its own distribution of Hadoop in 2011 [74].
Market Overview

In its 2014 research report, the Forrester group characterized the Hadoop market as a fragmented market where there were “lots of leaders, but none dominate” [62]. The researchers divided the market into the six major types of players described in Table 12.

Name: Description
Apache Open Source: Users can directly deploy what is made available by the open source community without engaging commercial firms.
Pure play Hadoop Vendors: A number of start-ups have emerged to profit with a “focus on developing, supporting and marketing unique Hadoop distributions, add-on innovations and services”. The “Big 3” of this group are Cloudera, Hortonworks and MapR Technologies.
Enterprise Software Vendors: Enterprise software vendors such as IBM, Oracle, Pivotal, SAP and Teradata offer Hadoop as a part of their own data management solutions. They do so either by creating their own Hadoop distributions or by supporting an existing distribution through partnership.
Hadoop in the Cloud: Cloud computing vendors such as Amazon and Microsoft have begun to offer “on-demand” Hadoop services. This allows enterprises to purchase Hadoop as a service and scale their Hadoop clusters up or down at a moment’s notice.
Big Data Solution Providers: Solution providers are system integrators that design solutions using technologies from others within the Hadoop ecosystem.
Hadoop Accessories: Forrester uses this group to refer to the tools and services that complement the core Hadoop platform.

Table 12 – Breakdown of Hadoop-market according to Forrester Research [62]

Amongst these six categories, the group that receives the most media attention is the pure play Hadoop vendors. This group is led by three startups that together raised over $1.6 billion USD in private equity and venture capital in the five years between March of 2009 and July 2014 (Figure 19).
It is worth noting that the $1.6 billion is only the amount of funding that has been put into the three firms. The actual valuation of the three companies is significantly higher. For example, Intel’s $740 million investment into Cloudera in March of 2014 was exchanged for 18% of Cloudera’s equity, effectively valuing Cloudera at over $4 billion.

Cloudera, Hortonworks and MapR were all founded with the intent of bringing the Hadoop platform from the realm of specialized internet companies to the IT departments of enterprises. As such, these firms should be considered platform providers in the topology of Eisenmann, Parker and Van Alstyne.

Figure 19 – Cumulative Investments in Pureplay Hadoop Vendors according to CrunchBase in 2014 (in Millions of USD) [75]

Year   Cloudera   Hortonworks   MapR Technologies   Total
2009   $11        $-            $9                  $20
2010   $36        $-            $9                  $45
2011   $76        $48           $29                 $153
2012   $141       $48           $29                 $218
2013   $141       $98           $64                 $303
2014   $1,201     $248          $174                $1,623

Table 13 – Cumulative Investments in Pureplay Hadoop Vendors according to CrunchBase in 2014 (in Millions of USD) [75]

While pure play Hadoop vendors tend to be the focus of analyst attention, they currently trail significantly behind enterprise software vendors in capturing value from the Big Data market. According to estimates provided by the research firm Wikibon, the combined revenue of the three major Hadoop vendors totaled approximately $163 million USD in 2013. As all three firms are privately held, Wikibon “triangulated” these numbers through discussions with various industry observers, company insiders and other sources. Consequently, these numbers must be used with caution.
In fact, the November 2014 Form S-1 provided by Hortonworks as part of its initial public offering (IPO) application reveals that the gross billings of the company were substantially less than what Wikibon had estimated [76]. Nevertheless, these numbers are useful for illustrating the relative scale of the “Big 3” Hadoop pure play vendors compared to the scale of enterprise software vendors within the Big Data space (Figure 20).

Figure 20 - Big Data-related Software and Services Revenue (2013) of the Top 3 Enterprise Software Firms vs. Pureplay Hadoop Vendors in millions of USD (from Wikibon, processed data in Table 18 of Appendix) [77]

While the bulk of big data revenue for enterprise software firms comes from their traditional data warehousing products based on relational database technology, enterprise vendors are also making sizable investments into the Hadoop world. This investment is necessary as the market interest in Hadoop increases and as the attention of their customers shifts towards solving the types of problems that Hadoop is well-equipped to solve (i.e. unstructured or semi-structured data, large data of unknown value and usage). IBM created its own Hadoop distribution in 2011 as a part of its IBM BigInsights analytical offering and has since built out a number of proprietary tools and technologies to work on top of Hadoop. EMC-spinoff Pivotal created a comparable offering in its Pivotal HD product line. Others like Microsoft, HP, Oracle and Teradata have partnership arrangements with Cloudera, Hortonworks and MapR to bundle, resell or redistribute their Hadoop distributions.

This creates an interesting tension for enterprise software vendors as they need to rationalize and position Hadoop alongside their existing offerings.
With the notable exception of Pivotal, the majority of these firms position Hadoop as a complementary component within a larger big data platform, rather than a platform itself (Table 14).

Vendor: Sample External Positioning of Hadoop
IBM: “New data management and analytic technologies are being implemented to complement rather than replace traditional approaches to data management and analytics. Thus Apache Hadoop does not replace the data warehouse and NoSQL databases do not replace transactional relational databases” [78]
SAP: “SAP customers can incorporate enterprise Hadoop as a complement within a data architecture that includes SAP HANA and SAP BusinessObjects enabling a broad range of new analytic application” [79]
Oracle: “New big data technologies, such as Hadoop and Oracle NoSQL database, run alongside your Oracle data warehouse to deliver business value and address your big data requirements” [80]
Teradata: “Teradata Unified Data Architecture is the only truly integrated analytics solution that unifies multiple technologies into a cohesive and transparent architecture that leverages the best-of-breed complementary value of Teradata, Teradata Aster and open source Hadoop” [81]
Microsoft: “The Microsoft Analytics Platform System is a no-compromise modern data warehouse solution that seamlessly combines a best-in-class relational database management system, in-memory technologies, Hadoop, and cloud integration in a turnkey package built for Big Data Analytics” [82]

Table 14 – Sample Hadoop positioning statements by Enterprise Software vendors

In addition to being some of their most formidable inter-network and intra-network platform competitors, enterprise software vendors are also some of the most valuable partners for pure play Hadoop vendors. Mega vendors such as IBM, SAP and Oracle are frequently the providers of the products and services that are critical complements for Hadoop.
In fact, every enterprise software vendor listed above has a partnership arrangement with at least one of MapR, Cloudera or Hortonworks (Table 15).

Vendor: Cloudera Hortonworks MapR
IBM: X X X
SAP: X X X
Oracle: X X
Teradata: X X X
Microsoft: X X X

Table 15 - Partnership matrix between pure play vendors and enterprise software vendors [83]–[85]

Of course, enterprise software vendors are not the only providers of key complements for the Hadoop platform. “Hadoop in the Cloud” providers as well as independent software vendors (ISVs) creating “Hadoop Accessories” also play a critical role in completing the ecosystem. In the former category, two vendors are especially worth highlighting.

According to the 2014 version of Gartner's Magic Quadrant for Cloud Infrastructure as a Service, Amazon is the leading provider of cloud Infrastructure-as-a-Service, leading its only competitor (Microsoft) within the “leaders” quadrant by a significant margin in both “Completeness of Vision” and “Ability to Execute” [86]. Amazon built upon this leadership position to establish a significant presence in the Hadoop market with its Hadoop-as-a-Service (HaaS) offering, Elastic MapReduce (EMR). From a technical perspective, EMR differs substantially from the canonical Apache-based Hadoop stack in that key components (e.g. the distributed storage layer) are substituted with Amazon’s proprietary web services (e.g. Amazon S3). According to a study by Accenture in 2013, cloud-delivered Hadoop services are superior to on-premise “bare-metal” deployments in price, performance and flexibility [87]. As a result of these advantages, adoption of cloud-delivered Hadoop services, and consequently Amazon’s influence over the Hadoop market, is expected to grow. The other “Hadoop in the Cloud” vendor worth mentioning is Berkeley startup Databricks.
Databricks offers a cloud-delivered version of its Hadoop platform variant, featuring the open source Apache Spark technology. As discussed in the architectural overview section on distributed processing frameworks, key Hadoop players such as Cloudera, MapR and IBM have embraced Apache Spark as a likely successor for MapReduce and the framework is rapidly becoming the standard execution engine for Hadoop complements. At present, the firm’s only product offering (Databricks Cloud) is a nascent and nominal entry in the emerging Hadoop-as-a-Service market. However, beyond employing Spark creator Matei Zaharia as its CTO, Databricks also employs 30% of all approved “committers” for Apache Spark, the most of any organization (Figure 21). This unique competency affords Databricks a disproportionately large reach and influence over the Hadoop ecosystem relative to its scale and size.

(Chart: “Committers to Apache Spark”; organizations shown: Databricks, UC Berkeley, Yahoo!, Quantifind, Mxit, ClearStory Data, Groupon, National University of Singapore, Webtrends, Bizo, Alibaba, Imaginea, Pramati.)

Figure 21 - Official Committers to Apache Spark by Organization [88]

Strategic Factors affecting Platform Leadership within the Hadoop Ecosystem

In the following section, an analysis of the critical factors impacting a firm’s ability to direct the trajectory of the Hadoop ecosystem is presented using the framework outlined in Chapter 2. This analysis will focus on the three “pure play” Hadoop vendors as their long term success is most directly affected by their ability to harness the growth of the Hadoop platform. It is worth reiterating at this point that the objective of this thesis is not to assess which Hadoop firm is most likely to succeed. Instead, this analysis is intended to identify the firms’ assessments of the market forces and how their assessments affect their strategies.
Given the rapid changes that are occurring daily within the Hadoop ecosystem, it is entirely possible that a firm’s perspective of the market will have changed by the time this thesis is published or consumed, rendering the detailed analysis content outdated. However, the observed behavior of each of the firms is consistent with its perspective at the time of writing, so the analysis is nevertheless useful for illustrating how a firm’s assessment of these market forces materially affects its strategies.

Rivalry - Inter-network vs. Intra-network Competition

One of the fundamental questions that a pure play vendor must answer for itself is whether its primary competitors are other Hadoop providers (intra-network competition) or alternative platforms (inter-network competition) such as those offered by Teradata, Oracle or IBM. This assessment affects all aspects of the business’s strategy, including how it positions its products to the market, the technical areas within the platform that it chooses to invest in, the partnerships that it chooses to pursue and its interactions with the open source community. Current market share leaders Cloudera and Hortonworks appear to differ in their assessments of which competitive battle is most critical to long-term success.

In October 2013, Cloudera began marketing its commercial Hadoop distribution as an “Enterprise Data Hub” and began articulating a vision that describes its Hadoop-based platform as a “unified data management platform” capable of addressing all data management needs of the enterprise [89]. The firm posits that Hadoop’s superior cost-effectiveness and flexibility make it the natural “center of data centers as the first place data goes when it enters the enterprise, rather than at the side of the data center to solve a few, ancillary problems” [90].
While Cloudera has since been careful to clarify that it does not intend to position its product as an immediate alternative to specialized solutions like the traditional Enterprise Data Warehouses that its large partners offer, it has also been clear on its perspective that “workloads that belong in high-end enterprise data warehousing systems today, won’t in the future – and even high-performance, interactive analytic workloads will run in Hadoop” [91]. Cloudera describes its “Enterprise Data Hub” distribution of Hadoop as a complete data management platform for companies with a multitude of data management needs, and not as a point solution used to fill the gaps left by traditional relational technologies. In early 2014, Cloudera’s director of marketing Alan Saldich was quoted as saying that Cloudera has “many, many customers that are substituting an enterprise data hub built on Hadoop for incremental purchases of a whole range of data management infrastructure, including relational databases, enterprise data warehouses, storage, and mainframes”. In the same article, he also asserted that Cloudera’s customers are not comparing its product to alternatives from Hadoop vendors like Hortonworks, but rather to solutions from IBM or Teradata [92]. In other words, Cloudera’s ambitions are not to become the leading provider of Hadoop distributions but to become a leading provider of data platforms, Hadoop or otherwise (inter-network competition).

Implications on Lever 3 (External Relationships)

Cloudera’s view of Hadoop as a platform that can eventually displace traditional data platforms like the relational database or the enterprise data warehouse clearly differs from the public position of its rival Hortonworks.
Rob Bearden, CEO of Hortonworks, describes Hadoop as a “rock solid” platform for processing unstructured data and has expressed no desire to “reinvent the wheel” and compete with relational database vendors such as IBM, Oracle or Teradata. Bearden and Hortonworks espouse a “coexistence” view of Hadoop that more narrowly positions the technology as a complement to traditional data management technologies. Hortonworks has correspondingly invested in having its distribution “adopted and integrated seamlessly into [these] environments”. Bearden argues that Hadoop can extend traditional data platforms such as Teradata and “let it manage a much bigger data set, a 10 to 20 times bigger data set and have Hadoop as an extension of its architecture” [93]. In other words, Hortonworks does not see a need to position Hadoop as a viable alternative to traditional data management platforms but rather focuses on competing effectively against intra-network Hadoop competitors by offering superior integration into traditional data management environments.

The two firms’ differing focus on inter-network versus intra-network competition affects the ease with which they are able to engage with other members within the ecosystem. In some respects, Hortonworks’ positioning of Hadoop as a complement of existing data platforms is more compatible with the perspectives of larger enterprise software vendors and has likely assisted the firm in striking lucrative reseller arrangements with some of these giants. HP, Microsoft, SAP and Teradata all have “strategic reseller” arrangements with Hortonworks [91]. In 2013, Hortonworks’ arrangement with Microsoft was responsible for over 55% of Hortonworks’ total revenue [76] and the Redmond giant actually embeds a variant of the Hortonworks Data Platform in its HDInsight offering on its Azure cloud platform.
Similarly, the Teradata Portfolio of Hadoop redistributes a variant of Hortonworks’ HDP branded as “Teradata Open Distribution for Hadoop” [94]. One could argue that Cloudera’s relationships with these enterprise vendors are less intense. For example, while Cloudera recently announced a significant go-to-market partnership with Teradata (as of 2014), Teradata still does not pay Cloudera for its technologies in the way it pays Hortonworks; Teradata trails only Microsoft and Yahoo! in terms of contribution to Hortonworks’ top line [76]. While Cloudera’s distribution is supported and has been sold by HP as part of its AppSystem solution since 2012, HP made a $50 million USD investment in Hortonworks in 2014 that appears to reflect its partnership preference [95]. However, some industry analysts have observed that enterprise software vendors are finding that it is advantageous “to be polygamous in their relationships with Hadoop distro providers” [96]. Consequently, it is unclear if Hortonworks’ positional advantage in collaborating with enterprise technology vendors is material or sustainable.

Implications on Lever 1 (Scope of the Firm)

While Cloudera’s intention of competing with platforms beyond Hadoop may have impeded its collaboration with enterprise partners, this ambition appears to provide the firm with a vision for the future of the Hadoop platform that has moved the platform forward. This vision has also affected the firm’s “Scope of the Firm” (Lever 1) decision making. One example of this is Cloudera’s decision to invest in the creation of a “Fast SQL” engine for Hadoop called Impala. In a private interview completed for this thesis, the firm’s chairman Mike Olson shared that it was obvious “that many [existing proprietary SQL engines] would eventually be ported to Hadoop, because Hadoop matters” [97].
Given such a perspective, a firm focused on competing effectively with other Hadoop vendors would likely choose to partner with an incumbent SQL vendor, rather than build yet another competitor in the crowded space and compete with its own SQL entry. However, as SQL was a crucial interface that connects a significant number of pre-existing complementary applications, Cloudera believed that it was crucial for a fast SQL engine to be part of Hadoop’s “open core”, rather than be a proprietary component external to the core platform. As a result, the firm invested in developing Impala as a Cloudera-governed open source project. The firm believed that the inclusion of Impala as part of Cloudera’s distribution of Hadoop was necessary to bolster its competitiveness against traditional data management solutions. Cloudera’s decision reset expectations of what should be available “out-of-the-box” within a Hadoop distribution and sparked the development of additional fast-SQL-in-Hadoop projects like Apache Stinger, which have further bolstered the viability of the Hadoop platform. Olson cites Cloudera’s integration of Apache Solr-based search and its embrace of Apache Spark as other examples of the firm’s continued thought leadership in making Hadoop the leading big data platform. One can argue that Cloudera’s focus on competing effectively against alternative platforms has caused it to push the boundaries of the Hadoop platform extent, benefiting the growth of the ecosystem as a whole.

While Hortonworks and MapR have also developed a number of significant new technologies that improve Hadoop’s viability, they have been less aggressive in changing the platform extent of Hadoop by introducing new capabilities to the platform. Rather, the firms have focused their efforts on offering improved implementations of capabilities that were already available in the market.
This differing investment philosophy reflects the focus of the pair on competing effectively within the Hadoop ecosystem rather than competing beyond it. For example, while Hortonworks has been responsible for the engineering horsepower behind substantial projects such as Ambari and Stinger, these projects were not breaking new ground by offering new functionality to the Hadoop market, but were rather Apache-community implementations of capabilities that were already available to the ecosystem at large through proprietary extensions developed by companies such as Pivotal and Cloudera. Similarly, MapR has focused its proprietary engineering effort on offering superior implementations of core components such as the distributed file system, as it strives to become the most enterprise-ready distribution of Hadoop available. While this allows MapR to differentiate itself from the likes of Cloudera and Hortonworks, its investments in this area also do not introduce new platform functionality that equips Hadoop to compete against alternative platforms.

As the largest pure play vendor in the Hadoop market and the “incumbent leader” due to its first-mover advantage, Cloudera is unlikely to find its growth at the expense of its intra-network rivals [77]. It is unsurprising that its search for growth has led it to look to the broader data management market and engage in inter-network competition. Conversely, MapR and Hortonworks recognize that maximizing the growth potential of the Hadoop market requires them to capture a greater portion of the market. Consequently, it is also natural for these firms to focus on intra-network competition. Generalizing from this case study, one can infer that a platform firm’s focus on inter-network vs. intra-network competition is heavily influenced by the extent to which it is currently positioned to capture the growth in the platform.
If a firm is already positioned to capture the majority of growth in a given platform, it will focus on inter-network competition and winning against alternative platforms. Firms that are not the primary beneficiaries of a platform’s growth will instead focus on growing their market share and on intra-network competition.

Suppliers - Securing the Upstream Value Chain

As briefly mentioned in the market overview section, Intel made a substantial investment of $740 million USD in Cloudera at the beginning of 2014, acquiring 18% of the company [98]. This investment was noteworthy not only in terms of its magnitude, but also because it represented a rapid and surprising shift in Intel’s approach to the Hadoop market. Only a year earlier, Intel had entered the Hadoop market by developing and bringing to market its own Intel Distribution of Hadoop, which was optimized for its microprocessors [99]. In a private interview conducted for the purpose of this thesis, Intel’s Big Data GM Ron Kasabian explained that Intel’s initial motivation for getting into the Hadoop market stemmed from a desire to accelerate the adoption of Hadoop in the enterprise. It believed it could do so by introducing the “enterprise-hardening” features to the platform that Intel believed were critical for mass adoption amongst enterprises. Intel understood the challenges and opportunities of Big Data itself as it faced an explosion of data in its own operations; Kasabian shared that Intel’s own factories generate as much as “five terabytes of data every hour”. With an internal estimate of 94% market share in the datacenter microprocessor market, Intel believed that it would be one of the primary beneficiaries of mass enterprise adoption of the computationally intensive Hadoop technology.
At the time of Intel’s investment into the Hadoop space in early 2011, there were no industry leaders within the Hadoop ecosystem that Intel felt were equipped to drive adoption of Hadoop within the enterprise. Intel decided to invest in the technology, initially believing not only that it could accelerate the adoption of Hadoop, but also that it could become the market leader in the space given the firm’s unique complementary assets. The firm believed that by optimizing Hadoop for its own microprocessors, it could not only reinforce its dominant position in the growing Hadoop market, but also compete effectively with other Hadoop vendors on the virtues of superior performance.

Despite some initial market success, particularly within China where Intel’s distribution of Hadoop was number one in market share, Intel decided that its Hadoop objectives were better met by investing in Cloudera rather than continuing with its own distribution. Beyond taking an 18% stake in the pure play vendor, Intel also agreed to cease development of its Intel Distribution and have its engineers bring its optimizations into Cloudera’s distribution.

Relation to Lever 2 (Product Technology)

According to Kasabian, one of the reasons that Intel decided to abandon its own distribution in favor of partnering with a Hadoop pure play vendor is that it wanted to drive its optimizations back into the core of the Apache-governed projects in order to affect the ecosystem in the manner it desired. Although it had a number of Apache-approved committers on staff, Intel had far fewer committers than either Hortonworks or Cloudera. By partnering with a pure play Hadoop vendor, Intel was much more likely to get its patches contributed back into the Apache core projects and adopted by the broader community, including other Hadoop vendors and their customers.
(Chart: “# of Committers to Hadoop-related Apache Projects by Company”; stacked series: Zookeeper, Tez, Spark, Pig, Hive, HBase, Hadoop, Accumulo; vertical axis 0 to 90.)

Figure 22 – Hadoop Contributors by Organization. Hortonworks and Cloudera employ the most Hadoop committers out of all Hadoop vendors; Yahoo and Facebook are Hadoop users and not vendors. Data extracted and analyzed from various project websites at www.apache.org.

Intel’s assessment reflects the criticality of securing unique access to the critical resources (i.e. influential members of the open source community) in ensuring that a firm is able to influence the architectural trajectory of an open source platform. While the majority of open source projects operate as meritocracies with a distributed center of authority, the truth is that a small subset of contributors (“committers” in the vernacular of the Apache Foundation) are responsible for the majority of technical decisions within a project at any given time. This core group also tends to remain relatively stable for a given project. If a firm is to be a leader of an open source platform, it must have access to such individuals for key projects; this access is a prerequisite for wielding “Lever 2” (product technology) of platform leadership.

Beyond gaining access to the committers that affect technical decision making, a software firm must also staff itself with individuals who deeply understand the technology used by its customers. Moreover, the firm must convince the market at large that it has done so. In a market such as enterprise software, the perceived pedigree of a firm’s engineering staff can be a major consideration in the deliberate and scrutinized purchasing process. As a consequence, one can argue that the competitive advantage of employing technical contributors in the community stems as much from its marketing value as from the actual engineering capability gained.
Being active and visible in the open source community is one way for open source firms to convince customers of their competency in a particular open source technology. Cloudera and MapR have engaged in very public debates regarding which firm has contributed more to the open source development of Hadoop for this reason [100], [101]. In fact, when Hortonworks was spun out of Yahoo! in 2011, one of its primary marketing messages was that, by virtue of its Yahoo! lineage, it employed more experienced Hadoop contributors than any other company and thus was the best equipped to support Hadoop in the wild [102]. Open source platform contenders in such markets also tend to employ highly visible community leaders as a part of their management team for a similar reason; MapR, Cloudera and Hortonworks all employ highly visible members of the Hadoop community as part of their senior leadership teams.

Collaboration with Open Source Community

Like all open source vendors, pure play Hadoop vendors face a number of options for managing new intellectual property. There is an expectation for open source software vendors to contribute some of their innovations back to the community. Such contributions can be strategically advantageous to the firm, as they can be a means of influencing the trajectory of the technology in the firm’s favor. However, it may also make sense for a firm to withhold an innovation for itself as a means of differentiating its platform variant. If a firm opts to develop in open source, it also needs to make a conscious decision about whether to do so under the governance of an independent community such as the Apache Software Foundation, or to drive the process itself.

Within the Hadoop market, Hortonworks is the only vendor that has publicly committed to maintaining a 100% open source development model.
The firm not only does all of its development in open source but also commits to developing “exclusively via the Apache Software Foundation process” [103]. While this decision has implications for the company’s business model (the company has no unique product to license and therefore can only be a strictly services company), it does help the firm win the minds and wallets of its customers in a number of ways. The company’s contributions and commitment to open development create considerable brand equity for Hortonworks within the Hadoop community itself, and this equity can translate into influence within the community and credibility with customers in the marketplace. The firm also heavily markets the danger of vendor lock-in that can occur with a Hadoop distribution that is not fully open and capitalizes on this message by publicizing its unique position as the only firm at scale to sell enterprise support for a truly open distribution [104].

Although developing purely in the Apache model allows Hortonworks to differentiate itself from the other vendors, it also means that the firm is limited with regard to how it can influence the ecosystem to its exclusive benefit. Despite the fact that Hortonworks is the plurality leader in terms of employed Apache committers in Hadoop-related projects, its workforce still represents only a fraction of the community. Therefore, Hortonworks cannot make technological decisions for the platform unilaterally. Moreover, while Hortonworks shares all of its technology with its competitors, they do not necessarily reciprocate. Consequently, Hortonworks may find itself occasionally trailing its competitors when it comes to the features and functions that are included with its platform variant.

Unlike Hortonworks, MapR and Cloudera engage in both proprietary and open source development. Both firms employ an “Open-Core” model, offering a for-profit commercial product by extending open source technologies with proprietary extensions.
Cloudera’s Mike Olson and MapR’s John Schroeder have both written public articles explaining why the development and possession of proprietary intellectual property are necessary for creating a sustainable business that their enterprise customers can rely on [3], [105].

Despite asserting that the possession of proprietary intellectual property is a necessity for sustained profitability, both Cloudera and MapR still engage in open source development for some of their new technology initiatives. This relates to the observation made at the beginning of this thesis, which is that establishing new proprietary platform standards has become tremendously difficult, and open source development is a tactic that can be deployed to accelerate industry adoption.

If having a standard implementation or sharing engineering resources across the entire ecosystem is in the interest of the individual vendors, then it is typically best for that project to be governed by an independent authority such as the Apache Software Foundation. In Hadoop, an example of such a technology would be the distributed processing framework itself. Processing frameworks such as Tez or Spark are simultaneously too complex for an individual firm to develop and too important to the ecosystem to be fragmented by proprietary development. The firms are better off collectively contributing to the improvement of this unifying platform component than risking a fracture of the ecosystem in the hopes of differentiation.

If a vendor believes that a given innovation is likely to help differentiate its platform variant, then it may prefer to keep the technology proprietary and its source closed. This allows the company to differentiate its solution and maintain control over its technology.
However, if this differentiating technology occurs in an interface component that sits between the platform and a type of complement, the firm may need to develop it in open source in order to encourage broader adoption. In such a scenario, the firm may opt to do so without involving an independent authority such as the Apache Software Foundation. This allows the firm to maintain maximum control over the project while avoiding the stigma of being a proprietary or closed technology. Of course, such a project structure is unlikely to benefit from the resource pooling of an independently governed project as competing platform vendors are unlikely to contribute. Cloudera’s Impala is a prime example of such a project. Figure 23 attempts to summarize the considerations for development model selection discussed above into a simple decision tree for a given area of innovation.

Figure 23 – Proposed decision-tree for selecting between a proprietary, sponsored open source and community governed open source model for a given innovation (original creation).

Complementors - Identifying and Securing Critical Complements

As mentioned in Chapter 2, both proprietary and open source platform contenders need to be proactive in managing critical platform complements. However, open source platform vendors face a unique challenge in that a community organization such as the Apache Software Foundation may act as a broker for connecting the platform to its complements. In the topology of Eisenmann, Parker and Van Alstyne, the Apache organization effectively acts as the platform sponsor. As a consequence of this, open source vendors cannot use some of the levers afforded to proprietary firms (such as developing exclusive interfaces with Lever 1) for securing unique complements to the open platform.
Instead, the firms must rely on alternative techniques, such as the development of partnership arrangements or the introduction of proprietary interface components to regain that leverage.

Partnership Programs

All three pure play Hadoop vendors boast robust partnership programs for key complements. For Hadoop, one primary type of complement that enhances the value of the platform is applications created by independent software vendors (ISVs) that specialize Hadoop to a specific market or function. As all three firms share a significant number of public interfaces governed by the Apache Software Foundation process, applications that work on one vendor’s Hadoop distribution tend to also work on another’s. However, in the Enterprise Software market, “officially supported” software and “technically compatible” components are hugely differentiated and most enterprise information technology departments are only willing to adopt the former class of software. Consequently, all three firms offer partnership or certification programs in order to assure potential customers that solutions created by independent software vendors are fully supported on their platforms.

Given the common architectural foundation and interfaces for the three vendors, it is difficult to curate a fully differentiated partner ecosystem on the basis of applications alone; a software vendor that has built software for Hadoop faces very low barriers for multi-homing across the different distributions. Table 20 of the appendix shows a composite matrix listing the independent software vendors and technology partners that Cloudera, Hortonworks and MapR claim on their respective company websites as of November 2014. The three firms not only boast an extremely similar number of such partners (Cloudera: 164, Hortonworks: 156 and MapR: 159), but they also have a substantial number of partners in common.
The majority of Hortonworks and Cloudera’s ISV partners have a relationship with at least one of the other pure play Hadoop vendors.

(Chart series: Non-exclusive Partnership, Exclusive Partnership; bars for Cloudera, Hortonworks and MapR; data labels as extracted: 76, 66, 100, 88, 90, 59.)

Figure 24 - Analysis of Exclusivity of Partnership Arrangements for Hadoop ISVs

Given that it is difficult to differentiate a given open source platform variant from its intra-network rivals based on the primary types of platform complements (i.e. applications), aspiring platform vendors may need to look to attract other types of complements for differentiation. Firms may form different opinions about which types of complements are most valuable beyond the primary complement types. In a private interview, Mike Olson shared that Cloudera actively pursued the Intel partnership because it recognized the unique competitive advantage offered by visibility into Intel’s roadmap and access to Intel’s unique engineering talent [97]. Olson believed that customers would greatly value the superior performance that an Intel-optimized Hadoop distribution would hypothetically offer. Ron Kasabian of Intel later corroborated this account, stating that the Cloudera leadership team appeared to appreciate the value proposition that Intel was bringing to the table more than its competitors did. This anecdote illustrates the different emphasis that each firm may place on different complement types.

Buyers - Controlling the Path to the Customer

Mediating the Purchasing Process of Complements

Enterprise software products are inherently complex systems. New products brought onto a customer’s landscape must integrate with numerous systems that already reside there. These existing systems are used, administrated and developed by different individuals and organizations using different technologies from different eras.
Consequently, implementations of enterprise software systems are often expensive and lengthy endeavors that may require millions of dollars and hundreds of man-years to complete. As a result of the significant investment that they represent, the purchasing process for enterprise software systems also tends to be long and complicated. Enterprises often enlist the help of multiple parties, including consultants like IBM or Accenture, to help them make the best possible decision when selecting their vendors.

The complexity of this purchasing process is simultaneously an opportunity and a challenge for Hadoop vendors. As mentioned in the corresponding section in Chapter 2, aspiring open source platform contenders can differentiate their platform by mediating the purchase process of complements for customers. Android platform providers attempt to do this by providing electronic application marketplaces that simplify the user-driven acquisition of complementary applications for their platforms. The platform providers’ involvement in complement delivery also provides them with a channel to influence and govern the behavior of complement creators. Unfortunately, due to the more elaborate purchasing process of enterprise software, Hadoop vendors cannot take complete ownership of the application purchasing process in the manner that mobile platform vendors have attempted to. However, Hadoop vendors still attempt to participate actively in that process in order to expedite it and to exert influence.

One specific way they attempt to do that is through the creation of partnership or certification programs for their platform. As mentioned in the previous section on securing critical complements, Cloudera, Hortonworks and MapR all offer programs to help assure the customer that a given complement provider’s product is compatible with their platform variants. However, certification programs are also intended to serve a few additional purposes.
The programs are also intended to simplify the application selection process by helping customers identify the complement vendors available to them. Cloudera describes this intent on its website in the following manner: “The Cloudera Certified Technology program is designed to make choosing the right technology easier. When you see the Cloudera Certified Technology logo, you can trust that the product has been tested and validated to work with CDH, our 100% open source and enterprise-ready distribution of Apache Hadoop and related projects.” [84]. To this end, each of the firms dedicates prominent sections of its company website to helping customers find potential complement vendors.

Beyond helping customers identify the right partners, the certification process also provides an opportunity for aspiring platform contenders to exert influence over the behavior of complement creators. For example, Hortonworks explains that technologies certified through its partnership program “are reviewed for architectural best practices”, while Cloudera states that it verifies that its partners “comply with Cloudera development guidelines for integration with Hadoop”. These review processes give the firms an opportunity to guide a partner organization towards integrating with platform components and interfaces in a manner favorable to them. For example, Hortonworks and Cloudera offer very different administrative environments for their platform variants, with Cloudera offering its proprietary Cloudera Enterprise Manager and Hortonworks offering a similar environment in Apache Ambari. Though neither firm imposes this today, it would be possible and reasonable for the firms to require that a complement provider integrate into their specific administration consoles in order to achieve certification.
As complement producers are often smaller vendors that depend on the endorsement of the platform providers to reach potential clients, the certification process acts as a powerful bargaining chip for the platform contenders to influence their activities. Even enterprise software vendors that exceed the pure play Hadoop vendors in scope and scale lack the expertise and credibility of the pure play vendors within the Hadoop market and may need to look to these certification programs to gain credibility with their customers. This offers the smaller pure play vendors additional leverage in bargaining with their more powerful competitors.

Beyond participating in the purchasing process of complements, Hadoop vendors also seek to influence the purchasing process for the platform itself by partnering with key stakeholders of that process. In enterprise software, major influencers in the purchasing process include system integrators and IT consultancies such as Accenture, Infosys and IBM Global Services, as well as full-stack mega vendors such as Oracle, IBM and SAP. It is not uncommon for some of these firms to become so embedded within the operations of a large enterprise that their endorsement and approval can determine whether or not a smaller vendor will be considered for a deal. All three of the existing Hadoop pure play vendors recognize this and have partnered with these firms as a means of ensuring that their platform variants are considered in the selection process. One may infer from the prior discussion of Hortonworks’ close product and reseller partnerships with enterprise mega vendors that it holds a distinct advantage over its competitors in this regard. However, due to the organizational separation between the product and services organizations that exists in most of these vendors, the impact of those partnerships on vendor selection appears to be minimal.
Vendor | Amazon | Google | Microsoft
Cloudera | N/A | N/A | CDH available via Azure Marketplace
Hortonworks | N/A | N/A | Directly integrated into HDInsight
MapR | Directly available as EMR option | Exclusive distribution on GCE | N/A

Table 16 - Partnerships between pure play Hadoop vendors and leading cloud IaaS vendors, sourced from company websites

One interesting type of partner for Hadoop platform vendors is the cloud Infrastructure-as-a-Service (IaaS) vendor, such as Amazon, Google and Microsoft. These three software giants each offer their own Big Data solution as a service-based offering (Elastic MapReduce for Amazon, BigQuery for Google and Azure HDInsight for Microsoft) that competes with the offerings of the pure play Hadoop vendors. The three software giants possess a significant go-to-market advantage over the pure play vendors, as they are able to offer both the software and the underlying hardware infrastructure in a single package, significantly simplifying the acquisition of a big data solution for customers. Pure play vendors have attempted to nullify this advantage by integrating with some of these cloud vendors; Table 16 enumerates some of the integrations that have been pursued. Of the three pure play vendors, Cloudera is arguably the least integrated into the offerings of these leading cloud vendors, with its only integration point being the availability of its solution via the solution marketplace offered by Microsoft’s Azure. According to some industry observers, this is the reason that the company has pursued and heavily marketed its partnerships with some of the smaller cloud operators [106].

Substitutes and New Entrants - The Threat of Shifting Platform Boundaries

In an August 2014 article for the Association for Computing Machinery (ACM), noted MIT adjunct professor of computing science and database luminary Michael Stonebraker observed that what exactly makes a Hadoop solution Hadoop is fairly ephemeral [107]. Stonebraker pointed out that the MapReduce-based distributed processing framework that had been synonymous with Hadoop has been abandoned by newer projects like Cloudera Impala; Impala uses its own optimized distributed processing engine, which accesses the Hadoop Distributed File System (HDFS) directly. Apache Spark, originally developed independently of the Hadoop ecosystem, is now embraced by the Hadoop community to such an extent that it joins the original Hadoop MapReduce framework and its successor (Tez) as a standard processing framework for Hadoop. As a consequence, Stonebraker observed that the only thing that seems to be a condition for a platform to be labelled as “Hadoop” is the usage of HDFS as the storage and persistence layer at the bottom of the technology stack.

[Figure 25 - Hadoop in 2011 vs. 2014: A “Hadoop” deployment in 2011 always contained the components that were considered part of the ‘core’ Hadoop platform, and likely a component that was a part of what was considered the extended platform. A deployment of Hadoop in 2014 may not include any components of either sort.]

Though the focus of his article was on another topic, Stonebraker was pointing to the continuous and substantial shifts in the platform boundaries of Hadoop. At Hadoop’s origin back in 2007, the MapReduce framework and HDFS were clearly defined as “core” to the Hadoop platform. Subsequent efforts like Apache Hive built upon this core and were so useful and ubiquitous that they were effectively considered a part of the extended platform.
Subsequent requirements and new technologies have emerged to displace components that were part of even this original platform.

In fact, the shifts have been more substantial, and the definition of “Hadoop” murkier, than even Stonebraker posited. The usage of HDFS cannot be relied upon as a condition for defining what constitutes a “Hadoop” distribution; the MapR distributions of Hadoop do not use HDFS at all but rather the MapR Distributed File System mentioned earlier. Consequently, a MapR customer that uses only Impala in their “Hadoop” implementation in 2014 will not use any major components that would have been considered “core” to Hadoop only a few years earlier (Figure 25). Of course, this raises the question: what exactly is the “Hadoop Platform” if it is not defined by any specific technology or major component?

[Figure 26 - Displacing Core Components: The fact that platform-internal APIs are well documented in open source systems allows core components to be substituted (HDFS to MapR NFS). Dependent components (Hive to Shark) can also be forked and adapted in the case where clean substitution is not possible (MapReduce to Spark).]

Given that Hadoop is an industry platform mediating the Hadoop ecosystem, one answer may be that the “Hadoop platform is a collection of technologies that binds together the Hadoop ecosystem”. While this definition seems tautological on the surface, it actually addresses the puzzle at hand. While the MapReduce engine and HDFS encapsulated the original value-generating intellectual property that motivated the genesis of the ecosystem, they are not the technologies that bind the ecosystem together. That responsibility lies with the relatively simpler interfaces and interfacing subsystems that allow these components to be connected to one another.
By this logic, any technology stack that provides these interfaces ought to be considered a “Hadoop Platform” provider, even if the technology does not share the same lineage or development leaders as the others. For example, MapR’s product is considered a Hadoop distribution because the company provided a proprietary alternative to HDFS that was both API and wire protocol compatible with the HDFS components. It is worth noting that, while this would have been theoretically possible in a proprietary platform as well, this type of low-level component substitution is extremely unlikely to happen there, as the interfaces between internal platform subsystems would not have been documented or easily substitutable.

Interestingly, even in cases where a new component cannot cleanly fulfil the interfaces of a core platform component, the open source nature of the platform means that dependent components can be forked and adapted by motivated parties. For example, although Apache Spark could not provide exactly the same Mapper and Reducer APIs that the original MapReduce engine provided for its clients, the Spark team was able to modify popular dependent components like Hive to run on top of its engine (Figure 26). The ability to fork the Hive source code allowed the Spark team to create a component (Shark) that offered the same client interface as Hive (HQL) and maintained compatibility with dependent products and applications. The ability to fork and adapt existing components also allows inter-network competitors to hijack key platform components and complements for their competing platforms. The MapReduce engine, as well as Hive, Spark and Shark, have been forked and ported to alternative data management platforms such as Apache Cassandra by inter-network competitors such as Datastax [108].
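The substitution argument above can be illustrated with a small sketch: a dependent component written only against an interface keeps working when the implementation behind that interface is swapped. The names `FileSystem`, `ReferenceStore`, `AlternativeStore` and `word_count` are hypothetical and chosen for illustration; they do not correspond to the actual HDFS, MapR or Hadoop client APIs.

```python
from typing import Protocol


class FileSystem(Protocol):
    """Hypothetical storage interface standing in for a documented platform API."""

    def write(self, path: str, data: bytes) -> None: ...

    def read(self, path: str) -> bytes: ...


class ReferenceStore:
    """Stand-in for the original open source storage component."""

    def __init__(self):
        self._blocks = {}

    def write(self, path, data):
        self._blocks[path] = data

    def read(self, path):
        return self._blocks[path]


class AlternativeStore:
    """Stand-in for a substitute with a different internal design."""

    def __init__(self):
        self._log = []

    def write(self, path, data):
        self._log.append((path, data))

    def read(self, path):
        # The latest write wins, so scan the append-only log from the end.
        for logged_path, data in reversed(self._log):
            if logged_path == path:
                return data
        raise KeyError(path)


def word_count(fs: FileSystem, path: str) -> int:
    """A dependent component: it relies on the interface, not the implementation."""
    return len(fs.read(path).split())


for store in (ReferenceStore(), AlternativeStore()):
    store.write("/data/sample.txt", b"open source platform leadership")
    print(type(store).__name__, word_count(store, "/data/sample.txt"))
```

Because `word_count` never inspects which class it receives, either store can sit underneath it. This is the property that, per the discussion above, lets an alternative component be treated as part of the same platform once it honors the published interfaces.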

The availability of internal interfaces and implementation source code has allowed Hadoop to evolve rapidly. However, it also represents a challenge for commercial vendors attempting to influence the platform’s trajectory. Neither of the technology substitutions presented above was made with the approval of a clear platform sponsor. The Apache Software Foundation, typically viewed as the “platform sponsor” for Hadoop, did not make a conscious decision that Spark or Shark were worth developing and adopting; Spark became an Apache project after its first release and initial adoption. Ultimately, what determined which technologies would be considered part of the Hadoop platform was the market at large and the ecosystem as a whole. Technologies that were sufficiently compatible with existing ecosystem solutions and offered a sufficiently compelling value proposition were eventually adopted broadly across the entire market, as intra-network competitive forces drive platform vendors to adopt the best-in-class technology.

This meritocratic nature of platform governance in Hadoop also creates tremendous opportunities for new entrants to enter the market. As mentioned in the Hadoop market overview, the immense popularity of Apache Spark has allowed commercial vendor Databricks (founded and led by Spark’s creators) to gain tremendous influence over the Hadoop ecosystem despite its late entry and limited scale of operation. In a closed-source ecosystem, it would have been exceedingly difficult for such a small firm to penetrate what had already become a very large ecosystem with leaders operating at scale.

This example also suggests that open source platform leadership is generally less stable than conventional platform leadership.
The fact that technology can be adopted and displaced at the whims of the market means that the core technical competencies that a platform leader has built up can quickly be invalidated if a better mousetrap emerges. If Apache Spark continues on its current trajectory as the de facto distributed processing framework of the Hadoop ecosystem, then the value of Hortonworks’ technical expertise in the MapReduce and Apache Tez technologies could diminish drastically.

As a result of all of this, an aspiring open source platform leader cannot rely solely on the gravitational pull of platform complements to maintain its leadership position, as the platform and the ecosystem can be ‘hijacked’ in the manners described above. It must stay on top of new technologies and be ready to embrace them in order to maintain its understanding of the ecosystem. Platform vendors vying to build their businesses upon open source technologies must remain open to technological change, as the market will largely determine the platform standards for the ecosystem independent of them.

Chapter 4 - Conclusion

In his book “Only the Paranoid Survive”, Andrew Grove wrote the following in reference to technological changes:

“ARE SUCH DEVELOPMENTS A CONSTRUCTIVE OR A DESTRUCTIVE FORCE? IN MY VIEW, THEY ARE BOTH. AND THEY ARE INEVITABLE. IN TECHNOLOGY, WHATEVER CAN BE DONE WILL BE DONE. WE CAN’T STOP THESE CHANGES. WE CAN’T HIDE FROM THEM. INSTEAD, WE MUST FOCUS ON GETTING READY FOR THEM.” [4]

This quote certainly seems to apply to the world of open source software. While open source platforms benefit from the same network effects that proprietary platforms enjoy, the ability of a single firm to harness and direct that growth is greatly hindered by the increased pace of technological change afforded by the open intellectual property.
Aspiring platform contenders in the open source world cannot rely on their exclusive possession of key platform technology to direct the behavior of complementors. Instead, they must assess and increase their influence over the forces that affect the market in order to shape the movement of the ecosystem as a whole.

Given all the variables that are beyond the control of an open source platform contender, perhaps the image of a “platform leader” as a powerful orchestrator that directs an ecosystem by manipulating the levers of its platform empire is an inappropriate one. With fewer levers of power at its disposal, an open source platform leader is perhaps more like a politician in a modern democracy, leading through influence and relationship building rather than power and authority. Moreover, continued possession of a leadership position greatly depends on a firm’s ability to survey the sentiments of its constituents and adjust accordingly. Open source platform contenders must clearly assess whether or not their primary rivals sit within the same ecosystem (i.e. network) and then look up, down and across the value network to ensure unique or superior access to the suppliers, buyers and partners that make up the market. Beyond this, such vendors must stay abreast of technological changes or risk having all their efforts quickly invalidated by the emergence of new technologies that reshape the market landscape.

While this image of an open source platform leader as a politician is arguably less awe-inspiring than the previous image of a powerful empire leader, it is perhaps a more prevalent and relevant depiction of platform leadership in the current software market landscape.
With few exceptions, the open source model is becoming the preferred approach for establishing new platforms, in the same way that most modern modes of governance are democratic and not authoritarian. Software vendors that seek to operate within platform markets must accept this new reality and adapt appropriately.

Areas of Further Research

Much of this thesis drew upon the case studies of two open source platforms: Google’s Android and Apache Hadoop. These two platforms were selected for their relevance to the consumer and enterprise software markets, as well as the difference in structure and origins between them. The Android case study illustrates how a firm may choose to establish an open source model for its own technology under its own terms, and yet still be subject to the fluid nature of open source platform dynamics. The Hadoop case study illustrates how firms can work to establish leadership positions for an external platform technology made available through the open source community and still extract enormous value. Despite the purposeful selection of the two case studies, the fact that this thesis studied only two platforms is a limitation, and further research analyzing other platforms could identify additional factors, tactics and strategies relevant to open source platform leadership.

While this thesis was written with an understanding of the different business models that are available to open source ecosystems, the analysis did not systematically consider the impact of these different business models on the firms’ behaviors with regard to platform strategy. Moreover, the findings of this thesis were descriptive in nature and would be complemented by further work to establish a prescriptive framework for managing open source platform leadership. Systematic consideration of the business model would likely be required for such an effort.
Relatedly, both case studies of the thesis focused on platform providers even though platform leadership is equally applicable to platform sponsors and users. Given that some of the largest technology companies in the world are open source platform users (and not providers), case studies focusing on strategies employed by companies acting in these other platform roles would be beneficial.

Appendix

Table 17 - Committers to Apache Spark; extracted from https://cwiki.apache.org/ on Oct 1st, 2014

Name | Organization
Andrew Xia | Alibaba
Stephen Haberman | Bizo
Mark Hamstra | ClearStory Data
Aaron Davidson | Databricks
Andrew Or | Databricks
Andy Konwinski | Databricks
Josh Rosen | Databricks
Matei Zaharia | Databricks
Michael Armbrust | Databricks
Patrick Wendell | Databricks
Reynold Xin | Databricks
Tathagata Das | Databricks
Xiangrui Meng | Databricks
Thomas Dudziak | Groupon
Prashant Sharma | Imaginea, Pramati, Databricks
Jason Dai | Intel
Nick Pentreath | Mxit
Shane Huang | National University of Singapore
Imran Rashid | Quantifind
Ryan LeCompte | Quantifind
Ankur Dave | UC Berkeley
Charles Reiss | UC Berkeley
Haoyuan Li | UC Berkeley
Joseph Gonzalez | UC Berkeley
Kay Ousterhout | UC Berkeley
Mosharaf Chowdhury | UC Berkeley
Shivaram Venkataraman | UC Berkeley
Sean McNamara | Webtrends
Mridul Muralidharam | Yahoo!
Ram Sriharsha | Yahoo!
Robert Evans | Yahoo!
Thomas Graves | Yahoo!
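
The concentration of Spark committers among a few organizations can be read off Table 17 with a simple tally. The sketch below is an illustration rather than part of the original analysis: the counts are transcribed from the Organization column of Table 17, and treating the combined entry “Imaginea, Pramati, Databricks” as its own label rather than crediting it to Databricks is an assumption made here.

```python
from collections import Counter

# (organization, number of committers), transcribed from Table 17.
# Assumption: the combined "Imaginea, Pramati, Databricks" affiliation is
# kept as a separate label instead of being added to Databricks.
table_17 = [
    ("Alibaba", 1), ("Bizo", 1), ("ClearStory Data", 1), ("Databricks", 10),
    ("Groupon", 1), ("Imaginea, Pramati, Databricks", 1), ("Intel", 1),
    ("Mxit", 1), ("National University of Singapore", 1), ("Quantifind", 2),
    ("UC Berkeley", 7), ("Webtrends", 1), ("Yahoo!", 4),
]

counts = Counter(dict(table_17))
total = sum(counts.values())

# Show the three organizations with the most committers and their share.
for org, n in counts.most_common(3):
    print(f"{org}: {n} of {total} committers ({n / total:.0%})")
```

Under these assumptions, the three largest affiliations account for 21 of the 32 listed committers, which is consistent with the influence attributed to Spark’s creators in Chapter 3.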

Table 18 - Top 10 Big Data Vendors by Revenue according to Wikibon.org [77]

2013 Worldwide Big Data Revenue by Vendor ($US millions)

Vendor | Big Data Revenue | Total Revenue | Big Data Revenue as % of Total Revenue | % Big Data Hardware Revenue | % Big Data Software Revenue | % Big Data Services Revenue
IBM | $1,368 | $99,751 | 1% | 31% | 27% | 42%
SAP | $545 | $22,900 | 2% | 0% | 76% | 24%
HP | $869 | $114,100 | 1% | 42% | 14% | 44%
Oracle | $491 | $37,552 | 1% | 28% | 37% | 36%
Teradata | $518 | $2,665 | 19% | 36% | 30% | 34%
Microsoft | $280 | $83,200 | 0% | 0% | 63% | 37%
Pivotal | $300 | $300 | 100% | 15% | 50% | 35%
Cloudera | $73 | $73 | 100% | 0% | 53% | 47%
Hortonworks | $55 | $55 | 100% | 0% | 73% | 27%
MapR | $35 | $35 | 100% | 0% | 77% | 23%
Total | $18,607 | n/a | n/a | 38% | 22% | 40%

Table 19 - Hadoop-related Apache committers by project and organization; extracted from www.apache.org on May 1st, 2014

Project Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo PMC / Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Name Pushpinder Heer Mike Fagan Arshak Navruzyan Ed Kohlwey Andrew George Wells Alex Moundalexis Hung Pham Jessica Seastrom Jeff Field Jonathan M. Hsieh Ryan Fishel Vikram Srivastava Aaron Glahe Christian Rohling Ravi Mutyala Steve Loughran Ted Yu Jared Winick Laura Peaslee Jim Klucar Organization Applied Physics Laboratory Applied Technical Systems Arcus Research Argyle Data Booz Allen Hamilton ClearEdgeIT Cloudera Cloudera Cloudera Cloudera Cloudera Cloudera Cloudera Data Tatics Endgame Hortonworks Hortonworks Hortonworks Koverse Objective Solutions, Inc.
Splyt Appendix Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Accumulo Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Chris McCubbin Jonathan Park Luke Brassard Michael Allen Michael Berman Oren Falkowitz Phil Eberhardt Miguel Pereira Damon Brown Kevin Faro Dennis Patrone Al Krinker Chris Bennight David M. Lyle Ed Coleman Edward Yoon Jason Then Jay Shipper Jesse Yates Joe Skora John Stoneham Matthew Kirkley Michael Wall Morgan Haskel Nguessan Kouame Philip Young Ryan Leary Sapah Shah Scott Kuehn Sean Hickey Supun Kamburugamuva Tim Halloran Tim Reardon Travis Pinney Ravi Prakash Aaron T. Myers Colin Patrick McCabe Doug Cutting Eli Collins Harsh J Karthik Kambatla 103 sqrrl sqrrl sqrrl sqrrl sqrrl sqrrl sqrrl SRA International, Inc Tetra Concepts LLC Tetra Concepts LLC The Johns Hopkins University Altiscale, Inc. 
Cloudera Cloudera Cloudera Cloudera Cloudera Cloudera Platform Leadership in Open Source Software Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Sandy Ryza Todd Lipcon Tom White Alejandro Abdelnur Andrew Wang Mayank Bansal Dhruba Borthakur Hairong Kuang Dmytro Molkov Scott Chun-Yang Chen Zheng Shao Andrzej Bialecki Arun C Murthy Arpit Agarwal Arpit Gupta Bikas Saha Brandon Li Chris Nauroth Devaraj Das Enis Soztutar Giridharan Kesavan Hitesh Shah Jian He Jing Zhao Jitendra Nath Pandey Mahadev Konar Matthew Foley Owen O'Malley Ramya Sunil Sanjay Radia Siddharth Seth Steve Loughran Suresh Srinivas Tsz Wo (Nicholas) Sze Vinod Kumar Vavilapalli Haohui Mai Xuan Gong Zhijie Shen Vinayakumar B Eric Yang Kan Zhang 104 Cloudera Cloudera Cloudera Cloudera Cloudera ebay Facebook Facebook Facebook Facebook Facebook Getopt Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Huawei IBM IBM Appendix Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop Hadoop 
Hadoop Hadoop Hadoop Hadoop Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Nigel Daley Amareshwari Sriramadasu Sharad Agarwal Sreekanth Ramakrishnan Christophe Taton Devaraj K Uma Maheswara Rao G Allen Wittenauer Boris Shkolnik Jakob Homan Lohit Vijayarenu Chris Douglas Ivan Mitic Roman Shaposhnik Johan Oskarsson Raghu Angadi Matei Zaharia Junping Du Luke Lu Konstantin Boudnik Konstantin Shvachko Amar Ramesh Kamat Robert(Bobby) Evans Daryn Sharp Jonathan Eagles Jason Lowe Kihwal Lee Koji Noguchi Mukund Madhugiri Tanping Wang Thomas Graves Gregory Chanan Jean-Daniel Cryans Jonathan Hsieh Jimmy Xiang Lars George Michael Stack Todd Lipcon Elliott Clark Matteo Bertozzi Gary Helmling 105 Individual InMobi InMobi InMobi INRIA Intel Intel LinkedIn LinkedIn LinkedIn MapR Microsoft Microsoft Pivotal Twitter Twitter UC Berkeley VMware VMware WANdisco WANdisco Yahoo! Yahoo! Yahoo! Yahoo! Yahoo! Yahoo! Yahoo! Yahoo! Yahoo! Yahoo! 
Cloudera Cloudera Cloudera Cloudera Cloudera Cloudera Cloudera Cloudera Cloudera Continuuity Platform Leadership in Open Source Software Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hbase Hive Hive Hive Hive Hive Hive Hive Hive Hive Hive Hive Hive Hive Hive Hive Hive Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Committer Jonathan Gray Ryan Rawson Doug Meil Amitanand S. Aiyer Kannan Muthukkaruppan Karthik Ranganathan Mikhail Bautin Nicolas Spiegelberg Liyin Tang Devaraj Das Enis Soztutar Jeffrey Zhong Nick Dimiduk Sergey Shelukhin Ted Yu Rajeshbabu Chintaguntla Andrew Purtell Anoop Sam John Ramkrishna S Vasudevan Jesse Yates Lars Hofhansl Nicolas Liochon Chunhui Shen Honghua Feng Liang Xie Prasad Mujumdar Gang Tim Liu Kevin Wilfong Siying Dong Daniel Dai Alan Gates Jason Dere Jitendra Pandey Sushanth Sowmyan Owen O'Malley Prasanth Jayachandran Sergey Shelukhin Vaibhav Gumashta Vikram Dixit Amareshwari Sriramadasu Eric Hanson 106 Continuuity DrawnToScale Explorys Facebook Facebook Facebook Facebook Facebook Facebook Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Huawei Intel Intel Intel Salesforce.com Salesforce.com Scaled Risk Taobao Xiaomi Xiaomi Cloudera Facebook Facebook Facebook Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks Hortonworks InMobi Microsoft Appendix Hive Pig Pig Pig Pig Pig Pig Committer Committer Committer Committer Committer Committer Committer Spark Spark Spark Spark Spark Spark Spark Spark Spark Spark Spark Spark 
… | Committer | Yin Huai | The Ohio State University
… | Committer | Xuefu Zhang | Inadco
… | Committer | Mark Wagner | LinkedIn
… | Committer | Prashant Kommireddi | Salesforce.com
… | Committer | Aniket Mokashi | Twitter
… | Committer | Koji Noguchi | Yahoo!
… | Committer | Gianmarco De Francisci Morales | Yahoo!
Spark | Committer | Stephen Haberman | Bizo
Spark | Committer | Mark Hamstra | ClearStory Data
Spark | Committer | Aaron Davidson | Databricks
Spark | Committer | Andy Konwinski | Databricks
Spark | Committer | Matei Zaharia | Databricks
Spark | Committer | Patrick Wendell | Databricks
Spark | Committer | Reynold Xin | Databricks
Spark | Committer | Tathagata Das | Databricks
Spark | Committer | Prashant Sharma | Databricks
Spark | Committer | Thomas Dudziak | Groupon
Spark | Committer | Andrew Xia | Intel
Spark | Committer | Jason Dai | Intel
Spark | Committer | Shane Huang | Intel
Spark | Committer | Nick Pentreath | Mxit
Spark | Committer | Imran Rashid | Quantifind
Spark | Committer | Ryan LeCompte | Quantifind
Spark | Committer | Ankur Dave | UC Berkeley
Spark | Committer | Charles Reiss | UC Berkeley
Spark | Committer | Haoyuan Li | UC Berkeley
Spark | Committer | Josh Rosen | UC Berkeley
Spark | Committer | Kay Ousterhout | UC Berkeley
Spark | Committer | Mosharaf Chowdhury | UC Berkeley
Spark | Committer | Shivaram Venkataraman | UC Berkeley
Spark | Committer | Sean McNamara | Webtrends
Spark | Committer | Mridul Muralidharam | Yahoo!
Spark | Committer | Ram Sriharsha | Yahoo!
Spark | Committer | Robert Evans | Yahoo!
Spark | Committer | Thomas Graves | Yahoo!
Tez | Committer | Arun C Murthy | Hortonworks
Tez | Committer | Bikas Saha | Hortonworks
Tez | Committer | Gunther Hagleitner | Hortonworks
Tez | Committer | Hitesh Shah | Hortonworks
Tez | Committer | Siddharth Seth | Hortonworks
Tez | Committer | Mike Liddell | Microsoft
Zookeeper | Committer | Patrick Hunt | Cloudera
Zookeeper | Committer | Henry Robinson | Cloudera
Zookeeper | Committer | Benjamin Reed | Facebook
Zookeeper | Committer | Thawan Kooburat | Facebook
Zookeeper | Committer | Alex Shraer | Google
Zookeeper | Committer | Mahadev Konar | Hortonworks
Zookeeper | Committer | Andrew Kornev | Individual
Zookeeper | Committer | Flavio Junqueira | Microsoft
Zookeeper | Committer | Michi Mutsuzaki | Nicira
Zookeeper | Committer | Camille Fournier | RentTheRunway
Accumulo | PMC member | Benson Margulies | Basis Technology Corp.
Accumulo | PMC member | Drew Farris | Booz Allen Hamilton
Accumulo | PMC member | Bill Havanki | Cloudera
Accumulo | PMC member | Mike Drob | Cloudera
Accumulo | PMC member | Sean Busbey | Cloudera
Accumulo | PMC member | Jason Trost | Endgame
Accumulo | PMC member | Billie Rinaldi | Hortonworks
Accumulo | PMC member | Josh Elser | Hortonworks
Accumulo | PMC member | Aaron Cordova | Koverse
Accumulo | PMC member | William Slacum | Koverse
Accumulo | PMC member | Christopher Tubbs | NSA
Accumulo | PMC member | Corey J. Nolet | Objective Solutions, Inc.
Accumulo | PMC member | Dave Marion | Objective Solutions, Inc.
Accumulo | PMC member | Keith Turner | Peterson Technologies
Accumulo | PMC member | Brian Loss | Praxis Engineering
Accumulo | PMC member | Adam Fuchs | sqrrl
Accumulo | PMC member | John Vines | sqrrl
Accumulo | PMC member | Eric Newton | SW Complete Inc.
Accumulo | PMC member | Chris Waring |
Accumulo | PMC member | David Medinets |
Hadoop | PMC member | Aaron T. Myers | Cloudera
Hadoop | PMC member | Doug Cutting | Cloudera
Hadoop | PMC member | Eli Collins | Cloudera
Hadoop | PMC member | Patrick Hunt | Cloudera
Hadoop | PMC member | Michael Stack | Cloudera
Hadoop | PMC member | Todd Lipcon | Cloudera
Hadoop | PMC member | Tom White | Cloudera
Hadoop | PMC member | Alejandro Abdelnur | Cloudera
Hadoop | PMC member | Dhruba Borthakur | Facebook
Hadoop | PMC member | Hairong Kuang | Facebook
Hadoop | PMC member | Zheng Shao | Facebook
Hadoop | PMC member | Arun C Murthy | Hortonworks
Hadoop | PMC member | Devaraj Das | Hortonworks
Hadoop | PMC member | Enis Soztutar | Hortonworks
Hadoop | PMC member | Giridharan Kesavan | Hortonworks
Hadoop | PMC member | Hitesh Shah | Hortonworks
Hadoop | PMC member | Jitendra Nath Pandey | Hortonworks
Hadoop | PMC member | Mahadev Konar | Hortonworks
Hadoop | PMC member | Matt Foley | Hortonworks
Hadoop | PMC member | Owen O'Malley | Hortonworks
Hadoop | PMC member | Sanjay Radia | Hortonworks
Hadoop | PMC member | Siddharth Seth | Hortonworks
Hadoop | PMC member | Steve Loughran | Hortonworks
Hadoop | PMC member | Suresh Srinivas | Hortonworks
Hadoop | PMC member | Tsz Wo (Nicholas) Sze | Hortonworks
Hadoop | PMC member | Vinod Kumar Vavilapalli | Hortonworks
Hadoop | PMC member | Hemanth Yamijala | Individual
Hadoop | PMC member | Amareshwari Sriramadasu | InMobi
Hadoop | PMC member | Sharad Agarwal | InMobi
Hadoop | PMC member | Uma Maheswara Rao G | Intel
Hadoop | PMC member | Nigel Daley | Jive
Hadoop | PMC member | Jakob Homan | LinkedIn
Hadoop | PMC member | Chris Douglas | Microsoft
Hadoop | PMC member | Raghu Angadi | Twitter
Hadoop | PMC member | Luke Lu | VMware
Hadoop | PMC member | Konstantin Shvachko | WANdisco
Hadoop | PMC member | Robert (Bobby) Evans | Yahoo!
Hadoop | PMC member | Daryn Sharp | Yahoo!
Hadoop | PMC member | Jonathan Eagles | Yahoo!
Hadoop | PMC member | Jason Lowe | Yahoo!
Hadoop | PMC member | Kihwal Lee | Yahoo!
Hadoop | PMC member | Thomas Graves | Yahoo!
Hive | PMC member | Brock Noland | Cloudera
Hive | PMC member | Xuefu Zhang | Cloudera
Hive | PMC member | Lefty Leverenz | Doc of the Bay
Hive | PMC member | Yongqiang He | Dropbox
Hive | PMC member | Ning Zhang | Facebook
Hive | PMC member | Raghotham Murthy | Facebook
Hive | PMC member | Gunther Hagleitner | Hortonworks
Hive | PMC member | Ashutosh Chauhan | Hortonworks
Hive | PMC member | Thejas Nair | Hortonworks
Hive | PMC member | Harish Butani | Hortonworks
Hive | PMC member | Carl Steinbach | LinkedIn
Hive | PMC member | Edward Capriolo | m6d
Hive | PMC member | Navis Ryu | NexR
Hive | PMC member | Namit Jain | Nutanix
Hive | PMC member | Ashish Thusoo | Qubole
Hive | PMC member | Joydeep Sensarma | Qubole
Pig | PMC member | Santhosh Srinivasan | Cloudera
Pig | PMC member | Daniel Dai | Hortonworks
Pig | PMC member | Alan Gates | Hortonworks
Pig | PMC member | Giridharan Kesavan | Hortonworks
Pig | PMC member | Ashutosh Chauhan | Hortonworks
Pig | PMC member | Thejas Nair | Hortonworks
Pig | PMC member | Richard Ding | IBM
Pig | PMC member | Cheolsoo Park | Netflix
Pig | PMC member | Bill Graham | Twitter
Pig | PMC member | Dmitriy Ryaboy | Twitter
Pig | PMC member | Jonathan Coveney | Twitter
Pig | PMC member | Julien Le Dem | Twitter
Pig | PMC member | Olga Natkovich | Yahoo!
Pig | PMC member | Rohini Palaniswamy | Yahoo!
Zookeeper | PMC member | Patrick Hunt | Cloudera
Zookeeper | PMC member | Henry Robinson | Cloudera
Zookeeper | PMC member | Benjamin Reed | Facebook
Zookeeper | PMC member | Mahadev Konar | Hortonworks
Zookeeper | PMC member | Ted Dunning | MapR
Zookeeper | PMC member | Flavio Junqueira | Microsoft
Zookeeper | PMC member | Michi Mutsuzaki | Nicira
Zookeeper | PMC member | Camille Fournier | RentTheRunway
Zookeeper | PMC member | Ivan Kelly | Yahoo!

Table 20 – ISVs and Technology Partners Matrix. Black cells (shown here as 1) represent a partnership arrangement; data extracted from www.hortonworks.com, www.cloudera.com and www.mapr.com on November 27th, 2014

Complement Vendor | Cloudera | Hortonworks | MapR
0xdata | 1 | 0 | 1
Abitech Software | 0 | 1 | 0
Acentrix | 0 | 0 | 1
Actian | 1 | 1 | 0
Actuate | 1 | 1 | 0
Adatao | 1 | 0 | 0
Admatic | 0 | 1 | 0
Aeronomy | 0 | 0 | 1
Aeverie Inc. | 0 | 0 | 1
Affini-Tech | 0 | 0 | 1
Aha! Software | 1 | 0 | 0
AllianceONE | 0 | 0 | 1
AlphaSix Corporation | 0 | 0 | 1
Alpine Data Labs | 1 | 1 | 1
Alteryx | 1 | 1 | 1
Amazon | 0 | 0 | 1
Amdocs | 0 | 1 | 0
Anchormen | 0 | 0 | 1
Apara Solutions | 0 | 0 | 1
Apervi | 0 | 1 | 0
APEXCNS | 0 | 0 | 1
Apigee | 0 | 1 | 0
Appcara | 1 | 0 | 0
Appfluent | 1 | 1 | 1
AquaFold | 1 | 0 | 0
Argil Data | 0 | 1 | 0
Argyle Data | 0 | 1 | 0
Arieso | 0 | 1 | 0
Atigeo | 1 | 0 | 0
AtScale | 1 | 1 | 0
Attivio | 1 | 0 | 0
Attunity | 1 | 1 | 0
Ayasdi | 1 | 1 | 0
Aziksa | 0 | 0 | 1
Azul Systems | 1 | 0 | 0
Basement Supercomputing | 0 | 1 | 0
Basis Technology | 1 | 0 | 0
BC Cloud | 0 | 0 | 1
BDI Systems | 0 | 0 | 1
BeagleData | 0 | 0 | 1
Big Data Elephants, Inc. | 0 | 0 | 1
Big Data Partnership | 0 | 0 | 1
Big Switch Networks | 0 | 1 | 0
BioDatomics | 1 | 0 | 0
BIPD Ltd | 0 | 0 | 1
BIPortal GmbH | 0 | 0 | 1
Birst, Inc | 1 | 0 | 0
Bit Stew Systems | 0 | 1 | 0
Blue Canopy Group, LLC | 0 | 0 | 1
BlueData | 1 | 1 | 0
BMC Software | 0 | 1 | 0
Booz Allen Hamilton | 0 | 0 | 1
BPM-Conseil | 1 | 0 | 0
BrainPad Inc. | 0 | 0 | 1
Bright Computing | 1 | 0 | 0
Brillio | 0 | 0 | 1
Broadgate Inc | 0 | 0 | 1
Calpont Corporation | 1 | 0 | 0
Canonical | 1 | 0 | 1
CAS | 0 | 0 | 1
Caserta Concepts | 0 | 0 | 1
Celer Technologies | 1 | 0 | 0
Centerity Systems, Inc. | 0 | 0 | 1
Centrify | 0 | 1 | 0
Century Link | 0 | 0 | 1
Ciber | 0 | 0 | 1
cimt AG | 0 | 0 | 1
Cirro | 1 | 0 | 0
Cisco | 0 | 1 | 1
ClearDATA | 0 | 0 | 1
Cleo | 0 | 1 | 0
Cloud A | 0 | 1 | 0
Cloudian | 0 | 1 | 0
Cloudsoft | 1 | 0 | 0
Comma Soft | 0 | 1 | 0
Composite Software | 1 | 0 | 0
Compsesa | 0 | 1 | 0
Computertekk | 0 | 1 | 0
Compuware | 0 | 1 | 0
comSysto | 0 | 0 | 1
Concurrent | 1 | 1 | 1
Contexti | 0 | 0 | 1
Continuent | 1 | 1 | 1
Continuuity | 1 | 1 | 0
Couchbase | 1 | 1 | 0
CSC | 0 | 1 | 0
Cumulus Networks | 0 | 1 | 0
Data Center Warehouse | 0 | 1 | 1
Data Tactics Corporation | 0 | 0 | 1
Databox | 1 | 0 | 0
Databricks | 0 | 1 | 1
Datagres | 1 | 0 | 0
Dataguise | 1 | 1 | 1
DataHub | 0 | 0 | 1
Dataiku | 0 | 0 | 1
Datalakes | 1 | 1 | 0
Datameer | 1 | 1 | 1
DataRPM | 1 | 1 | 1
DataStax | 1 | 1 | 0
DataTorrent | 1 | 1 | 1
Datawatch | 1 | 1 | 0
DBSync | 1 | 0 | 0
Dell | 1 | 1 | 0
Denodo | 1 | 1 | 0
Digital Reasoning | 1 | 0 | 1
DigitalRoute | 0 | 1 | 0
Diyotta | 1 | 1 | 1
Dragonfly Data Factory | 0 | 0 | 1
eCapital Advisors | 0 | 0 | 1
Edis Consulting | 0 | 0 | 1
Elasticsearch | 1 | 1 | 1
EngineRoom.io | 1 | 0 | 0
Envision IT Group | 0 | 0 | 1
Eruces | 1 | 0 | 0
Esri | 1 | 0 | 0
eTouch Systems | 0 | 0 | 1
Eucalyptus Systems, Inc | 1 | 0 | 0
Exar | 0 | 1 | 0
Exasol | 1 | 0 | 0
Excedis | 0 | 1 | 0
Expert System | 1 | 0 | 0
Feedzai | 1 | 0 | 0
FICO | 1 | 0 | 0
First Light Technologies | 0 | 0 | 1
Formation Data Systems | 0 | 1 | 0
FORMCEPT Technologies | 1 | 0 | 0
Fortscale | 1 | 0 | 0
Fusionex | 0 | 1 | 0
Fusion-io | 0 | 0 | 1
Fuzzy Logix, LLC. | 1 | 0 | 0
Globalscape | 0 | 1 | 0
Globant | 0 | 0 | 1
GoGrid | 1 | 1 | 0
Google | 0 | 0 | 1
Grand Logic, Inc. | 1 | 0 | 0
GraphLab | 1 | 1 | 0
GrayMatter | 0 | 0 | 1
Gruter | 0 | 1 | 0
GTRI | 0 | 0 | 1
H2O | 0 | 1 | 0
Hadapt | 1 | 0 | 0
HP | 0 | 1 | 1
HStreaming | 1 | 0 | 0
IBM | 1 | 1 | 1
Ideation816 Corporation | 0 | 0 | 1
IKANOW | 0 | 1 | 0
Impetus Technologies | 0 | 0 | 1
Indigo New Zealand Limited | 0 | 0 | 1
Infobright | 0 | 0 | 1
Infochimps, Inc | 1 | 0 | 0
Informatica | 0 | 1 | 1
Information Builders | 1 | 1 | 1
InfoTrellis | 0 | 1 | 0
Ingenious Qube | 0 | 1 | 0
InsightsOne | 0 | 1 | 0
IntegriChain | 0 | 1 | 0
Interactive Algorithms Inc. | 0 | 0 | 1
is-land Systems Inc. | 0 | 1 | 1
iTalent Corporation | 0 | 0 | 1
Jaspersoft | 1 | 0 | 1
JethroData | 1 | 0 | 1
Jinfonet Software | 0 | 1 | 1
Jinfonet Software | 0 | 1 | 1
Joyent | 1 | 1 | 0
Kapow Software | 1 | 0 | 0
Karmasphere | 1 | 0 | 1
Keylink Technology | 0 | 0 | 1
Knime.com | 1 | 0 | 0
Knowledgent | 0 | 0 | 1
Kognitio | 0 | 1 | 0
Koverse | 1 | 1 | 1
KPI Partners Inc. | 0 | 0 | 1
LG CNS | 0 | 1 | 1
Likya Teknoloji | 0 | 1 | 0
Logi Analytics | 1 | 1 | 0
Looker | 1 | 0 | 0
LSI | 0 | 1 | 0
Lucidworks | 1 | 1 | 1
ManTech | 0 | 0 | 1
MarkLogic | 0 | 1 | 0
MBI Solutions | 1 | 0 | 0
Mellanox Technologies | 0 | 0 | 1
MetiStream | 0 | 0 | 1
Metric Insights | 1 | 0 | 0
Micromata | 0 | 0 | 1
Microsoft | 1 | 1 | 0
MicroStrategy | 1 | 1 | 0
Mikan Associates | 0 | 0 | 1
MisOne Solution | 0 | 0 | 1
MongoDB | 1 | 1 | 1
MSR Cosmos, LLC | 0 | 0 | 1
Narus | 1 | 1 | 1
Nautilus Technologies | 0 | 0 | 1
NetApp | 0 | 0 | 1
New Relic | 1 | 0 | 0
NFLabs | 1 | 0 | 0
NGData | 1 | 0 | 0
Nimbix | 1 | 0 | 1
Nimble Storage | 0 | 1 | 0
NorCom | 0 | 0 | 1
Novetta Solutions | 1 | 1 | 0
NS Solutions | 0 | 0 | 1
NTC Vulkan | 0 | 0 | 1
Nutanix | 0 | 1 | 0
NxtGen | 1 | 0 | 0
O2MC | 0 | 1 | 0
OCTO | 0 | 0 | 1
Onepoint IQ | 0 | 0 | 1
Onramp Corporation | 0 | 0 | 1
OnX Enterprise Solutions | 0 | 0 | 1
Open V | 0 | 1 | 0
OpenOsmium | 0 | 0 | 1
Options I/O | 0 | 0 | 1
Oracle | 1 | 1 | 0
Orzota | 0 | 0 | 1
OS Nexus | 1 | 0 | 0
ParAccel | 1 | 0 | 0
Paxata | 1 | 0 | 0
Pentaho | 1 | 1 | 1
Pepperdata | 1 | 1 | 0
Persistent Systems | 0 | 0 | 1
Pervasive | 1 | 0 | 0
PetaSecure, Inc. | 0 | 1 | 0
PHEMI | 0 | 1 | 0
Platfora | 1 | 1 | 1
Plivo | 1 | 0 | 0
Podium Data | 0 | 1 | 0
Polyform Labs | 1 | 0 | 0
Pragmatix Services | 0 | 1 | 0
Pragsis | 0 | 0 | 1
Predixion Software | 1 | 0 | 0
Prime Dimensions, LLC | 0 | 0 | 1
Protegrity | 1 | 1 | 1
PSSC Labs | 0 | 1 | 1
Puppet Labs | 1 | 0 | 0
Qlik | 1 | 1 | 0
Quaero | 1 | 0 | 0
QuantCell Research | 1 | 0 | 0
Qubole | 0 | 1 | 0
Quest | 1 | 0 | 0
QuickLogix LLC | 0 | 0 | 1
Rackspace | 0 | 1 | 0
Radoop | 1 | 0 | 0
RainStor | 1 | 1 | 1
Red Gate | 0 | 1 | 0
Red Hat | 0 | 1 | 1
RedOwl Analytics | 1 | 0 | 0
RedPoint Global | 1 | 1 | 1
Reltio | 1 | 0 | 0
Revelytix | 1 | 1 | 1
Revolution Analytics | 1 | 1 | 1
RTTS | 0 | 1 | 0
SAP | 1 | 1 | 1
SAS | 1 | 1 | 0
Scaled Risk | 0 | 1 | 0
ScaleOut Software | 1 | 1 | 0
Search Technologies | 1 | 0 | 0
Securonix | 1 | 0 | 0
Semantic Research | 1 | 0 | 0
Sematext | 1 | 1 | 0
SequenceIQ | 0 | 1 | 0
SequoiaDB | 1 | 0 | 0
Serendio | 0 | 1 | 0
Servient | 1 | 1 | 0
SGI | 0 | 1 | 0
SHS-Viveon | 0 | 0 | 1
Simba | 1 | 1 | 0
Sisense | 1 | 0 | 0
Skytree | 1 | 1 | 1
Smart Platform | 1 | 0 | 0
SMP Management AG | 0 | 0 | 1
SnapLogic | 1 | 1 | 0
Softlayer | 1 | 0 | 0
SoftNet Solutions | 0 | 0 | 1
Solarflare | 1 | 0 | 1
Solix Technologies | 1 | 1 | 0
Sophias | 0 | 0 | 1
Spectra Logic | 0 | 1 | 0
Splice Machine | 1 | 0 | 1
Splunk | 1 | 1 | 1
Spring | 0 | 1 | 0
SQLstream | 1 | 0 | 0
Sqrrl | 1 | 1 | 1
StackIQ | 1 | 1 | 1
SteamBase | 1 | 0 | 0
SUSE | 1 | 1 | 0
Syncsort | 1 | 0 | 0
SYNTASA | 0 | 0 | 0
Tableau | 1 | 1 | 0
Talend | 1 | 0 | 1
Tamr | 1 | 1 | 0
Targit | 1 | 0 | 0
Tata Consultancy Services | 0 | 1 | 1
Teradata | 0 | 1 | 1
Tervela | 1 | 0 | 0
Think Big Analytics | 0 | 1 | 1
TIBCO Software | 1 | 0 | 1
Tidemark | 1 | 0 | 1
Trace3 | 0 | 1 | 0
Transcend Business Intelligence | 0 | 1 | 0
TransLattice, Inc. | 1 | 0 | 0
Trendwise Analytics | 1 | 0 | 1
Tresata | 1 | 1 | 0
Trifacta | 1 | 0 | 0
Tri-IT Solutions | 1 | 1 | 1
Tugbiz | 1 | 0 | 0
Twingo | 0 | 0 | 1
Typesafe | 0 | 1 | 0
Ubeeko | 1 | 1 | 1
Ubuntu | 0 | 1 | 0
UL Environment | 0 | 1 | 1
Unbelievable Machine | 1 | 1 | 1
Univa | 0 | 0 | 1
Vanilla | 0 | 0 | 1
Veristorm | 0 | 1 | 0
Vintech Solutions, Inc | 0 | 0 | 1
Violin Memory | 1 | 0 | 1
VMware | 0 | 1 | 1
Voltage Security | 1 | 1 | 1
VoltDB | 1 | 0 | 1
Vormetric | 0 | 0 | 0
WANdisco | 0 | 0 | 0
Waterline Data Science | 1 | 1 | 1
WE-Ankor | 1 | 1 | 1
WHISHWORKS | 0 | 0 | 1
WhiteKlay | 0 | 0 | 1
Wibidata, Inc | 1 | 0 | 0
World Wide Technology | 0 | 0 | 1
X15 Software | 1 | 1 | 0
Xenolytics | 0 | 1 | 0
Xiilab | 0 | 1 | 0
XOR Security | 1 | 1 | 0
Xplenty | 1 | 1 | 0
Yeswici LLC | 0 | 1 | 0
Ysance | 0 | 0 | 1
Z Data Inc. | 0 | 0 | 1
Zaloni | 0 | 0 | 1
Zementis | 1 | 1 | 0
Zettaset | 0 | 1 | 0
Zettics | 0 | 1 | 0
Zoho WebNMS | 1 | 0 | 0
Zoomdata | 1 | 0 | 0
Zuhlke Engineering | 0 | 1 | 0
Total Number of Vendors | 164 | 156 | 159

List of Figures
Figure 1 – A system dynamics model of direct network effects
Figure 2 – A system dynamics model of a two-sided platform
Figure 3 – Roles and Relationships in a Platform-Mediated Network
Figure 4 – Linux marketshare in various computing segments
Figure 5 – Results from the "Java Use and Awareness Study" from BZ Research, 2005
Figure 6 – Eclipse Project Committer by Company
Figure 7 – Porter's Five Forces Model
Figure 8 – Grove’s Six Forces Diagram
Figure 9 – The Android Platform
Figure 10 – Inter-network and Intra-network Competition
Figure 11 – Hierarchy of influence within an Apache Software Foundation project
Figure 12 – The Purchase Process of Complements
Figure 13 – Example of Platform Fragmentation
Figure 14 – Ecosystem Hijacking
Figure 15 – High-level Architecture of Android and Blackberry OS 10
Figure 16 – Google search trends of “Hadoop” and “Big Data” vs. “Data Warehouse”
Figure 17 – Major Building Blocks of a Hadoop Application Stack
Figure 18 – Diagram of basic MapReduce execution
Figure 19 – Cumulative Investments in Pureplay Hadoop Vendors
Figure 20 – Big Data-related Software and Services Revenue
Figure 21 – Official Committers to Apache Spark by Organization
Figure 22 – Hadoop Contributors by Organization
Figure 23 – Decision-tree for Selecting Development Model
Figure 24 – Analysis of Exclusivity of Partnership Arrangements for Hadoop ISVs
Figure 25 – Hadoop in 2011 vs. 2014
Figure 26 – Displacing Core Components

List of Tables
Table 1 – Open source platforms by commercial firms
Table 2 – Comparison of Openness by Role in Platform-mediated Networks
Table 3 – Taxonomy of Envelopment Attacks
Table 4 – Ten criteria of open source software
Table 5 – Apple, IBM and Sun Microsystems' Involvement in Open source
Table 6 – Summary of Strategic Considerations for Open Source Platform Vendors
Table 7 – AOSP-derived Products by Google Competitors
Table 8 – Google's Shift of Investment into Proprietary Capabilities
Table 9 – Decision Making Authorities in Different Open Source Communities
Table 10 – The Three V's of Big Data
Table 11 – A Selection of SQL on Hadoop offerings
Table 12 – Breakdown of Hadoop-market according to Forrester Research
Table 13 – Cumulative Investments in Pureplay Hadoop Vendors
Table 14 – Sample Hadoop Positioning Statements by Enterprise Software vendors
Table 15 – Partnership Matrix Between Pure play Vendors and Enterprise Software Vendors
Table 16 – Partnerships between Pure play Hadoop vendors and cloud IaaS vendors
Table 17 – Committers to Apache Spark
Table 18 – Top 10 Big Data Vendors by Revenue
Table 19 – Hadoop-related Apache committers by project and organizations
Table 20 – ISVs and Technology Partners Matrix