For the exclusive use of L. Wang, 2024.

9-622-060
MAY 26, 2022

KARIM R. LAKHANI
SHANE GREENSTEIN
KERRY HERMAN

AWS and Amazon SageMaker (A): The Commercialization of Machine Learning Services

The vision is to democratize machine learning for the entire planet. — Amazon SageMaker Team

It was 2018, and Bratin Saha, vice president, Amazon Web Services (AWS) Machine Learning, and a handful of his AWS colleagues were debating their latest challenge. Machine learning (ML), whose fundamental structure was inspired by the human brain, offered a promise of insights and prediction through analytics based on data. Unlike other analytics, ML could detect patterns in unstructured data. Uses of ML included predicting patients at risk of opioid use disorder by looking at patient prescriptions; observing products on a factory line to be sure they met quality standards; and, with sensors that captured vibrations, signaling when a machine needed maintenance.

While the cloud computing services AWS offered had quickly become some of the most sought-after services across the globe—and were approaching being pervasive and mainstream—creating an ML service posed some challenges. The data AWS customers stored in the cloud included everything: text, images, geospatial data, X-rays, audio or speech assets, and much more. Saha noted, “Traditional analytics relied on structured data, i.e., using data to predict something. But the vast majority of data in the world—as much as 80% of it—was unstructured. Companies were swimming in text, emails, social media, tweets. But we could not tap into this data. Our customers were looking for a way to tap into this data and drive better outcomes; machine learning held promise for this.”

Large technology (tech) companies—those with teams of 10 or more data scientists—and a few others had been investing in ML for some time and were creating a lot of value by integrating ML into services.
Saha and his team estimated they made up the vast majority of 2017 ML revenues. Other, smaller companies were interested, and as Saha said, “thought they would experiment a bit.” These smaller players made up the vast majority of ML volume. And increasingly, AWS was seeing growing interest from mid-sized companies. Saha said, “There were some who thought ML would eventually go big and were investing in it and starting to integrate.”

Professors Karim R. Lakhani and Shane Greenstein and Director Kerry Herman (Case Research & Writing Group) prepared this case. It was reviewed and approved before publication by a company designate. Funding for the development of this case was provided by Harvard Business School and not by the company. HBS cases are developed solely as the basis for class discussion. Cases are not intended to serve as endorsements, sources of primary data, or illustrations of effective or ineffective management.

Copyright © 2022 President and Fellows of Harvard College. To order copies or request permission to reproduce materials, call 1-800-545-7685, write Harvard Business School Publishing, Boston, MA 02163, or go to www.hbsp.harvard.edu. This publication may not be digitized, photocopied, or otherwise reproduced, posted, or transmitted, without the permission of Harvard Business School.

This document is authorized for use only by Lu Wang in DSO 574 (2024) taught by Milan Miric, University of Southern California from Jan 2024 to Jun 2024.

As they considered building an ML service for AWS under the name SageMaker, the team had to decide where to start.
Saha asked, “Which customers should AWS target first with their ML service?”

Amazon Web Services: A Brief History

In 2006, Amazon launched AWS, a set of information technology (IT) infrastructure services offered to businesses in the form of web services—eventually to become known as cloud computing. In the early 2000s, as it solved for its own e-commerce fulfillment needs, Amazon revisited its early (1994) IT development approaches, none of which had necessarily planned for future requirements. Amazon leadership discovered a lack of coordination across infrastructure teams, leading to duplicated efforts around building resources for each project, with little attention to potential efficiencies around scale or reuse.1 This hindered any ability to move quickly and flexibly.

To solve this problem, Amazon’s leadership called for a set of standard infrastructure services to be built for the entire company to access. They mandated well-documented application programming interfaces (APIs), creating an ordered and disciplined approach to development across Amazon’s internal development community; these eventually translated seamlessly to third-party developers as Amazon opened its services.2 The services all operated behind the AWS Management Console. As one observer noted, Amazon essentially made infinite disk space available to anyone, at a low, pay-only-for-what-you-need price, and paired this with simple APIs such that anyone could “pick it up and build something useful in it, in the first 24 hours of using an unreleased, unannounced product.”3

In meeting its own e-commerce fulfillment needs, Amazon had developed valuable expertise in building and running cost-effective data centers, as well as compute, storage, and database infrastructure services. These different components combined to form an Internet “operating system.”4 Packaged as discrete components, on a pay-as-used basis, these services formed AWS.
Andy Jassy, then chief of staff to CEO Jeff Bezos, said, “We realized we could contribute all of those key components of that [I]nternet operating system, and with that we went to pursue this much broader mission [. . .] which is [. . .] to allow any organization or company or any developer to run their technology applications on top of our technology infrastructure platform.”5

As its CEO, Jassy led AWS from its inception in 2003 as a stand-alone business providing tech infrastructure for customers’ business use cases, including storage, computing, advertising, e-commerce, and product development.6 (See Exhibit 1 for the AWS service structure and Exhibit 2 for a representative list of AWS services.) Amazon advertised the service as an opportunity for small enterprises or startups to quickly scale their businesses, as it eliminated the need for each small company or startup to build and maintain its own tech infrastructure.7 This gave AWS an early set of connections with new users in the mass-middle business market.

Machine Learning

Machine learning, a subset of artificial intelligence (AI), taught computer programs to identify patterns in datasets.8 Early ML work took a knowledge-driven approach. Many traced the earliest examples of a computer learning program to IBM researcher Arthur Samuel, who in 1952 wrote a program for an IBM computer to play checkers. The more games the computer played, the more it improved, as it learned which moves contributed to winning strategies. In 1957, when psychologist and Cornell researcher Frank Rosenblatt helped design computers that learned through trial and error,
he worked on perceptrons,a based on a type of neural network that simulated thought processes in the human brain.9 In 1967, the “nearest neighbor” algorithm helped computers use pattern recognition. Other advances followed. Explanation Based Learning, or EBL, taught computers to analyze training data and, by discarding data it identified as unimportant, create a general rule it could follow. NetTalk learned to speak (or pronounce words) the same way that an infant would.10

In the 1990s, ML work shifted to a more data-driven approach, with programs aimed at helping computers analyze data to make insights or gain learning.11 ML consisted of either supervised learning (a model that categorized data based on pre-determined criteria), unsupervised learning (a model that analyzed datasets to reveal previously unknown trends or associations), or reinforcement learning (a model that identified the best path to reach a particular outcome).12 For example, ML could learn enough to accurately identify malware, or to spot anomalous patterns in how data was accessed in the cloud and successfully predict security flaws.13 It had applications in the Internet of Things, including some related to automation of mechanical processes and safety on production lines. It also helped create recommendation systems, voice-activated virtual assistants, and automation.14

ML typically took place in several stages. First, users collected the relevant data and “cleaned” it to standardize all accompanying metadata. For instance, within a dataset that listed educational institutions, the cleaning stage might transform “HBS” into “Harvard Business School” to prevent the ML algorithms from categorizing them as separate entities. After cleaning the data, users could apply labels, calculations, or other modifications. Next, developers built algorithms that analyzed the data.
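The cleaning stage described above can be sketched in a few lines of Python. This is a minimal illustration only; the alias table and function name are hypothetical, not drawn from any AWS tooling:

```python
# Minimal sketch of the "cleaning" stage: map name variants onto one
# canonical entity so an ML algorithm does not treat them as separate things.
# The alias table below is hypothetical, for illustration only.
ALIASES = {
    "HBS": "Harvard Business School",
    "Harvard B-School": "Harvard Business School",
}

def clean_institution(name: str) -> str:
    """Strip stray whitespace, then canonicalize known aliases."""
    name = name.strip()
    return ALIASES.get(name, name)

records = ["HBS", " Harvard B-School ", "Harvard Business School"]
cleaned = [clean_institution(r) for r in records]
print(cleaned)  # → ['Harvard Business School', 'Harvard Business School', 'Harvard Business School']
```

In a real pipeline this lookup would typically be replaced by fuzzy matching or learned entity resolution, but the goal is the same: one entity, one label.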
An ML model’s initial datasets for supervised learning trained the algorithms to perform specific functions, such as predicting whether an image depicted a cat or a dog.15 In contrast, unsupervised learning algorithms analyzed datasets to identify trends; for example, Netflix’s ML algorithms segmented customers based on their previous viewing history.16 After creating and training the algorithms, programmers evaluated the accuracy of the ML’s results and, if necessary, recalibrated the model or exposed it to additional training datasets.17 In the final stage of ML development, programmers deployed the system and tracked its performance over time.18

ML relied on machine learning libraries—essentially compilations of functions and routines ready for use—that allowed developers and data scientists to focus on the model they wanted to train and deploy, rather than writing code to manage the underlying complexities of their data. They offered common learning algorithms and utilities, and were typically usable across a range of popular programming languages, including Java and Python. Different libraries focused on different functionalities, e.g., text processing, graphics, data manipulation, or scientific computation.19

Companies applying ML across their organizations typically applied the same series of steps to their ML operations: data collection, data processing, feature engineering, data labeling, model design, model training, model optimization, and model deployment and monitoring. AWS (and other cloud vendors) aimed to provide a suite of tools that enabled data scientists to complete these steps—pulling data from their data warehouse, creating their algorithm model code, and deploying the model onto production—within one go-to environment and set of tools.

a A perceptron, sometimes referred to as the first neural network, was an algorithm that classified input into two possible categories (i.e., binary).
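The perceptron described in the footnote above is simple enough to sketch directly. The following is a minimal NumPy version under stated assumptions: the learning rate, the OR-gate training data, and all names are illustrative choices, not details from the case:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Classic perceptron rule: predict via threshold, nudge weights on errors."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            # Update weights and bias only when the prediction is wrong
            w += lr * (target - pred) * xi
            b += lr * (target - pred)
    return w, b

def predict(w, b, X):
    """Classify each row into one of two categories (0 or 1)."""
    return (X @ w + b > 0).astype(int)

# Logical OR is linearly separable, so the perceptron converges on it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, y)
print(predict(w, b, X))  # → [0 1 1 1]
```

As footnote a notes, the output is binary; Rosenblatt's insight was that the weights could be learned from examples rather than hand-coded.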
Amazon SageMaker: Making Machine Learning Accessible, But How?

Amazon had been doing ML since the early 2000s for Amazon.com. “We know a lot about what our customers want,” Mukesh Karki, SageMaker’s engineering director, said. “We can do this and simplify it for our clients.” Amazon’s retail teams had often innovated to solve internal problems, at times looking to find a common widget. Omer Zaki, SageMaker’s general manager, added, “Back in the day, Amazon did a lot of ML for their own operations and retail. These were a leading indicator of what the customer or industry would need.”

The most significant gaps to making ML more accessible were software tools for different phases of the ML workflow; simplifications that could make the math behind ML more accessible and standardized; and a standardized interface between the tools used in different phases of the ML workflow. Without these, it was very difficult for developers, engineers, and other business unit-affiliated managers not trained in advanced statistics to apply ML to their company’s needs. Saha said, “For ML and AI we saw that we would have to help develop products that fit customer needs, teach users new capabilities, and seed adoption through various use cases. It meant teaching users how to use AI.”

A Long Game?

At a broader level, Saha also believed that, while ML might not be a large part of a company’s IT spend, it was possible that ML services would play a decisive role in cloud service uptake. He said, “In 10 to 15 years, ML will be integral to everything a company does. So, while today ML could be a small portion of a company’s total IT spend, over time it will become a bigger portion.
It will be an important factor in which cloud provider a customer chooses.” He added, “So we saw we could have a disproportionately large downstream impact.”

AWS’s and Amazon’s top leadership supported the team’s explorations. Saha said, “This is where long-term vision helps. Andy (Jassy), Adam (Selipsky), Peter (DeSantis), Swami (Sivasubramanian) and AWS leadership in general always reserve a good part of our planning to review future opportunities so that AWS continues to stay ahead of future customer needs. They ask, ‘What does the business look like several years out? What investments should we look at?’”

Yet, as the team plotted building and growing new ML services, operating within AWS or Amazon without a viable business model was not an option. Saha said, “We are a long-term-oriented company, and to show that there is a viable business, we had to make a solid business plan, as if we were a startup. We needed intermediate milestones.” Saha formalized the team, drawing on engineers with data services experience. The team sized their investments based on a multi-year plan and set milestones (see Exhibit 3 for their early vision). The milestones provided guard rails around setting goals to reach their targets. Saha added, “This is why AWS is a great place to build a business. We have always worked off the long term.”

Relying on AWS Roadmaps

Services including Amazon’s Elastic Compute Cloud (EC2), launched in 2006, and Relational Database Service (RDS), spun up in 2009, provided a roadmap of sorts for the new ML services, with some differences. RDS had been able to leverage existing tools, standards, and software approaches and enhance them with additional features, availability, performance, scalability, reliability, and security. It also made them accessible on an as-needed pricing basis in the cloud.
In contrast to moving an enterprise’s existing IT needs to a cloud provider such as AWS, AI/ML capabilities required a new education of customers and a better understanding of use cases, with many more users, or types of users, all with different learning capabilities. Added to this was the challenge of market segment fit. Zaki said, “The category didn’t exist.”

What to Build?

Relational database approaches typically operated under established paradigms, with accepted standards across all the main players (Microsoft SQL Server, MySQL, Oracle Database, etc.), meaning they were somewhat standardized. In contrast, ML had no such paradigms and was far less uniform across developers and data scientists, with a mix of libraries, tools, and players, including Apache’s MXNet, Facebook’s PyTorch, and Google’s TensorFlow. The team’s ML effort faced the challenge of creating building blocks with value to offer, but this value would be different for different users, under different circumstances and in differing environments, depending on each customer’s ML experience, data, needs, and goals.

The team identified the core capabilities data scientists and developers needed for ML. These included tools that helped users manage whichever libraries and frameworks they chose, and apply them to their specific needs; training machine learning models; running inference; and supporting notebooks,b which were popular among data scientists. The team learned that some users needed pre-built algorithms with generic capabilities, while others would bring their own.
Containerization—or operating system (OS) virtualization, where apps could run in isolated user spaces on the same OS—was important to these users. This raised the challenge of finding the fastest path to those conditions.

The team recognized that customers’ data scientists had to go through a series of steps to even get to the point of being able to start using ML. This included getting computing power provisioned to them, often meaning that systems managers had to take that resource from somewhere else in the company. The team’s first efforts focused on building a component that could manage these needs. Sumit Thakur, SageMaker’s principal product manager, had worked on training ML models at scale. He recalled, “We had to think differently from the way AWS has done things. We’re typically focused on rebuilding something by moving it to the cloud—this is the AWS cloud/enterprise IT story. Now we have to think more in terms of a different persona and build for them.” The team developed a user interface for their ML service on the AWS Management Console to launch first.

Developing Road Maps

Getting early versions of the product out was a priority, and the team focused on iterating on customer input. Saha noted, “We believed velocity of innovation was especially critical in an early domain where standards had not yet been established. Customer feedback played a key role in our roadmap process.”

The first step, referred to as the planning season (typically Q2), helped set the stage for strategy and vision setting. The team captured signals coming from customers using early tool and product iterations. This feedback informed a planning exercise—thinking through short- and long-term bets—allowing the team to write up a page or two describing who the customer was and what their needs were. “We’ll identify four to five questions to answer,” said Thakur. “We think of these as business pitches.” This then informed the team’s customer outreach and product adjustments.
By the beginning of the following year, the team would have spoken to more customers, understood which ideas were worth pursuing, and picked out those they considered high-value. Thakur said, “This is what we call ‘working backwards.’”

b Notebooks referred to digital computation notebooks used to document procedures, data, calculations, and findings.

The next step, or what was also called press release and frequently asked questions (PRFAQ), had the team filtering the new product features through the customer’s and end user’s experience. Thakur acknowledged, “This means there is a presence of the customer in the room. It helps align everyone in the room around ‘Who are we building this for?’ It enables us to authentically represent who we are building this for.” At the end of the process, the team sought stakeholder review and approvals at every level. From there, they launched into customer discovery sessions, including writing user stories. The third step involved concretely scoping those features and experiences that the planning process and PRFAQ had specified to be solved (see Exhibit 4).

Whom to Build It For?

Machine learning customers could be broadly grouped into three categories—sophisticated, mid-market, and just-getting-started. The sophisticated companies were often large tech firms with large data scientist teams who were sophisticated users of ML in many aspects of their business. The team’s research had concluded that, given the early stage of ML, there were only a few such companies, but these top ML spenders were a significant opportunity. Saha said, “The top 10 can be a disproportionate amount of spend.
This was an attractive target.” Another group of customers were companies with mid-sized data science teams doing ML. These customers were typically just starting to deploy ML at scale in their companies, and the average revenue per customer was comparatively much smaller. Likewise, this group of customers presented a smaller opportunity, given ML was still a fairly immature domain. Finally, there were many companies who were just getting started with ML. These typically had small data science teams, most of whom were still experimenting with ML. Saha noted, “While the average spend of these customers was also small, the volume here was easily an order of magnitude more than other customer profiles.”

Exhibit 1 AWS Structure

Source: Amazon, “An Introduction to AWS,” https://aws.amazon.com/startups/start-building/how-aws-works/, accessed September 2021.
Exhibit 2 Select List of Representative AWS Services, 2011

Function | Service Name
Compute | Amazon Elastic Compute Cloud (EC2)
Compute | Amazon Elastic MapReduce
Compute | Auto Scaling
Content Delivery | Amazon CloudFront
Database | Amazon SimpleDB
Database | Amazon Relational Database Service (RDS)
E-Commerce | Amazon Fulfillment Web Service (FWS)
Messaging | Amazon Simple Queue Service (SQS)
Messaging | Amazon Simple Notification Service (SNS)
Monitoring | Amazon CloudWatch
Networking | Amazon Virtual Private Cloud (VPC)
Networking | Elastic Load Balancing
Payments and Billing | Amazon Flexible Payments Service
Payments and Billing | Amazon DevPay
Storage | Amazon Simple Storage Service (S3)
Storage | Amazon Elastic Block Storage (EBS)
Storage | AWS Import/Export
Support | AWS Premium Support
Web Traffic | Alexa Web Information Service
Web Traffic | Alexa Top Sites
Workforce | Amazon Mechanical Turk

Source: Janakiram MSV, “AWS Service Sprawl Starts to Hurt the Cloud Ecosystem,” Forbes, January 8, 2018, https://bit.ly/3Dn6iSv, accessed December 2021.

Exhibit 3 Early Vision for Amazon SageMaker Product

Code to Support the PC Operating System

Note: This diagram illustrates code supporting the PC operating system. Moving from left to right, source code, representing higher-level abstractions, is grabbed by the compiler, which converts it (relying on a range of tools and environments, including libraries, integrated development environments, software project management, code repositories, and debuggers) into executable code on the right. Continuous performance, cost, and ease-of-use improvements are the benefits gained by this code process.

Comparing PC and Machine Learning

Note: This diagram compares PC and machine learning.
As with the process of converting source code to executable code, machine learning starts with data, which is then acted on by algorithms, which give a model as output.

Exhibit 3 (continued) Early Vision for Amazon SageMaker Product

Code to Support the ML Operating System

Source: Company documents.

Note: The diagram illustrates the underlying tools and activities involved in the ML process. Data represents higher-level abstractions and relies on SageMaker’s toolkit engines to unpack. From there, algorithms and conversion rely on libraries (i.e., SageMaker’s Deep Learning AMI; SM Studio; SM Experiments; SM Model Management; and SM Debugger) to complete their actions. The resulting model coming out of this process relies on runtime monitoring (i.e., SM Monitor). Continuous performance, cost, and ease-of-use improvements are managed or tracked by additional tools (i.e., SageMaker’s Engine—EIA, Dynamic Training, Distributed Training, and MLPerf).

Exhibit 4 Illustrative Roadmap to Building High-Velocity Innovation Around Customer Needs

Step 1: Working Backwards
- Think through short- and long-term bets.
- Consider alternative “business pitches.”
- Reach out to customers and discuss.
- Product adjustments.

Step 2: Press Release & Frequently Asked Questions (PRFAQ)
- Discuss as if a customer is in the room. Focus on: “Who are we building this feature for?”
- Stakeholder review.
- Writing user stories.

Step 3: Scoping the Feature
- Narrow activities to key features.
- Define features.
- Use planning process to identify priorities.
- Identify open issues in PRFAQ.

Source: Casewriter.
Endnotes

1 Ron Miller, “How AWS Came to Be,” TechCrunch, July 2, 2016, https://tcrn.ch/2ZI91GB, accessed September 2021.
2 Miller, “How AWS Came to Be.”
3 Tom Krazit, “How Amazon’s S3 Jumpstarted the Cloud Revolution,” Protocol, March 12, 2021, https://bit.ly/3Nw47AV, accessed December 2021.
4 Miller, “How AWS Came to Be.”
5 Miller, “How AWS Came to Be.”
6 Amazon, “Cloud Computing with AWS,” https://aws.amazon.com/what-is-aws/, accessed September 2021.
7 Amazon, “An Introduction to AWS,” https://go.aws/3Nwse2w, accessed September 2021.
8 Karen Hao, “What Is Machine Learning?” MIT Technology Review, November 17, 2018, https://bit.ly/36zgMT8, accessed September 2021.
9 Melanie Lefkowitz, “Professor’s Perceptron Paved the Way for AI—60 Years Too Soon,” Cornell Chronicle, September 25, 2019, https://news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon, accessed December 2021.
10 Bernard Marr, “A Short History of Machine Learning—Every Manager Should Read,” Forbes, February 19, 2016, https://bit.ly/36whY9T, accessed December 2021.
11 Marr, “A Short History of Machine Learning—Every Manager Should Read.”
12 Marco Iansiti and Karim R. Lakhani, Competing in the Age of AI: Strategy and Leadership When Algorithms and Networks Run the World (Boston: Harvard Business Review Press, 2020), pp. 64–69.
13 Bernard Marr, “The Top 10 AI and Machine Learning Use Cases Everyone Should Know About,” Forbes, September 30, 2016, https://bit.ly/3wJkDrn, accessed November 2021.
14 IBM Cloud Education, “Machine Learning,” IBM, July 15, 2020, https://ibm.co/3IPFYBP, accessed September 2021.
15 Iansiti and Lakhani, Competing in the Age of AI, pp. 64–66.
16 Iansiti and Lakhani, Competing in the Age of AI, pp. 66–68.
17 Iansiti and Lakhani, Competing in the Age of AI, pp. 64–68.
18 Amazon, “Machine Learning with Amazon SageMaker.”
19 Akhil Bhadwal, “15 Best Machine Learning Libraries You Should Know in 2021,” hackr.io blog, February 26, 2021, https://hackr.io/blog/best-machine-learning-libraries, accessed December 2021.