Data Fabric as Modern Data Architecture
Alice LaPlante

Beijing Boston Farnham Sebastopol Tokyo

Copyright © 2021 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Gary O’Brien
Production Editor: Kate Galloway
Copyeditor: Audrey Doyle
Proofreader: Christina Edwards
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

June 2021: First Edition

Revision History for the First Edition
2021-06-02: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Fabric as Modern Data Architecture, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and TIBCO.
See our statement of editorial independence.

978-1-098-10592-1
[LSI]

Table of Contents

Introduction
1. Why Build a Data Fabric?
    The Limits of Existing Data Architectures
    What Success Looks Like
2. What Is a Data Fabric?
    The Architectural Pattern of a Data Fabric
    Building a Data Fabric Is a Journey
3. How to Get Started
    Five Pieces of Advice for Getting Started on a Data Fabric
    Best Practices When Managing and Growing Your Data Fabric
    Conclusion: It’s Time to Act

Introduction

Your business faces major changes due to digital transformation. The only way to thrive during the complex transitions that are an inevitable part of your transformation journey is through data. By treating your data as the strategic asset that it is, you can successfully complete your journey in a way that differentiates you from the competition.

The good news is that most businesses today understand this: according to an Ernst & Young survey, more than 80% of organizations view data as a strategic asset.[1] But this doesn’t mean they’re acting in a way that allows them to get the most from their data: fewer than half (49.5%) have put a formal data strategy in place, and fewer than 30% of financial executives said they have fully weighed the costs of poor-quality data.

This is because becoming data driven and accomplishing the admittedly ambitious goal of fully democratizing your data is not particularly easy. Each of the “five Vs” of big data (volume, velocity, variety, veracity, and value) has its own challenges.
But the one we’re going to focus on in this report is arguably one of the top challenges to getting the most out of your data: variety. And not just the variety of data structures, formats, and types, but the variety of data meanings (i.e., semantics).

[1] “Data: The Strategic Asset,” Financial Executives Research Foundation, Inc., November 2019, https://oreil.ly/uEflr.

Of the five Vs, variety refers to the different types of data that can exist. When data variety is high, the complexity of the data increases, which is the chief reason businesses are seeking data fabrics: they have X sources of data, and every source has hundreds of tables, each with dozens of columns. At the same time, with all these sources of data they must serve Y users or use cases, each requiring slightly different data.

Whether data is structured or unstructured is only the beginning of the complexity facing businesses today. Most are familiar with these two categories (three if you add semistructured data) and have figured out ways to integrate them. But there are a number of other challenges specifically concerning the variety of data. Chief among them is performing analytics with mixed-modal data, since traditional analytics is designed to work with highly formatted data and copes poorly with inconsistent or noisy data. This makes it hard to integrate different types of data, which is why data lakes are notoriously difficult to manage. Finally, the quality of data that exhibits a lot of variety can be low.

A subset of data variety is data distribution. That is, we would argue that it’s not only the different types, but the number of sources that raise challenges, especially when considering how much data is being created and stored in the cloud.

Essentially, data is everywhere, and it is all different. This includes Internet of Things (IoT) data from a distribution warehouse, real-time SAP transactions, and Salesforce or other software-as-a-service (SaaS) datasets.
All of these sources may involve customer data of some kind, but each has a different purpose and different data consumers. All the silos in all the departments, each with its own set of tools and techniques, business rules, and definitions that must be orchestrated, also add to the complexity. Questions arise. Where is the data? What kind of data is it? How can I get the data to the users who need it?

Centralization implies control, and some companies are still pursuing the goal of having only one, centralized source of data (we’ll explain why this is not necessarily such a good idea later in this report). Unsurprisingly (to us), only 6% of companies have achieved this, according to a recent survey from the Business Application Research Center.[2]

On the other hand, most companies use multiple data sources (see Figure I-1). Almost one in five companies (18%) use 20 or more data sources for decision-making, with this number expected to grow to 50% in the near future, according to the BARC survey. But the more data sources you have, the more likely it is that data quality will be a problem. Data governance thus becomes even more critical as dependence on multiple data sources increases.

Figure I-1. Most companies use multiple data sources for decision-making (Source: BARC)

A recent research report by Dimensional Research found that many businesses aren’t fully leveraging the data they possess, because these data variety challenges are making it difficult to build or operate data pipelines.[3] Almost half (44%) say that critical data is not yet usable for making decisions, and 68% say they can only get further insights from existing data if they have more time. Additional findings include the following:

• 59% of companies use 11 or more data sources.
• 98% of participants say their pipelines break.
• 51% say this breakage happens more than once a month.
• 91% say data source availability challenges are the reason why pipelines break daily.
• 66% report less operational efficiency as a result of broken pipelines.
• 59% report delayed decisions or lost opportunities because of broken pipelines.

[2] “How Many Data Sources Do Companies Rely On For Decision-Making?” Business Application Research Center (BARC), accessed May 6, 2021, https://oreil.ly/vmEpi.
[3] “New Survey Finds More Than Two-Thirds of Companies Leave Valuable Data Untapped,” Business Wire, March 11, 2021, https://oreil.ly/p5W4A.

Finally, data complexity can also be caused by data naming conventions. When businesses use technical data specifications such as table names and column names instead of the business terminology users are familiar with, miscommunications and inconsistencies invariably arise. If different systems name or define the same kind of data differently (for example, if the definitions of order entry and receivables in Salesforce are different from those in SAP), you’ve got additional complexity to factor in.

A Changing World That Needs Data Democratization

In addition to these challenges with the data itself, we’re also challenged by a world that is in the middle of a major organizational transition. With the onset of the COVID-19 pandemic, office workers began working remotely, and many may continue to do so once social distancing restrictions are fully lifted. In addition, we’re seeing more mobile workers, and even people working nomadically with no fixed office.

Indeed, mobile users and so-called digital nomads are causing businesses to think in new ways about the user experience. Data analysts who sometimes work from home, sometimes on the road, and sometimes from a café need the same secure access to data that they have at the office. Easier, simpler tools are required.

But sometimes it can feel like we’re taking two steps forward and one step back.
By 2019, almost half (48%) of businesses said they competed using their data, according to the 2019 NewVantage Partners survey on big data.[4] This showed progress; in NewVantage’s 2006 survey, only 5% of large organizations said this.

[4] “Big Data and AI Executive Survey 2019: Executive Summary of Findings,” NewVantage Partners LLC, accessed May 6, 2021, https://oreil.ly/AnyH8.

In NewVantage’s 2020 report, however, the news was not particularly good. Although investment in data was up, showing that companies generally realize data’s importance, the pace of that investment was losing momentum. The percentage of companies investing more than $50 million in data was 65% in 2020, compared to just 40% in 2019. But only 52% of companies were increasing their rate of investment, compared to the 92% that were doing this in 2019.[5]

Worse, only 38% reported that they had created a data-driven organization. Even fewer (only 27%) had built a data culture.

This tells us that the all-important goal of data democratization is not being reached. And it’s not necessarily the technology that is holding firms back. Nine out of 10 companies point to people and process challenges as the biggest barriers to data democratization.

Opportunities Abound, with the Help of a Data Fabric

By enabling a distributed, mobile workforce and democratizing data, businesses today can do the following:

• Increase operational efficiencies
• Better calibrate the right pricing for their goods and services
• Personalize sales and marketing initiatives
• Improve the customer experience
• Identify fraudulent transactions

…and much, much more.

Until fairly recently, data scientists and analysts squandered 80% of their time wrestling with data and spent just 20% exploring it. That used to be the rule. But IDC’s research director of data integration and data intelligence software, Stewart Bond, reported last year that this rule is starting to bend.
IDC’s December 2019 data culture survey found that knowledge workers are spending closer to 30% of their time finding insights.[6] The new 70/30 rule is a substantial improvement on the 80/20 one.

[5] “NewVantage Partners Releases 2020 Big Data and AI Executive Survey,” Business Wire, Jan. 6, 2020, https://oreil.ly/SPTy5.
[6] Stewart Bond, “End-User Survey Results: Deployment and Data Intelligence in 2019,” IDC, November 2019, https://oreil.ly/EihMu.

This is becoming possible because businesses are organizing and managing their data in smarter ways, in particular by using something called a data fabric.

In this report, we’ll first describe the conditions that are pushing the limits of current data management strategies. Then we’ll explain what a data fabric is, including its components and its architecture. We’ll highlight the benefits and some early use cases. Finally, we’ll provide five pieces of advice for getting started on deploying a data fabric in your organization, along with some best practices for making sure you’re doing it right.

CHAPTER 1
Why Build a Data Fabric?

Why do you need this thing called a data fabric? It’s not just because of the sheer size of your data. You also are faced with access and integration challenges because of where the data is coming from, where it’s stored, and in what form.

You’ve got data on premises. In the public cloud. In private clouds. You have data in multicloud and hybrid cloud ecosystems. Within these various silos, some of the data is structured but most is unstructured, which raises challenges.

And don’t forget streaming data; that’s an important part of the picture, too.

What’s the state of enterprise data, then? Fragmented.
A full 93% of enterprises have a multicloud strategy, with 87% having a hybrid cloud environment in place, according to Flexera’s 2020 State of the Cloud survey.[1] On average, companies have data stored in 2.2 public and 2.2 private clouds, as well as in various on-premises data repositories (see Figure 1-1). Businesses are pushing the limits of what they can do with existing data management tools.

[1] Tanner Luxner, “Cloud Computing Trends: 2021 State of the Cloud Report,” Flexera, March 15, 2021, https://oreil.ly/skemo.

Figure 1-1. The fragmented state of enterprise data (Source: Flexera)

The reasons for this fragmentation are varied, and include the following:

Time-to-data-insight is a competitive differentiator
    Today nearly every business transformation (whether aiming for greater customer intimacy, more optimized operations, or faster innovation) is fueled by data-driven insights. The days when business users would patiently wait weeks or even months for IT to deliver new datasets are gone. Not only are your users demanding rapid responses to their queries, but the competitive nature of today’s markets requires it. The dilemma is that queries on databases with billions of records can take hours to return. The need to change this is urgent, as companies with data intelligence shared in real time or near-real time are 18 times more likely to make better and faster decisions than their competitors.[2]

Demand for self-service data continues to explode
    Enabled by easier-to-use, more powerful analytics tools such as Power BI and Spotfire, business users are demanding more data, delivered more swiftly. Whether you consider this data democratization or data chaos, the trend is very real, and data users’ needs must be satisfied for your organization to maintain a competitive edge.

Data’s relentless growth and fragmentation are accelerating
    The volume of data today is such that no organization can hope to centralize all its data in one place. Businesses must accept the fact that data is going to be everywhere and will get used everywhere by virtually everyone: in computer rooms, on desktops and mobile phones, on IoT-connected devices on the factory floor, and by third parties, including customers, vendors, partners, and more. The dream of centralization is over. Something needs to replace it.

The increasing complexity of data analytics means the status quo keeps changing
    The theory of evolution by natural selection in Darwin’s On the Origin of Species is not about an organism’s ability to thrive based on its absolute fitness per se, but about its fitness to adapt to an ever-changing environment. The same is true in the data universe. Increases in complexity and in the speed of innovation are wreaking havoc on existing data strategies, architectures, infrastructure, and more. If your goal is to have a data-driven competitive advantage, all of these things must be agile, and must be able to evolve as necessary.

The data analytics skills shortage persists
    And don’t forget the human dimension. For everyone from database analysts to data stewards, data engineers to developers, and business analysts to data scientists, workloads are expanding exponentially, far faster than your human resources can handle. This slows down your ability to get value from your data and reduces your relative competitive advantage.

[2] Adam DeMattia, John McKnight, Jennifer Gahm, and Monya Keane, “Research Proves IT Transformation’s Persistent Link to Agility, Innovation, and Business Value,” The Enterprise Strategy Group, Inc., March 2018, https://oreil.ly/sAZUW.

And just as there are more data sources, there are more, and different kinds of, data consumers.
In addition to data scientists and data analysts, you’ve got business users, executives, customers, suppliers, and partners such as distributors and retailers. You’ve even got machines as consumers: IoT devices at the edge of your network, both producing and analyzing data.

Added to this are the demands of all the newly remote and distributed workers who can be located in the next city, the next state, or across the planet.

You need a new, flexible solution to cope with all of this, one that can achieve the following, arguably difficult-to-hit, objectives:

• Simplify data democratization
• Unify your data environment
• Eliminate data silos
• Centrally coordinate data flows
• Scale easily, to keep up with increasing data volumes
• Span all datatypes
• Align IT with the business
• Empower remote and mobile workers

The Limits of Existing Data Architectures

Current methods of managing data that attempt to meet all the objectives using data warehouses and data lakes frequently don’t succeed, because they never include all the data that is needed. But they still remain important components in a larger distributed data landscape.

Although data warehouses can solve your integration challenges for much of your data, they never actually integrate all the data. Additionally, they’re inflexible. You won’t get the agility you need to respond to your users’ requirements. Finally, applying AI technologies like machine learning (ML) is a more demanding task than most data warehouses can cope with, in terms of both the volume of data required and the complexity of the integrations.

Alternatively, data lakes can hold unstructured as well as structured data, but it can be difficult to actually find and integrate different datasets as a lake continues to grow. The more data that is placed into a data lake, the more difficult it is to manage it, let alone squeeze value from the vast quantities.
The popular term for this scenario is data swamp, and it’s something you definitely want to avoid. Although data lakes can be good options for inexpensively processing large and relatively simple datasets, they struggle to manage the complex, multifaceted data that businesses today want to locate and analyze swiftly for immediate insights.

What Success Looks Like

If you manage to address all the challenges, your rewards will be substantial. Here’s a taste of what’s to come. With a data fabric you will get the opportunity to do the following:

Fuel your data-driven business
    Support multiple, diverse users and use cases with a modern, distributed data architecture, shared data assets, and optimized data management and integration processes.

Accelerate value realization
    Accelerate time to value by unlocking your distributed on-premises, cloud, and hybrid cloud data, no matter where it resides, and delivering it at the pace of your business.

Empower your people with timely, consistent, and trusted data
    Democratize data access to arm business users with all the data required to make faster and more accurate business decisions. Empower remote and distributed workers as much as your traditional office workers.

Benefit from technology innovation sooner
    Embrace new data and analytics technology advancements such as data science, real-time data, and the cloud faster to stay ahead of your competition.

Save time and money
    Streamline data management and integration processes and pipelines via an optimized combination of intelligent, converged data management and integration capabilities that embed AI/ML and business self-service.

Govern and comply with confidence
    Ensure proper data governance and control so that you can deliver the right data at the right time, securely, and in compliance with your ever-changing regulatory landscape.

To achieve all this you need a data fabric.
We’re going to define a data fabric more precisely in Chapter 2, as there are various conflicting definitions for it. Although it is a relatively new term, the importance of what it does is not new. For years, enterprises have struggled to integrate all their data into a single, scalable platform. A data fabric describes a comprehensive way to achieve that goal.

CHAPTER 2
What Is a Data Fabric?

Let’s start with what a data fabric isn’t. It is not a single product or even a single platform. You can’t buy and deploy it overnight. It is an architecture. And a journey.

The good news is that you don’t have to rip and replace your existing technology. A data fabric encompasses the data ecosystem you have in place. Neither do you need to be beholden to a single vendor. You can choose best-of-breed solutions and, in theory at least, they should all work together within your data fabric.

To summarize what we discussed in Chapter 1, with a data fabric your users will get to spend more time analyzing their data than wrangling it. And other consumers of data (think systems and applications) will get access to integrated data. It’s as simple as that. The data fabric is there to make it easier for anyone to find and access trusted data.

This is the frame for our entire data fabric discussion: that a data fabric will drive the old 80/20 rule (now 70/30) to increasingly favorable proportions. Some people call it data intelligence rather than data fabric, because it makes it easier for users and systems/applications to intelligently find, work with, and clean data, and apply AI models to it.

So what is a data fabric? A data fabric is a modern, distributed data architecture that includes shared data assets and optimized data management and integration processes that you can use to address today’s data challenges in a unified way.

Despite what many vendors might claim, a data fabric is not a single product or specific platform that you can simply buy and insert into your existing data architecture. It includes architecture, shared data assets, and data management and integration technology. A data fabric supports the following:

Data for all users and use cases
    Provides timely, trusted, reusable data for a wide range of analytical, operational, and governance use cases, as well as business self-service users

Data from any and all sources
    Accesses, combines, and transforms both in-motion and at-rest data from across a diverse, distributed data landscape using metadata, models, and pipelines

Data that spans any environment
    Flexibly spans distributed on-premises, hybrid, and multicloud environments

In short, a data fabric’s job is to connect any kind of data to anywhere and anyone (or anything). That’s admittedly a tall order, as IT systems are getting more complex as users demand simplicity for easier, faster decision-making. A data fabric addresses both needs.

Let’s be very clear that many of the components that make up a data fabric are not new. They’re constantly evolving, true, especially when the cloud is involved. But it’s the combination of them that creates this new thing, this data fabric.

Here are some of the components of a typical data fabric:

Data catalog
    Allows you to categorize, access, and collaborate around company data across multiple data sources, while enforcing strong governance and access management.

Master data management
    Involves creating a single master record for all business data from across both internal and external data sources.

Metadata management
    How you manage the data that describes other data (the metadata). It involves establishing policies and processes that ensure information can be integrated, accessed, shared, analyzed, maintained, and governed across your organization.

Data preparation/data quality
    Software that analyzes information and identifies incorrect, incomplete, or improperly formatted data. Data quality tools cleanse or correct that data based on rules you establish.

Data integration
    The process of taking data from different sources and combining them into a single view. Integration begins with data ingestion and includes cleansing; extract, transform, and load (ETL) processes; and transformation. By integrating data, you make it possible for users to deploy analytics tools to produce actionable business intelligence.

Data analytics
    The process of examining data to spot trends and draw conclusions about the information it contains.

Data visualization
    Gives you a way to see what the data is telling you. Rather than being presented in a spreadsheet, table, or some other numerically intensive format, the data is graphically represented by such visual entities as charts and graphs. This makes it easier to grasp the trends or messages embedded in the data.

Data governance
    Enforces data-related policies and maintains data quality. It helps users establish guidelines, processes, and accountability to make sure data quality remains satisfactory.

According to a report by Allied Market Research, the global data fabric market, which was a fledgling $812.6 million niche in 2018, is estimated to hit $4.54 billion by 2026, representing an impressive compound annual growth rate (CAGR) of 24% from 2019 to 2026.[1] The key factors contributing to this growth are the increasing digitalization of various industries, and IoT, AI, and ML adoption (see Figure 2-1).

[1] “Data Fabric Market is Expected to Reach $4.54 Billion by 2026, Says Allied Market Research,” GlobeNewswire Inc., November 24, 2020, https://oreil.ly/Q16KA.

Figure 2-1. Industries adopting data fabrics, 2018–2026 (Source: Allied Market Research)

How the Remote and Mobile Workforce Has Accelerated Growth of the Data Fabric Market

The first year of the COVID-19 pandemic saw much upheaval. The need for businesses to instantaneously transform into virtual organizations, smooth out disrupted supply chains, and in many cases, completely upend their go-to-market strategies drove them to take the following actions:

• To establish business continuity, many organizations hastened their in-progress digital transformations, in some cases achieving in months what they had predicted would take years.
• Work-from-home mandates meant data had to be securely accessible from anywhere, which necessitated that businesses deploy a data fabric.
• The mix of data being created and consumed became richer, as it included much more video communication and more downloaded and streamed video, which underscored the need for businesses to deploy a robust data fabric.

The Architectural Pattern of a Data Fabric

The architecture of a data fabric (Figure 2-2) is organized into a pipeline of five stages:

Stage 1
    Data is collected, often in real time, by a system or person. It could be a customer service representative interacting with someone on the phone. It could be a transactional database. Or it could be a drone or IoT device with a sensor capturing a constant stream of bits and bytes. You possess one or more of these sources of raw data.

Stage 2
    The data collected in Stage 1 is extracted and loaded into the database. To do this, you need either extract, transform, and load (ETL) or, more recently, extract, load, and transform (ELT) tools and processes. Data quality is ensured during this stage as well, using various deduplication and data cleansing tools.
    These are critical, because the single biggest problem with data is that it’s laden with mistakes, which can occur through manual entry, by sensors sending erroneous information, or by a person or system becoming disconnected from the network and causing gaps in the data. This is where you need to deploy validation and release tools and processes.

Stage 3
    The data is stored in the database, data warehouse, or data lake. At the same time, streaming data is being sent from the source to the final data store.

Stage 4
    The data is transformed, optimized, and, most importantly for a data fabric, virtualized. This is the so-called last mile of getting data to users. It’s when you have to do a lot of management associated with the data: master data management (MDM), metadata management, and reference data management. Data science and AI models are part of the fabric in this stage, because data scientists synthesize and create new data based on observations made from old data.

Stage 5
    The data is delivered to users via tools they can use to browse through, find, and manipulate it. These tools include visualization tools as well as data catalogs and data stores.

Figure 2-2. The data fabric architecture

Analytics on the Edge

The most important achievement of data fabrics is that they enable action at the speed of your business, which is often in real time. This involves capturing, unifying, and making data available through different systems to be analyzed from wherever it’s needed.

In today’s distributed and mobile world, data could be needed literally anywhere in the world. And because data is usually disparate and distributed, analytics must be as well. So, analytics tools shouldn’t be centralized, but need to be accessing data all the way out at the edge. That’s why data fabrics are so important.
If you have a well-defined data fabric, you can move analytics around your environment in a way that you could not do previously.

Building a Data Fabric Is a Journey

The first thing businesses want to know about their data is whether they can access it. Next, they want to know whether they can define and add value to it, and then whether they can gain insights from it. The last thing businesses ask when it comes to their data is whether they can make it available quickly enough to derive value from it using analytics.

Many organizations spend a great deal of time, energy, and money trying to determine whether they can access their data, and define and add value to it. But they don’t always leverage it with analytics to gain insights. And almost all fail to learn from their insights and then optimize to close that loop. In the future, when data fabrics become commodities, businesses will compete on how well they execute in a closed-loop analytics system.

Again, we reiterate that building a data fabric is not about purchasing and deploying a single solution. Nor does it happen overnight. It’s a journey during which you gradually put all of the many pieces in place (see Figure 2-3).

Figure 2-3. The data fabric journey

You move from working with use-case-specific data to domain-specific data, taking care to engineer processes and choose tools that will enable you to be consistent and repeat your successes. Only after you have achieved that should you move to enterprise-scale data services, where the data fabric supports the data-driven needs of the entire organization.

Embedding AI/ML capabilities within data-fabric-enabling technology can help address rising complexity and reduce workloads by automating manual processes such as data discovery and matching, data model design, and query optimization.
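
One of the manual processes named above, data matching, can be illustrated with a small sketch. The example below is not an AI/ML model and is not taken from any data fabric product. It uses plain string similarity from Python’s standard library (difflib.SequenceMatcher) as a simple stand-in, to show the kind of task such tooling automates: suggesting which business glossary term a technical column name probably refers to. The glossary terms, the column names, the 0.6 threshold, and the function names are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical business glossary and technical column names (illustrative only).
BUSINESS_TERMS = ["customer name", "order date", "invoice amount"]
TECHNICAL_COLUMNS = ["cust_name", "order_date", "ZX9"]


def normalize(name):
    """Lowercase a name and treat underscores as spaces before comparing."""
    return name.lower().replace("_", " ").strip()


def similarity(column, term):
    """Return a similarity ratio between 0 and 1 for a column name and a term."""
    return SequenceMatcher(None, normalize(column), normalize(term)).ratio()


def suggest_matches(columns, terms, threshold=0.6):
    """Map each column to its most similar term, or None if below the threshold.

    The threshold is an assumed cutoff, not a recommended value.
    """
    suggestions = {}
    for column in columns:
        best_term = max(terms, key=lambda term: similarity(column, term))
        if similarity(column, best_term) >= threshold:
            suggestions[column] = best_term
        else:
            suggestions[column] = None
    return suggestions


if __name__ == "__main__":
    matches = suggest_matches(TECHNICAL_COLUMNS, BUSINESS_TERMS)
    for column, term in matches.items():
        print(f"{column!r} -> {term!r}")
```

In practice, suggestions like these would most likely be reviewed by a person before being recorded anywhere; the sketch only shows the shape of the task, not how any particular tool performs it.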
Additionally, tools that make it easy for “citizen” data analysts, data scientists, and “citizen” data engineers to contribute to your data fabric journey can help you expand your pool of data-analytics-enabled people.

Finally, with subject matter expertise so critical to data quality, data governance, and the responsible use of data and models, the closer your business users are to the various aspects of your data fabric, the more successful you will be. Your business-domain experts will help you monitor your data at the grassroots level to ensure it is being used wisely and well.

CHAPTER 3
How to Get Started

The business value of a data fabric is clear. It provides one place to go for data, for better and faster insights. It offers consistent, high-quality, governed, and secure data to everyone throughout your organization. It simplifies your journey to data democratization. And your users will spend less time searching and more time analyzing.

Table 3-1 shows five prime use cases for data fabrics, the challenges they solve, and the benefits they bestow.

Table 3-1. Use cases for data fabrics

Use case: Customer engagement (customer 360)
Data challenges solved by data fabric:
• Customer data is spread across multiple systems, making it difficult to truly understand customers.
• Incomplete data impacts sales and marketing effectiveness.
Benefits: One place to go for customer data:
• Revenue acceleration
• Lower cost of sales
• Higher return on marketing spend

Use case: Line-of-business (LOB) operations (operations 360)
Data challenges solved by data fabric:
• LOB operations data is distributed in silos.
• Incomplete data prevents optimizing resources, increases costs, and reduces customer satisfaction.
Benefits: One place to go for LOB data:
• Lower operational costs
• Faster response to changing operational dynamics

Use case: New-product innovation
Data challenges solved by data fabric:
• Lack of access to data inhibits collaboration spanning multiple groups and systems.
• Lack of real-time data access deters speed-to-market and optimization of R&D resources.
Benefits: One place to go for R&D data:
• Faster time to market for new products
• Higher return on R&D spend

Use case: Compliance
Data challenges solved by data fabric:
• Systems are built for operations, not compliance.
• It’s difficult to meet evolving compliance requirements and variations of rules across geographies and LOB operations.
Benefits: One place to go for compliance data:
• Comply faster, with fewer resources
• Stay out of jail, stay out of the media, and avoid fines

Use case: Risk management
Data challenges solved by data fabric:
• Systemic risk spans organizational and system data silos.
• There is no complete view of risk factors and metrics.
Benefits: One place to go for risk management data:
• Provide a full view of risk
• Help avoid catastrophes

Create a Data Fabric Center of Excellence

Most users aren’t interested in technology frameworks and architectures. Although they rely on them, they don’t need to be exposed to them. The most successful data fabrics are organized by creating a center of excellence (CoE) around analytics and data. A data fabric CoE allows you to abstract the technology away from the business users, giving you a better chance of truly democratizing it. In effect, a CoE is a “village” of expertise within your business. And it does take a village to leverage analytics in a highly scaled way across big enterprises. Your users, from business analysts to the CEO, just need to trust that village.

Five Pieces of Advice for Getting Started on a Data Fabric

As with most important business initiatives (and make no mistake, data fabrics are important business initiatives), support for a data fabric must come from the very top of your organization. The C-suite must be stakeholders. And that means more than the CIO or CTO. The CEO and COO should be involved, too.
Another piece of advice common to most technology ventures is to start small. Pick a project, consolidate a small data fabric underneath that project, and then build on that success. A lot of companies start with a large-scale deployment, and three years later they’re still working on it, with little sign of closure.

In addition to these two pieces of advice that are given for most technology initiatives, here are five more suggestions that are specific to building a data fabric.

One: Virtualize, Don’t Centralize, Your Data

Centralizing your data was a best practice three decades ago, and many enterprises are still struggling to make that vision a reality, even as data volumes and complexity grow. But today, with data being generated and stored everywhere from the data center to the edge to the cloud, and needing to be accessible to users who could be located anywhere in the world, having a single centralized “system of record” doesn’t work anymore.

Data virtualization is the answer. Data virtualization is a way to manage data so that users (or applications) can access it, retrieve it, and use it without having to know where it is physically located or how it is formatted. It can make all your data sources look as though they are located in one centralized data store even when they are physically located all across the planet.

In a data fabric, you want to create virtual views of your data, not actually move the data to a centralized location. You can apply security controls and authentication protections as though they were all in one place. And it makes things a lot easier for your business users. For a robust data fabric, you definitely want to virtualize, not centralize, your data.

Two: Build an Intelligent Data Fabric by Integrating AI into It

Even as you’re trying to democratize your data, there’s the beginning of a movement to do the same with AI.
By embedding AI into the data fabric itself, you make it easier for your data scientists to build algorithms and, more importantly, to pass those algorithms on to your business users for faster and smarter decision-making.

This is important, because to derive the most value from your data, you can’t let these AI models only be used by the data team or languish in a digital vacuum. You need to operationalize your models, which means deploying these ML algorithms across your entire organization.

For example, data scientists could build predictive analytics models and embed them in the data fabric so that business intelligence tools like Tableau or Power BI can handle requests such as the following:

    Show me how many tractors we sold in Wyoming last month, and tell me how many are likely to be sold there over the next six months.

Users can then ask increasingly complex questions and the models will parse the data to answer them. Giving users access to these machine learning models with just a few clicks dramatically accelerates data democratization and leads to better business outcomes.

Three: Automate Virtualization of Your Streaming Data

In 2021, automation is the name of the game. Whether you’re talking about actual, physical robots that work assembly lines in modern factories, or software robots (bots) that automate digital business processes, enterprises are trying to automate everything that can be automated. Automation requires data. So, ubiquitous automation means making data universally available to applications through virtualization, as we discussed in the first piece of advice.

But things get more complex when you talk about streaming data, also called data in motion. Unlike data at rest (data that’s stored in a database), streaming data comes into your organization every day in countless ways: from your website, ecommerce store, the internet, embedded sensors on network devices, and more.
But unfortunately, the majority of data visualization tools only work on data at rest.

That’s why getting up-to-the-second readings of, say, your data center’s electric meter is so difficult. To do this, the system needs to retrieve data that can change several hundred times per second.

You can try to tame streaming data by putting it in a database. But this is like trying to watch a movie as a series of still photographs. So much is lost. If you do this in business, opportunities are lost.

That’s why you need streaming data virtualization to be automated. Doing this connects data in motion to business systems, transforming each piece of data (each “event”) into a row in a table. This is done in real time. Streaming data virtualization then revitalizes those events. By automating streaming data, you give your business users access to all your enterprise’s data in motion, such as network device readings, drone data, and even weather forecasts, to analyze and derive value from it.

Four: Create a Data-As-A-Service Offering for Your Users

Like all “as-a-service” technologies, data as a service (DaaS) is based on the idea that data is a product that can be given to users (or applications) on demand via the cloud, no matter where they are located or what kind of device they are using.

This is good for users, because the burden of managing the data (which can get quite complex) and the storage system(s) that house it falls onto the DaaS provider. The user just takes what is needed and doesn’t worry about any of the backend mechanisms. At its most basic, DaaS is just a new way to allow users to access data easily and without hassles.

DaaS eliminates the need to put REST interfaces on your data. Instead, you can directly create APIs from data, which can be integrated with API management tools that turn those APIs into managed services. This makes it simple to provide data to your users in a safe and easy way.
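As a rough illustration of the DaaS idea described above, in which consumers ask for data by name and never see the backend, here is a minimal Python sketch. It is not a real DaaS product or any vendor's API. The `get_data` function, the `CATALOG` mapping, the dataset names, and the sample values are all hypothetical, and an in-memory SQLite table and a CSV string stand in for remote stores. A production service would presumably sit behind a network API with authentication and governance controls.

```python
import csv
import io
import sqlite3

# Hypothetical backend 1: a relational store.
# An in-memory SQLite database stands in for a warehouse.
_db = sqlite3.connect(":memory:")
_db.execute("CREATE TABLE orders (region TEXT, units INTEGER)")
_db.executemany("INSERT INTO orders VALUES (?, ?)", [("WY", 12), ("CO", 30)])

# Hypothetical backend 2: a CSV export.
# A string stands in for a file held somewhere else entirely.
_CSV_TEXT = "device,reading\nmeter-1,41.5\nmeter-2,39.0\n"


def _orders():
    """Fetch order rows from the relational backend as dicts."""
    cur = _db.execute("SELECT region, units FROM orders ORDER BY region")
    return [{"region": region, "units": units} for region, units in cur]


def _readings():
    """Fetch meter readings from the CSV backend as dicts."""
    rows = csv.DictReader(io.StringIO(_CSV_TEXT))
    return [{"device": r["device"], "reading": float(r["reading"])} for r in rows]


# The catalog maps a business-friendly dataset name to whatever fetches it.
CATALOG = {"orders": _orders, "meter_readings": _readings}


def get_data(dataset, **filters):
    """Return rows of `dataset`, keeping only rows that match every filter."""
    if dataset not in CATALOG:
        raise KeyError(f"unknown dataset: {dataset}")
    rows = CATALOG[dataset]()
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in filters.items())
    ]
```

A caller would write `get_data("orders", region="WY")` or `get_data("meter_readings")` in exactly the same way, even though one dataset lives in a relational table and the other in a CSV file. That uniformity is the point of the sketch: the burden of knowing where and how the data is stored stays with the provider.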
Five: Create and Nurture a “Data Curation” Culture

Your data is a valuable asset, probably the most valuable asset your organization has. The irony is that you can’t protect it by keeping it locked up. The value comes from using it. But you have to be careful that it doesn’t get abused, not just by external hackers or cybercriminals, but by your employees, who may inadvertently misuse or mishandle data.

Because of this, your data team should, of course, include data stewards, who are the designated guardians of the data and who put processes in place to ensure that it is used appropriately. They are also data advocates, and attempt to get users enthusiastic about, and careful in handling, the rich troves of data your organization possesses.

But you shouldn’t stop at deputizing formal stewards. You should make everyone aware of the importance of clean, trusted data, and work together to ensure that this very valuable corporate asset is both used and cared for appropriately.

Here’s an example: In 2020, when COVID-19 hit, Panera Bread pulled off a dramatic pivot. The fast-food, fresh-lunch restaurant transformed its business model within 10 days. In addition to selling prepared food, Panera decided to sell milk, yogurt, tomatoes, and avocados. Panera leveraged its team of 50,000 people and a network of 2,000 stores to deliver items in 40 minutes through its delivery network of 10,000 drivers, locations with pick-up, and on Grubhub.¹

Panera’s transformation was made possible by a culture of data curation. Faced with increasing data, new complexity, and manual processes, Panera CIO John Meister established a vision called “One Panera.” One Panera placed responsibility for data in the hands of business users as well as partners. Instead of putting data behind walls, One Panera democratized it.
Today, Panera Bread’s Menu Master Data Space provides a single view of 135,000 prices for items, including price tiers and categories, that its own business users curated. This was possible because of the data tools Panera put into place that did the following:

• Enabled knowledge workers to curate and monitor business data in a self-service way
• Easily extracted metadata from databases
• Incorporated AI models created by data scientists
• Helped knowledge workers identify and fix quality issues

When integrated with Panera Bread’s data fabric, these tools effectively made the entire company a part of the data team, making possible the business agility that Panera so deftly demonstrated.

1 “Panera Cooks with Data to Deliver on Service and Satisfied Customers,” TIBCO Software, Inc., accessed May 25, 2021, https://oreil.ly/X7wgd.

Best Practices When Managing and Growing Your Data Fabric

Once you have your data fabric in place, you will want to maintain and enhance it over time.

Rome was not built in a day. Nor will a data fabric be. Start your data fabric journey with small wins that deliver on immediate needs, prove new architecture concepts, and deliver value. Upon those successes, you can grow purposefully toward your vision.

Embrace Technology Evolution and Convergence

Your business won’t stand still. And technology never does. Not only is technology improving, but a huge convergence is underway that you can leverage when you modernize your technology as part of data fabric adoption.
So, look for modern, convergence-enabling solutions that span traditional technology silos, such as:

Analytics convergence
Self-service visualization, data science, streaming analytics, and reporting

Data management convergence
Metadata management, MDM, data governance, data quality, data cataloging, data modeling, data integration, and data security

Cloud convergence
Storage, computing, development tools, applications, and marketplaces

Data analytics and transactional applications convergence
Converging technologies and teams

IT/Operational technology (OT) convergence
Data in motion and data at rest

Make Sure Your Data Fabric Is Truly Holistic

Do you really consider data to be your most important asset? If so, to drive the highest impact from it you need to stop treating your transaction data, streaming data, metadata, master data, and reference data assets as unique, unrelated domains and instead begin managing what is common across your data assets. A holistic data fabric enables synergies that simplify governance, security, control, and much more. More holistic approaches are often enabled by data management technology convergence in general, and via closer integration within the fabric in particular. For example, choose a single MDM platform capable of managing customer, product, employee, location, and other key datatypes consistently, rather than using different MDM platforms for different domains.

Support Today’s Distributed Data Analytics Topology

Data is anywhere and everywhere. The same can be said about your data consumers. With so much data and so many users, it’s impossible to support traditional data centralization paradigms such as data warehouses and even the more modern data lakes and lake houses, which today are just additional data sources. You need to embrace decentralization, as it is a much better solution for today’s distributed data analytics topologies.
Look for a data fabric that can span data from any source and deliver it to any consumer, anywhere.

Augment Your People Using AI and ML

Most organizations lack sufficient numbers of skilled IT staff members and are finding it difficult to recruit and retain data experts in today’s competitive markets. AI and ML are obvious solutions. According to a recent survey, more than 50% of companies have completed at least one AI initiative, with 66%, on average, seeing increased revenue across a broad range of functions (Figure 3-1).²

Figure 3-1. AI is returning high ROI (Source: McKinsey)

2 “The State of AI in 2020,” McKinsey & Company, November 17, 2020, https://oreil.ly/qPnJA.

Keep It Open and Flexible

Stay open to change and build an architecture that is meant to evolve. Furthermore, because of the risk inherent in big-bang rearchitecture initiatives, remember that much of your current state will continue to change for a long time. So, leverage open standards and APIs to facilitate the inevitable transitions. Finally, start your future-state data architecture journey with small wins that prove new architecture concepts and deliver value. Upon those successes, you can grow purposefully toward larger architecture transformations.

Design It for Easy Decoupling and Layering

Decoupling lets you separate how you manage data from how you consume it. As a result, you can manage each datatype optimally (within the original source, in an on-premises data warehouse, in a cloud data lake, on an edge device, or wherever it makes the most sense from a storage and management point of view) and keep it independent from however you choose to provision it to your business users. Decoupling also helps your business users because it lets you free them from knowing IT internals such as schemas and syntax.
Instead, you can provide a consistent, secure, and governed common data layer consisting of shareable, business-friendly data objects that make it easy for your business users to find and use all of your most important data.

Migrate Intelligently

Data virtualization lets you insulate your consumers as you migrate your data sources and insulate your sources as you migrate your consumers. And when you’re finished, you can continue to drive value from these decoupled data objects.

Don’t Over-Innovate

Keep privacy and governance in mind. Just because you can do something with your data doesn’t mean you should. A major retailer got into trouble several years ago by using its data too cleverly, in a way that violated the privacy of its consumers.³ Always think of the origin of the data, and make sure you don’t harm customers or partners by using it to take action.

3 Charles Duhigg, “How Companies Learn Your Secrets,” New York Times Magazine, Feb. 16, 2012, https://oreil.ly/p7QI7.

Keep Your Processes Standard

Focus on a process layer to leverage what you’ve built in a repeatable, governed, standardized way. Just integrating systems together is not enough. You must have processes and controls in place to make sure the system is used consistently throughout your organization. It’s like having a well in the center of a village, where everyone goes to collect water. The bucket needs to stay the same. The rope needs to stay the same. The crank needs to stay the same. So, standardize as many of the functions and processes of accessing and using data as possible.
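The decoupling and migration advice in the preceding sections can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: `DataLayer`, `bind`, `read`, and the two stand-in sources are hypothetical names invented for this example, not part of any data virtualization product. It shows a logical data object ("customers") being rebound from one physical source to another while the consumer code, which uses one standard access path, stays unchanged.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Source = Callable[[], List[Row]]


class DataLayer:
    """Hypothetical common data layer: consumers ask for objects by name."""

    def __init__(self) -> None:
        self._bindings: Dict[str, Source] = {}

    def bind(self, name: str, source: Source) -> None:
        """Attach (or re-attach) a physical source to a logical data object."""
        self._bindings[name] = source

    def read(self, name: str) -> List[Row]:
        """The single, standard access path that every consumer uses."""
        return self._bindings[name]()


def legacy_warehouse() -> List[Row]:
    # Stand-in for an on-premises warehouse that returns tuples.
    raw = [("C1", "Ada"), ("C2", "Lin")]
    return [{"customer_id": cid, "name": name} for cid, name in raw]


def cloud_lake() -> List[Row]:
    # Stand-in for a cloud data lake holding the same records in another shape.
    raw = [{"id": "C1", "full_name": "Ada"}, {"id": "C2", "full_name": "Lin"}]
    return [{"customer_id": r["id"], "name": r["full_name"]} for r in raw]


def customer_names(layer: DataLayer) -> List[str]:
    """Consumer code: depends only on the logical object, not on the source."""
    return sorted(row["name"] for row in layer.read("customers"))


layer = DataLayer()
layer.bind("customers", legacy_warehouse)
before = customer_names(layer)

# Migrate the source. The consumer function above is not touched.
layer.bind("customers", cloud_lake)
after = customer_names(layer)
```

In this sketch `before` and `after` hold the same list of names, because each source adapter maps its own schema onto the shared, business-friendly shape. The design choice worth noting is that schema knowledge lives in the adapters, so business users never need to learn it.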
Conclusion: It’s Time to Act

According to NewVantage Partners’ latest data survey, most organizations have a lot of work to do when it comes to getting value from their data:⁴

• Only 48.5% drive innovation with data
• Only 41.2% compete on analytics
• Only 39.3% manage data as a business asset
• Only 30.0% have a well-articulated data strategy
• Only 24.4% forge a data culture
• Only 24.0% create a data-driven organization

4 “Big Data and AI Executive Survey 2021,” NewVantage Partners LLC, January 2021, https://oreil.ly/hHRKP.

Clearly, it’s time to do something about this. Building a data fabric is the answer. A data fabric enables organizations to become truly data driven, empowering them to meet the demands of their businesses and gain a competitive edge.

Other benefits of building a data fabric encompass removing data silos, gaining control over your data, managing it consistently across multiple environments (on premises, cloud, hybrid, and edge), and reducing hassles of data integration. When you have your enterprise-wide data fabric in place, you will be able to do the following:

• Perform data management processes on a single unified platform
• Pull and connect or collaborate on data from disparate sources across locations
• Manage data across all environments (multicloud, hybrid, and on premises)
• Allow single, seamless access and control to data across sources and types
• Provide analytics tools and connectivity to other analytical solutions
• Offer metadata functionality with data currency and data lineage capabilities

With a data fabric, companies can design, collaborate, transform, and manage data regardless of where it resides or is generated.

In summary, a data fabric is not a single tool or solution that you put into place overnight.
Instead, it is the culmination of all the many sophisticated tools that have been created to manage data, from identifying and tagging it, to cleansing it, moving it, governing it, and analyzing it. Plus, something happens when you bring all this together in an architecture: the whole becomes greater than the sum of its parts. You have a vision for the data-driven future.

About the Author

Alice LaPlante is an award-winning writer, editor, and teacher of writing, both fiction and nonfiction. A Wallace Stegner Fellow and Jones Lecturer at Stanford University, Alice taught creative writing both at Stanford and in San Francisco State’s MFA program for more than 20 years. A New York Times best-selling author, Alice has published four novels and five nonfiction books. She has also edited best-selling books for many other writers of fiction and nonfiction. She regularly consults with Silicon Valley firms such as Google, Salesforce, HP, and Cisco on their content marketing strategies. Alice lives with her family in Palo Alto, California, and Mallorca, Spain.