Data Fabric as Modern Data Architecture
by Alice LaPlante
Copyright © 2021 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://oreilly.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.
Acquisitions Editor: Jessica Haberman
Development Editor: Gary O’Brien
Production Editor: Kate Galloway
Copyeditor: Audrey Doyle
Proofreader: Christina Edwards
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea
June 2021: First Edition
Revision History for the First Edition
2021-06-02: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Fabric as
Modern Data Architecture, the cover image, and related trade dress are trademarks
of O’Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the
publisher’s views. While the publisher and the author have used good faith efforts to
ensure that the information and instructions contained in this work are accurate, the
publisher and the author disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of or
reliance on this work. Use of the information and instructions contained in this work is
at your own risk. If any code samples or other technology this work contains or
describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and TIBCO. See our statement
of editorial independence.
978-1-098-10592-1
[LSI]
Table of Contents
Introduction
1. Why Build a Data Fabric?
   The Limits of Existing Data Architectures
   What Success Looks Like
2. What Is a Data Fabric?
   The Architectural Pattern of a Data Fabric
   Building a Data Fabric Is a Journey
3. How to Get Started
   Five Pieces of Advice for Getting Started on a Data Fabric
   Best Practices When Managing and Growing Your Data Fabric
   Conclusion: It’s Time to Act
Introduction
Your business faces major changes due to digital transformation.
The only way to thrive during the complex transitions that are the
inevitable part of your transformation journey is through data. By
treating your data as the strategic asset that it is, you can successfully
complete your journey in a way that differentiates you from the
competition.
The good news is that most businesses today understand this:
according to an Ernst & Young survey, more than 80% of organizations
view data as a strategic asset.1 But this doesn’t mean they’re acting
in a way that allows them to get the most from their data: fewer
than half (49.5%) have put a formal data strategy in place, and fewer
than 30% of financial executives said they have fully weighed the
costs of poor-quality data.
This is because becoming data driven and accomplishing the admittedly
ambitious goal of fully democratizing your data is not particularly
easy. Each of the “five Vs” of big data (volume, velocity,
variety, veracity, and value) has its own challenges. But the one
we’re going to focus on in this report is arguably one of the top
challenges to getting the most out of your data: variety. And not just the
variety of data structures, formats, and types, but the variety of data
meanings (i.e., semantics).
1 “Data: The Strategic Asset,” Financial Executives Research Foundation, Inc., November
2019, https://oreil.ly/uEflr.
Of the five Vs, variety refers to the different types of data that can
exist. When data variety is high, the complexity of the data increases,
which is the chief reason businesses are seeking data fabrics:
they have X sources of data, and every source has hundreds of
tables, each with dozens of columns. At the same time, with all these
sources of data they must serve Y users or use cases, each requiring
slightly different data.
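To get a feel for the scale described above, the short sketch below multiplies the quantities out. Every number in it is a made-up placeholder standing in for the "X sources" and "Y users" of the text; none comes from a survey or from this report.

```python
# Illustrative arithmetic only: all counts are assumed placeholder values.
sources = 20             # "X sources of data" (assumed)
tables_per_source = 300  # "hundreds of tables" (assumed)
columns_per_table = 40   # "dozens of columns" (assumed)
consumers = 50           # "Y users or use cases" (assumed)

# Total number of columns someone would have to understand.
total_columns = sources * tables_per_source * columns_per_table

# Number of distinct source-to-consumer pairings that might each need
# a slightly different slice of the data.
source_consumer_pairs = sources * consumers

print(f"columns to understand: {total_columns:,}")
print(f"source/consumer pairings: {source_consumer_pairs:,}")
```

Even with modest placeholder values, the number of columns and pairings grows multiplicatively, which is the complexity the paragraph describes.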
Whether data is structured or unstructured is only the beginning of
the complexity facing businesses today. Most are familiar with these
two categories (three if you add semistructured data) and have figured
out ways to integrate them. But there are a number of other
challenges specifically concerning the variety of data. Chief among
them is performing analytics with mixed-modal data, since traditional
analytics is designed to work with highly formatted data and
copes poorly with inconsistent or noisy data. This makes it hard to integrate
different types of data, which is why data lakes are
notoriously difficult to manage. Finally, the quality of data that
exhibits a lot of variety can be low.
A subset of data variety is data distribution. That is, we would argue
that it’s not only the different types, but the number of sources that
raise challenges, especially when considering how much data is
being created and stored in the cloud.
Essentially, data is everywhere, and it is all different. This includes
Internet of Things (IoT) data from a distribution warehouse, real-time SAP transactions, and Salesforce or other software-as-a-service
(SaaS) datasets. All of these sources may involve customer data of
some kind, but each has a different purpose and different data
consumers.
All the silos in all the departments, each with its own set of tools and
techniques, business rules, and definitions that must be orchestrated,
also add to the complexity.
Questions arise. Where is the data? What kind of data is it? How can
I get the data to the users who need it?
Centralization implies control, and some companies are still pursuing
the goal of having only one, centralized source of data (we’ll
explain why this is not necessarily such a good idea later in this
report). Unsurprisingly (to us), only 6% of companies have achieved
this, according to a recent survey from the Business Application
Research Center.2
On the other hand, most companies use multiple data sources (see
Figure I-1). Almost one in five companies (18%) use 20 or more
data sources for decision-making, with this number expected to
grow to 50% in the near future, according to the BARC survey. But
the more data sources you have, the more likely it is that data quality
will be a problem. Data governance thus becomes even more critical
as dependence on multiple data sources increases.
Figure I-1. Most companies use multiple data sources for decision-making (Source: BARC)
A recent research report by Dimensional Research found that many
businesses aren’t fully leveraging the data they possess, because these
data variety challenges are making it difficult to build or operate
data pipelines.3 Almost half (44%) say that critical data is not yet
usable for making decisions, and 68% say they can only get further
insights from existing data if they have more time.
Additional findings include the following:
• 59% of companies use 11 or more data sources.
• 98% of participants say their pipelines break.
• 51% say this breakage happens more than once a month.
• 91% say data source availability challenges are the reason why
pipelines break daily.
2 “How Many Data Sources Do Companies Rely On For Decision-Making?” Business
Application Research Center (BARC), accessed May 6, 2021, https://oreil.ly/vmEpi.
3 “New Survey Finds More Than Two-Thirds of Companies Leave Valuable Data Untapped,”
Business Wire, March 11, 2021, https://oreil.ly/p5W4A.
• 66% report less operational efficiency as a result of broken
pipelines.
• 59% report delayed decisions or lost opportunities because of
broken pipelines.
Finally, data complexity can also be caused by data naming conventions.
When businesses use technical data specifications such as
table names and column names instead of the business terminology
users are familiar with, miscommunications and inconsistencies
invariably arise. If a certain kind of data is called different names by
different systems (for example, if the definitions of order entry and
receivables in Salesforce are different from those in SAP), you’ve got
additional complexity to factor in.
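One common way to reduce this naming friction is a shared glossary that maps each system's technical column names onto agreed business terms. The sketch below is a minimal illustration under stated assumptions: the column names, the `GLOSSARY` table, and the `to_business_terms` helper are all hypothetical and are not real Salesforce or SAP field names or APIs.

```python
# Hypothetical glossary: (system, technical column) -> business term.
# All names here are invented for illustration.
GLOSSARY = {
    ("salesforce", "ord_entry_dt"): "order_entry_date",
    ("sap", "ORDER_ENTRY_TS"): "order_entry_date",
    ("salesforce", "recv_amt"): "receivables_amount",
    ("sap", "RECEIVABLE_VAL"): "receivables_amount",
}


def to_business_terms(system: str, record: dict) -> dict:
    """Rename technical columns to shared business terms.

    Columns with no glossary entry are kept under their original name.
    """
    return {GLOSSARY.get((system, col), col): val for col, val in record.items()}


sf_row = to_business_terms(
    "salesforce", {"ord_entry_dt": "2021-03-01", "recv_amt": 120.0}
)
sap_row = to_business_terms(
    "sap", {"ORDER_ENTRY_TS": "2021-03-01", "RECEIVABLE_VAL": 120.0}
)
print(sf_row)
print(sap_row)
```

Renaming only aligns the labels. If the two systems define order entry or receivables differently, as the text warns, the values themselves still need to be reconciled by business rules.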
A Changing World That Needs Data Democratization
In addition to these challenges with the data itself, we’re also challenged
by a world that is in the middle of a major organizational
transition. With the onset of the COVID-19 pandemic, office workers
began working remotely, and many may continue to do so once
social distancing restrictions are fully lifted. In addition, we’re seeing
more mobile workers, and even people working nomadically with
no fixed office.
Indeed, mobile users and so-called digital nomads are causing businesses
to think in new ways about the user experience. Data analysts
who sometimes work from home, sometimes on the road, and
sometimes from a café need the same secure access to data that they
have at the office. Easier, simpler tools are required.
But sometimes it can feel like we’re taking two steps forward and
one step back. By 2019, almost half (48%) of businesses said they
competed using their data, according to the 2019 NewVantage Partners
survey on big data.4 This showed progress; in NewVantage’s
2006 survey, only 5% of large organizations said this.
4 “Big Data and AI Executive Survey 2019: Executive Summary of Findings,” NewVantage
Partners LLC, accessed May 6, 2021, https://oreil.ly/AnyH8.
In NewVantage’s 2020 report, however, the news was not particularly
good. Although investment in data was up, showing that companies
generally realize data’s importance, the pace of that investment was
losing momentum. The percentage of companies investing more
than $50 million in data was 65% in 2020, compared to just 40% in
2019. But only 52% of companies were increasing their rate of
investment, compared to the 92% that were doing this in 2019.5
Worse, only 38% reported that they had created a data-driven organization.
Even fewer (only 27%) had built a data culture. This tells
us that the all-important goal of data democratization is not being
reached. And it’s not necessarily the technology that is holding firms
back. Nine out of 10 companies point to people and process challenges
as the biggest barriers to data democratization.
Opportunities Abound, with the Help of a Data Fabric
By enabling a distributed, mobile workforce and democratizing
data, businesses today can do the following:
• Increase operational efficiencies
• Better calibrate the right pricing for their goods and services
• Personalize sales and marketing initiatives
• Improve the customer experience
• Identify fraudulent transactions
…and much, much more.
Until fairly recently, data scientists and analysts squandered 80% of
their time wrestling with data and spent just 20% exploring it. That
used to be the rule. But IDC’s research director of data integration
and data intelligence software, Stewart Bond, reported last year that
this rule is starting to bend. IDC’s December 2019 data culture survey
found that knowledge workers are spending closer to 30% of
their time finding insights.6 The new 70/30 rule is a substantial
improvement on the 80/20 one.
5 “NewVantage Partners Releases 2020 Big Data and AI Executive Survey,” Business Wire,
Jan. 6, 2020, https://oreil.ly/SPTy5.
This is becoming possible because businesses are organizing and
managing their data in smarter ways, in particular by using something
called a data fabric.
In this report, we’ll first describe the conditions that are pushing the
limits of current data management strategies. Then we’ll explain
what a data fabric is, including its components and its architecture.
We’ll highlight the benefits and some early use cases. Finally, we’ll
provide five pieces of advice for getting started on deploying a data
fabric in your organization, along with some best practices for making
sure you’re doing it right.
6 Stewart Bond, “End-User Survey Results: Deployment and Data Intelligence in 2019,”
IDC, November 2019, https://oreil.ly/EihMu.
CHAPTER 1
Why Build a Data Fabric?
Why do you need this thing called a data fabric? It’s not just because
of the sheer size of your data. You also are faced with access and
integration challenges because of where the data is coming from,
where it’s stored, and in what form. You’ve got data on premises. In
the public cloud. In private clouds. You have data in multicloud and
hybrid cloud ecosystems. Within these various silos, some of the
data is structured but most is unstructured, which raises challenges.
And don’t forget streaming data; that’s an important part of the picture,
too.
What’s the state of enterprise data, then? Fragmented. A full 93% of
enterprises have a multicloud strategy, with 87% having a hybrid
cloud environment in place, according to Flexera’s 2020 State of the
Cloud survey.1 On average, companies have data stored in 2.2 public
and 2.2 private clouds, as well as in various on-premises data repositories
(see Figure 1-1).
Businesses are pushing the limits of what they can do
with existing data management tools.
1 Tanner Luxner, “Cloud Computing Trends: 2021 State of the Cloud Report,” Flexera,
March 15, 2021, https://oreil.ly/skemo.
Figure 1-1. The fragmented state of enterprise data (Source: Flexera)
The reasons for this fragmentation are varied, and include the
following:
Time-to-data-insight is a competitive differentiator
Today nearly every business transformation, whether aiming
for greater customer intimacy, more optimized operations, or
faster innovation, is fueled by data-driven insights. The days
when business users would patiently wait weeks or even months
for IT to deliver new datasets are gone. Not only are your users
demanding rapid responses to their queries, but the competitive
nature of today’s markets requires it. The dilemma is that queries
on databases with billions of records can take hours to
return. The need to change this is urgent, as companies with
data intelligence shared in real time or near-real time are 18
times more likely to make better and faster decisions than their
competitors.2
Demand for self-service data continues to explode
Enabled by easier-to-use, more powerful analytics tools such as
Power BI and Spotfire, business users are demanding more data,
delivered more swiftly. Whether you consider this data democratization
or data chaos, the trend is very real, and data users’
needs must be satisfied for your organization to maintain a
competitive edge.
2 Adam DeMattia, John McKnight, Jennifer Gahm, and Monya Keane, “Research Proves
IT Transformation’s Persistent Link to Agility, Innovation, and Business Value,” The
Enterprise Strategy Group, Inc., March 2018, https://oreil.ly/sAZUW.
Data’s relentless growth and fragmentation are accelerating
The volume of data today is such that no organization can hope
to centralize all its data in one place. Businesses must accept the
fact that data is going to be everywhere and will get used everywhere
by virtually everyone: in computer rooms, on desktops
and mobile phones, on IoT-connected devices on the factory
floor, and by third parties, including customers, vendors, partners,
and more. The dream of centralization is over. Something
needs to replace it.
The increasing complexity of data analytics means the status quo keeps changing
The theory of evolution by natural selection in Darwin’s On the
Origin of Species is not about an organism’s ability to thrive
based on its absolute fitness per se, but its fitness to adapt to an
ever-changing environment. The same is true in the data universe.
Increases in complexity and in the speed of innovation
are wreaking havoc on existing data strategies, architectures,
infrastructure, and more. If your goal is to have a data-driven
competitive advantage, all of these things must be agile, and
must be able to evolve as necessary.
The data analytics skills shortage persists
And don’t forget the human dimension. For everyone from
database analysts to data stewards, data engineers to developers,
and business analysts to data scientists, workloads are expanding
exponentially, far faster than your human resources can
handle. This slows down your ability to get value from your
data and reduces your relative competitive advantage.
And just as there are more data sources, there are more, and different
kinds of, data consumers. In addition to data scientists and data
analysts, you’ve got business users, executives, customers, suppliers,
and partners such as distributors and retailers. You’ve even got
machines as consumers: IoT devices at the edge of your network,
both producing and analyzing data.
Added to this are the demands of all the newly remote and distributed
workers who can be located in the next city, the next state,
or across the planet.
You need a new, flexible solution to cope with all of this—one that
can achieve the following, arguably difficult-to-hit, objectives:
• Simplify data democratization
• Unify your data environment
• Eliminate data silos
• Centrally coordinate data flows
• Scale easily, to keep up with increasing data volumes
• Span all datatypes
• Align IT with the business
• Empower remote and mobile workers
The Limits of Existing Data Architectures
Current methods of managing data that attempt to meet all the
objectives using data warehouses and data lakes frequently don’t
succeed, because they never include all the data that is needed. But
they still remain important components in a larger distributed data
landscape.
Although data warehouses can solve your integration challenges for
much of your data, they never actually integrate all the data. Additionally,
they’re inflexible. You won’t get the agility you need to
respond to your users’ requirements. Finally, applying AI technologies
like machine learning (ML) is a more demanding task than
most data warehouses can cope with, in terms of both the volume
of data required and the complexity of the integrations.
Alternatively, data lakes can hold unstructured as well as structured
data, but it can be difficult to actually find and integrate different
datasets as a lake continues to grow. The more data that is placed
into a data lake, the more difficult it is to manage it, much less
squeeze value from the vast quantities. The popular term for this
scenario is data swamp, and it’s something you definitely want to
avoid. Although data lakes can be good options for inexpensively
processing large and relatively simple datasets, they are constrained
from effectively managing today’s complex, multifaceted data that
businesses want to locate and analyze swiftly for immediate insights.
What Success Looks Like
If you manage to address all the challenges, your rewards will be
substantial. Here’s a taste of what’s to come. With a data fabric you
will get the opportunity to do the following:
Fuel your data-driven business
Support multiple, diverse users and use cases with a modern,
distributed data architecture, shared data assets, and optimized
data management and integration processes.
Accelerate value realization
Accelerate time to value by unlocking your distributed on-premises, cloud, and hybrid cloud data, no matter where it
resides, and delivering it at the pace of your business.
Empower your people with timely, consistent, and trusted data
Democratize data access to arm business users with all the data
required to make faster and more accurate business decisions.
Empower remote and distributed workers as much as your traditional
office workers.
Benefit from technology innovation sooner
Embrace new data and analytics technology advancements such
as data science, real-time data, and the cloud faster to stay ahead
of your competition.
Save time and money
Streamline data management and integration processes and
pipelines via an optimized combination of intelligent, converged
data management and integration capabilities that
embed AI/ML and business self-service.
Govern and comply with confidence
Ensure proper data governance and control so that you can
deliver the right data at the right time, securely, and in compliance
with your ever-changing regulatory landscape.
To achieve all this you need a data fabric. We’re going to define a
data fabric more precisely in Chapter 2, as there are various conflicting
definitions for it. Although it is a relatively new term, the importance
of what it does is not new. For years, enterprises have
struggled to integrate all their data into a single, scalable platform. A
data fabric describes a comprehensive way to achieve that goal.
CHAPTER 2
What Is a Data Fabric?
Let’s start with what a data fabric isn’t. It is not a single product or
even a single platform. You can’t buy and deploy it overnight. It is an
architecture. And a journey.
The good news is that you don’t have to rip and replace your existing
technology. A data fabric encompasses the data ecosystem you
have in place. Neither do you need to be beholden to a single vendor.
You can choose best-of-breed solutions and (in theory at
least) they should all work together within your data fabric.
To summarize what we discussed in Chapter 1, with a data fabric
your users will get to spend more time analyzing their data than
wrangling it. And other consumers of data (think systems and
applications) will get access to integrated data. It’s as simple as that.
The data fabric is there to make it easier to find data in a way that’s
trusted and gives access to anyone. This is the frame for our entire
data fabric discussion: that a data fabric will drive the old 80/20 rule
(now 70/30) to increasingly favorable proportions.
Some people call it data intelligence rather than data fabric, because
it makes it easier for users and systems/applications to intelligently
find, work with, and clean data, and apply AI models to it.
So what is a data fabric?
A data fabric is a modern, distributed data architecture that includes
shared data assets and optimized data management and integration
processes that you can use to address today’s data challenges in a
unified way.
Despite what many vendors might claim, a data fabric is not a single
product or specific platform that you can simply buy and insert into
your existing data architecture. It includes architecture, shared data
assets, and data management and integration technology.
A data fabric supports the following:
Data for all users and use cases
Provides timely, trusted, reusable data for a wide range of analytical,
operational, and governance use cases, as well as business
self-service users
Data from any and all sources
Accesses, combines, and transforms both in-motion and at-rest
data from across a diverse, distributed data landscape using
metadata, models, and pipelines
Data that spans any environment
Flexibly spans distributed on-premises, hybrid, and multicloud
environments
In short, a data fabric’s job is to connect any kind of data to anywhere
and anyone (or anything). That’s admittedly a tall order, as IT
systems are getting more complex even as users demand simplicity for
easier, faster decision-making. A data fabric addresses both needs.
Let’s be very clear that many of the components that make up a data
fabric are not new. They’re constantly evolving, true, especially
when the cloud is involved. But it’s the combination of them that creates
this new thing, this data fabric.
Here are some of the components of a typical data fabric:
Data catalog
Allows you to categorize, access, and collaborate around company
data across multiple data sources, while enforcing strong
governance and access management.
Master data management
Involves creating a single master record for each core business
entity (such as a customer or product) from across both internal and
external data sources.
Metadata management
How you manage the data that describes other data (the metadata).
It involves establishing policies and processes that ensure
information can be integrated, accessed, shared, analyzed,
maintained, and governed across your organization.
Data preparation/data quality
Software that analyzes information and identifies incorrect,
incomplete, or improperly formatted data. Data quality tools
cleanse or correct that data based on rules you establish.
Data integration
The process of taking data from different sources and combining
it into a single view. Integration begins with data ingestion
and includes cleansing; extract, transform, and load (ETL)
processes; and transformation. By integrating data, you make it
possible for users to deploy analytics tools to produce actionable
business intelligence.
Data analytics
The process of examining data to spot trends and draw conclusions
about the information it contains.
Data visualization
Gives you a way to see what the data is telling you. Rather than
being presented in a spreadsheet, table, or some other numerically
intensive format, the data is graphically represented by
such visual entities as charts and graphs. This makes it easier to
grasp the trends or messages embedded in the data.
Data governance
Enforces data-related policies and maintains data quality. It
helps users establish guidelines, processes, and accountability to
make sure data quality remains satisfactory.
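As a rough illustration of the data preparation/data quality component described in the list above, the sketch below checks records against rules you establish and reports which fields are incomplete or improperly formatted. The field names, the `RULES` table, and the `find_issues` helper are assumptions made for this example; real data quality tools are far richer.

```python
import re

# Hypothetical rules: each maps a field name to a check that returns True
# when the value is acceptable. Field names are invented for illustration.
RULES = {
    "email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
}


def find_issues(record: dict) -> list:
    """Return the names of fields that are missing or fail their rule."""
    issues = []
    for field_name, is_valid in RULES.items():
        value = record.get(field_name)
        if value is None:
            issues.append(field_name)   # incomplete data
        elif not is_valid(value):
            issues.append(field_name)   # incorrect or badly formatted data
    return issues


records = [
    {"email": "ana@example.com", "age": 34},
    {"email": "not-an-email", "age": 34},
    {"email": "bo@example.com"},
]
report = [find_issues(r) for r in records]
print(report)
```

Keeping the rules in a plain table, separate from the checking loop, mirrors the idea that the rules are something the business defines and the tooling merely applies.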
According to a report by Allied Market Research, the global data
fabric market, which was a fledgling $812.6 million niche in 2018, is
estimated to hit $4.54 billion by 2026, representing an impressive
compound annual growth rate (CAGR) of 24% from 2019 to 2026.1
The key factors contributing to this growth are the increasing
digitalization of various industries, and IoT, AI, and ML adoption
(see Figure 2-1).
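The growth rate quoted above can be sanity-checked with the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The sketch below assumes the $812.6 million figure for 2018 is the starting value and compounds over the eight years to 2026; the report states the CAGR for 2019 to 2026 but gives no 2019 figure, so this is an approximation under that assumption.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Assumption: use the 2018 market size as the base and 2026 as the end point.
rate = cagr(start_value=812.6e6, end_value=4.54e9, years=2026 - 2018)
print(f"Implied CAGR: {rate:.1%}")
```

Under that assumption the implied rate comes out at roughly 24%, which is consistent with the figure cited in the text.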
1 “Data Fabric Market is Expected to Reach $4.54 Billion by 2026, Says Allied Market
Research,” GlobeNewswire Inc., November 24, 2020, https://oreil.ly/Q16KA.
Figure 2-1. Industries adopting data fabrics, 2018–2026 (Source:
Allied Market Research)
How the Remote and Mobile Workforce Has Accelerated Growth of the Data Fabric Market
The first year of the COVID-19 pandemic saw much upheaval. The
need for businesses to instantaneously transform into virtual
organizations, smooth out disrupted supply chains, and in many
cases, completely upend their go-to-market strategies drove them
to take the following actions:
• To establish business continuity, many organizations hastened
their in-progress digital transformations, in some cases achieving
in months what they had predicted would take years.
• Work-from-home mandates meant data had to be securely
accessible from anywhere, which necessitated that they deploy
a data fabric.
• The mix of data being created and consumed became richer, as
it included much more video communication and more downloaded
and streamed video, which underscored the need for
businesses to deploy a robust data fabric.
The Architectural Pattern of a Data Fabric
The architecture of a data fabric (Figure 2-2) is organized into a
pipeline of five stages:
Stage 1
Data is collected, often in real time, by a system or person. It
could be a customer service representative interacting with
someone on the phone. It could be a transactional database. Or
it could be a drone or IoT device with a sensor capturing a constant
stream of bits and bytes. You possess one or more of these
sources of raw data.
Stage 2
The data collected in Stage 1 is extracted and loaded into the
database. To do this, you need either extract, transform, and
load (ETL), or more recently, extract, load, and transform (ELT)
tools and processes. Data quality is ensured during this stage as
well, using various deduplication and data cleansing tools.
These are critical, because the single biggest problem with data
is that it’s laden with mistakes, which can occur through manual
entry, by sensors sending erroneous information, or by a
person or system becoming disconnected from the network and
causing gaps in the data. This is where you need to deploy validation
and release tools and processes.
Stage 3
The data is stored in the database, data warehouse, or data lake.
At the same time, streaming data is being sent from the source
to the final data store.
Stage 4
The data is transformed, optimized, and (most importantly for
a data fabric) virtualized. This is the so-called last mile of getting
data to users. It’s when you have to do a lot of management
associated with the data: master data management (MDM),
metadata management, and reference data management. Data
science AI models are part of the fabric in this stage, because
data scientists synthesize and create new data based on observations
made from old data.
Stage 5
The data is delivered to users via tools they can use to browse
through, find, and manipulate it. These tools include visualization
tools as well as data catalogs and data stores.
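The five stages above can be sketched as a toy pipeline. This is a deliberately small illustration, not a reference implementation: the sensor readings, the `Store` class, and every function name are hypothetical, and the "virtualized" view in Stage 4 is only a loose analogy (a view computed on read rather than a copy of the data).

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Stage 3: stand-in for a database, data warehouse, or data lake."""
    rows: list = field(default_factory=list)


def collect() -> list:
    """Stage 1: raw readings from a hypothetical IoT sensor feed."""
    return [
        {"device": "d1", "temp_c": 21.5},
        {"device": "d1", "temp_c": 21.5},  # duplicate reading
        {"device": "d2", "temp_c": None},  # gap caused by a dropped connection
        {"device": "d3", "temp_c": 19.0},
    ]


def extract_and_load(raw: list, store: Store) -> None:
    """Stage 2: deduplicate and validate, then load into the store."""
    seen = set()
    for row in raw:
        key = (row["device"], row["temp_c"])
        if row["temp_c"] is None or key in seen:
            continue  # skip gaps and duplicates
        seen.add(key)
        store.rows.append(row)


def virtual_view(store: Store):
    """Stage 4: a transformed view computed on read, without copying data."""
    return (
        {"device": r["device"], "temp_f": r["temp_c"] * 9 / 5 + 32}
        for r in store.rows
    )


def deliver(view) -> list:
    """Stage 5: hand the results to a consumer, here as a plain list."""
    return list(view)


store = Store()
extract_and_load(collect(), store)
result = deliver(virtual_view(store))
print(result)
```

In a real deployment each stage would be a separate tool or service, but the ordering of responsibilities (collect, clean and load, store, transform and virtualize, deliver) is the same.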
Figure 2-2. The data fabric architecture
Analytics on the Edge
The most important achievement of data fabrics is that they enable
action at the speed of your business, which is often in real time.
This involves capturing, unifying, and making data available
through different systems to be analyzed from wherever it’s needed.
In today’s distributed and mobile world, data could be needed literally
anywhere in the world.
And because data is usually disparate and distributed, analytics
must be as well. So, analytics tools shouldn’t be centralized, but
need to be accessing data all the way out at the edge. That’s why data
fabrics are so important. If you have a well-defined data fabric, you
can move analytics around your environment in a way that you
could not do previously.
Building a Data Fabric Is a Journey
The first thing businesses want to know about their data is whether
they can access it. Next, they want to know whether they can define
and add value to it, and then whether they can gain insights from it.
The last thing businesses ask when it comes to their data is whether
they can make it available quickly enough to derive value from it
using analytics.
Many organizations spend a great deal of time, energy, and money
trying to determine whether they can access their data, and define
and add value to it. But they don’t always leverage it with analytics to
gain insights. And almost all fail to learn from their insights and
then optimize to close that loop. In the future, the way businesses
will compete when data fabrics become commodities will be how
well they execute in a closed-loop analytics system.
We reiterate that building a data fabric is not about purchasing
and deploying a single solution. Nor does it happen overnight. It’s a
journey during which you gradually put all of the many pieces in
place (see Figure 2-3).
Figure 2-3. The data fabric journey
You move from working with use-case-specific data to domain-specific data, taking care to engineer processes and choose tools that
will enable you to be consistent and repeat your successes. Only
after you have achieved that should you move to enterprise-scale
data services, where the data fabric supports the data-driven needs
of the entire organization.
Embedding AI/ML capabilities within data-fabric-enabling technol‐
ogy can help address rising complexity and reduce workloads by
automating manual processes such as data discovery and matching,
data model design, and query optimization. Additionally, tools that
make it easy for “citizen” data analysts, data scientists, and “citizen”
data engineers to contribute to your data fabric journey can help
you expand your pool of data-analytics-enabled people.
Finally, with subject matter expertise so critical to data quality, data
governance, and the responsible use of data and models, the closer
your business users are to the various aspects of your data fabric, the
more successful you will be. Your business-domain experts will help
you monitor your data at the grassroots level to ensure it is being
used wisely and well.
CHAPTER 3
How to Get Started
The business value of a data fabric is clear. It provides one place to
go for data, for better and faster insights. It offers consistent, high-quality, governed, and secure data to everyone throughout your
organization. It simplifies your journey to data democratization.
And your users will spend less time searching and more time ana‐
lyzing. Table 3-1 shows five prime use cases for data fabrics, the
challenges they solve, and the benefits they bestow.
Table 3-1. Use cases for data fabrics

Data fabric use case: Customer engagement (customer 360)
Data challenges solved by data fabric:
• Customer data is spread across multiple systems, making it difficult to
truly understand customers.
• Incomplete data impacts sales and marketing effectiveness.
Benefits: One place to go for customer data:
• Revenue acceleration
• Lower cost of sales
• Higher return on marketing spend

Data fabric use case: Line-of-business (LOB) operations (operations 360)
Data challenges solved by data fabric:
• LOB operations data is distributed in silos.
• Incomplete data prevents optimizing resources, increases costs, and
reduces customer satisfaction.
Benefits: One place to go for LOB data:
• Lower operational costs
• Faster response to changing operational dynamics

Data fabric use case: New-product innovation
Data challenges solved by data fabric:
• Lack of access to data inhibits collaboration spanning multiple groups
and systems.
• Lack of real-time data access deters speed-to-market and optimization
of R&D resources.
Benefits: One place to go for R&D data:
• Faster time to market for new products
• Higher return on R&D spend

Data fabric use case: Compliance
Data challenges solved by data fabric:
• Systems are built for operations, not compliance.
• It’s difficult to meet evolving compliance requirements and variations
of rules across geographies and LOB operations.
Benefits: One place to go for compliance data:
• Comply faster, with fewer resources
• Stay out of jail, stay out of the media, and avoid fines

Data fabric use case: Risk management
Data challenges solved by data fabric:
• Systemic risk spans organizational and system data silos.
• There is no complete view of risk factors and metrics.
Benefits: One place to go for risk management data:
• Provide a full view of risk
• Help avoid catastrophes
Create a Data Fabric Center of Excellence
Most users aren’t interested in technology frameworks and
architectures. Although they rely on them, they don’t need to be
exposed to them. The most successful data fabrics are organized
around a center of excellence (CoE) for analytics and data. A data
fabric CoE lets you abstract the technology away from your business
users, giving you a better chance of truly democratizing your data. In
effect, a CoE is a “village” of expertise within your business. And it
does take a village to leverage analytics in a highly scaled way across
big enterprises. Your users, from business analysts to the CEO, just
need to trust that village.
Five Pieces of Advice for Getting Started on a
Data Fabric
Like most important business initiatives—and make no mistake,
data fabrics are important business initiatives—support for a data
fabric must come from the very top of your organization. The C-suite must be stakeholders. And that means more than the CIO or
CTO. The CEO and COO should be involved, too.
Another piece of advice common to most technology ventures is to
start small. Pick a project, consolidate a small data fabric underneath
that project, and then build on that success. A lot of companies start
with a large-scale deployment, and three years later they’re still
working on it, with little sign of closure.
In addition to these two pieces of advice that are given for most
technology initiatives, here are five more suggestions that are spe‐
cific to building a data fabric.
One: Virtualize, Don’t Centralize, Your Data
Centralizing your data was a best practice three decades ago, and
many enterprises are still struggling to make that vision a reality,
even as data volumes and complexity grow. But today, with data
being generated and stored everywhere from the data center to the
edge to the cloud—and needing to be accessible to users who could
be located anywhere in the world—having a single centralized “sys‐
tem of record” doesn’t work anymore.
Data virtualization is the answer. Data virtualization is a way to
manage data so that users—or applications—can access it, retrieve
it, and use it without having to know where it is physically located or
how it is formatted. It can make all your data sources look as though
they are located in one centralized data store even when they are
physically located all across the planet.
In a data fabric, you want to create virtual views of your data, not
actually move the data to a centralized location. You can still apply
security controls and authentication protections as though the data
were all in one place, and virtual views make things a lot easier for
your business users. For a robust data fabric, you definitely want to
virtualize, not centralize, your data.
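
As a rough illustration of the idea above, the following Python sketch shows a virtual view that answers queries by asking each source at query time instead of copying rows into one store. Everything here is hypothetical and chosen for illustration: the `VirtualView` class, the two stand-in sources, and the sample rows are assumptions, not part of any particular data virtualization product.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Source = Callable[[], Iterable[Row]]  # hypothetical: each source fetches its own rows on demand


class VirtualView:
    """A read-only view over several sources; rows are fetched at query time, never copied in."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self._sources = sources

    def query(self, **filters: object) -> Iterator[Row]:
        # Ask each source for its rows only when a consumer queries the view.
        for name, fetch in self._sources.items():
            for row in fetch():
                if all(row.get(key) == value for key, value in filters.items()):
                    # Tag each row with its origin; consumers never need the physical location.
                    yield {**row, "_source": name}


# Two stand-in sources that would, in practice, live in different places.
def warehouse_orders() -> List[Row]:
    return [{"customer": "acme", "amount": 120}, {"customer": "globex", "amount": 75}]


def edge_orders() -> List[Row]:
    return [{"customer": "acme", "amount": 30}]


orders = VirtualView({"warehouse": warehouse_orders, "edge": edge_orders})
acme_total = sum(row["amount"] for row in orders.query(customer="acme"))
```

The consumer filters on `customer` without knowing which source holds which rows, which is the property the text describes. A real implementation would also push filters down to the sources and enforce access controls at the view.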
Two: Build an Intelligent Data Fabric by Integrating
AI into It
Even as you’re trying to democratize your data, there’s the beginning
of a movement to do the same with AI. By embedding AI into the
data fabric itself, you make it easier for your data scientists to
build algorithms and, more importantly, to pass those algorithms
on to your business users for faster and smarter decision-making.
This is important, because to derive the most value from your data,
you can’t let these AI models only be used by the data team or lan‐
guish in a digital vacuum. You need to operationalize your models—
which means deploying these ML algorithms across your entire
organization.
For example, data scientists could build predictive analytics models
and embed them in the data fabric so that business intelligence tools
like Tableau or Power BI can handle requests such as the following:
Show me how many tractors we sold in Wyoming last month, and
tell me how many are likely to be sold there over the next six
months.
Users can then ask increasingly complex questions and the models
will parse the data to answer them. Giving users access to these
machine learning models with just a few clicks dramatically accelerates
data democratization and leads to better business outcomes.
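
To make the tractor request concrete, here is a minimal Python sketch of the two halves of that question: a lookup of last month's sales and a projection for the next six months. The sales figures, the `SALES` table, and both function names are invented for illustration, and the moving average is only a naive stand-in for whatever predictive model a data science team would actually embed.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical monthly unit sales keyed by (product, state), oldest month first.
SALES: Dict[Tuple[str, str], List[int]] = {
    ("tractor", "WY"): [10, 12, 14, 16],
}


def last_month(product: str, state: str) -> int:
    """Answer the descriptive half of the question: what was sold last month?"""
    return SALES[(product, state)][-1]


def forecast(product: str, state: str, months: int, window: int = 3) -> float:
    """Naive projection: the average of the last `window` months, repeated for `months` months."""
    history = SALES[(product, state)]
    return mean(history[-window:]) * months


sold = last_month("tractor", "WY")
expected = forecast("tractor", "WY", months=6)
```

The point of the sketch is the interface, not the math: the business user asks one question, and the fabric answers both the historical and the predictive part from the same data.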
Three: Automate Virtualization of Your Streaming Data
In 2021, automation is the name of the game. Whether you’re
talking about actual, physical robots that work assembly lines in
modern factories, or software robots (bots) that automate digital business
processes, enterprises are trying to automate everything that can be
automated. Automation requires data. So, ubiquitous automation
means making data universally available to applications through vir‐
tualization, as we discussed in the first piece of advice.
But things get more complex when you talk about streaming data—
also called data in motion. Unlike data at rest—data that’s stored in a
database—streaming data comes into your organization every day in
countless ways: from your website, ecommerce store, the internet,
embedded sensors on network devices, and more. But unfortunately,
the majority of data virtualization tools only work on data at rest.
That’s why getting up-to-the-second readings of, say, your data cen‐
ter’s electric meter, is so difficult. To do this, the system needs to
retrieve data that can change several hundred times per second. You
can try to tame streaming data by putting it in a database. But this is
like trying to watch a movie as a series of still photographs. So much
is lost. If you do this in business, opportunities are lost.
That’s why you need streaming data virtualization to be automated.
Doing this connects data in motion to business systems, transform‐
ing each piece of data—each “event”—into a row in a table. This is
done in real time. Streaming data virtualization then revitalizes
those events. By automating streaming data virtualization, you give
your business users access to all your enterprise’s data in motion,
such as network device readings, drone data, and even weather
forecasts, so they can analyze it and derive value from it.
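
The event-to-row idea can be sketched in a few lines of Python. This is an assumption-laden illustration rather than how any specific streaming product works: the `Reading` row type, the `LiveTable` class, and the meter events are all hypothetical, and a plain list stands in for a real message stream.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Reading(NamedTuple):
    """One row in the virtual table: a single event from a device."""
    device: str
    timestamp: float
    value: float


def events_to_rows(events: Iterable[dict]) -> Iterator[Reading]:
    # Transform each incoming event into a row as it arrives, without buffering the stream.
    for event in events:
        yield Reading(event["device"], float(event["ts"]), float(event["value"]))


class LiveTable:
    """Keeps only the most recent row per device, so queries always see current readings."""

    def __init__(self) -> None:
        self._latest: Dict[str, Reading] = {}

    def apply(self, row: Reading) -> None:
        current = self._latest.get(row.device)
        if current is None or row.timestamp >= current.timestamp:
            self._latest[row.device] = row

    def rows(self) -> List[Reading]:
        return sorted(self._latest.values())


# Hypothetical meter events; in practice these would arrive from a message stream.
stream = [
    {"device": "meter-1", "ts": 1.0, "value": 40.2},
    {"device": "meter-2", "ts": 1.0, "value": 12.0},
    {"device": "meter-1", "ts": 2.0, "value": 41.7},
]

table = LiveTable()
for row in events_to_rows(stream):
    table.apply(row)
```

After the loop, `table.rows()` holds one current row per device, so a consumer can read up-to-date values with an ordinary table query instead of handling the stream directly.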
Four: Create a Data-As-A-Service Offering for Your Users
Like all “as-a-service” technologies, data as a service (DaaS) is based
on the idea that data is a product that can be given to users (or
applications) on demand via the cloud, no matter where they are
located or what kind of device they are using.
This is good for users, because the burden of managing the data—
which can get quite complex—and the storage system(s) that houses
it falls onto the DaaS provider. The user just takes what is needed
and doesn’t worry about any of the backend mechanisms. At its
most basic, DaaS is just a new way to allow users to access data
easily and without hassles.
DaaS eliminates the need to put REST interfaces on your data.
Instead, you can directly create APIs from data, which can be inte‐
grated with API management tools that turn those APIs into man‐
aged services. This makes it simple to provide data to your users in a
safe and easy way.
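
One way to picture "creating APIs from data" is a function that takes named datasets and returns a ready-made lookup handler, with no hand-written endpoint per dataset. The Python sketch below is only an illustration under that assumption: `api_from_data`, the `stores` dataset, and the URL shape are hypothetical, and a real DaaS offering would add authentication, paging, and API management on top.

```python
import json
from typing import Callable, Dict, List
from urllib.parse import parse_qsl, urlsplit

Row = Dict[str, object]


def api_from_data(datasets: Dict[str, List[Row]]) -> Callable[[str], str]:
    """Derive a read-only lookup function from named datasets; one path per dataset."""

    def handle(url: str) -> str:
        parts = urlsplit(url)
        name = parts.path.strip("/")
        if name not in datasets:
            return json.dumps({"error": f"unknown dataset: {name}"})
        # Query-string parameters become equality filters on the rows.
        filters = dict(parse_qsl(parts.query))
        rows = [
            row for row in datasets[name]
            if all(str(row.get(key)) == value for key, value in filters.items())
        ]
        return json.dumps({"dataset": name, "rows": rows})

    return handle


# Hypothetical dataset published as a service.
api = api_from_data({"stores": [{"id": 1, "state": "WY"}, {"id": 2, "state": "CO"}]})
response = json.loads(api("/stores?state=WY"))
```

The consumer only sees a path and a filter; where the rows are stored and how they are managed stays on the provider's side.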
Five: Create and Nurture a “Data Curation” Culture
Your data is a valuable asset—probably the most valuable asset your
organization has. The irony is that you can’t protect it by keeping it
locked up. The value comes from using it. But you have to be careful
that it doesn’t get abused, not just by external hackers or cybercrimi‐
nals, but by your employees, who may inadvertently misuse or mis‐
handle data.
Because of this, your data team should, of course, include data
stewards, who are the designated guardians of the data and who put
processes in place to ensure that it is used appropriately. They are
also data advocates who try to get users both enthusiastic about and
careful with the rich troves of data your organization possesses.
But you shouldn’t stop at deputizing formal stewards. You should
make everyone aware of the importance of clean, trusted data, and
work together to ensure that this very valuable corporate asset is
both used and cared for appropriately.
Here’s an example: In 2020, when COVID-19 hit, Panera Bread
pulled off a dramatic pivot. The fast-food, fresh-lunch restaurant
transformed its business model within 10 days. In addition to selling
prepared food, Panera decided to sell milk, yogurt, tomatoes, and
avocados. Panera leveraged its team of 50,000 people and a network
of 2,000 stores to deliver items in 40 minutes through its delivery
network of 10,000 drivers, locations with pick-up, and on Grubhub.1
Panera’s transformation was made possible by a culture of data cura‐
tion. Faced with increasing data, new complexity, and manual pro‐
cesses, Panera CIO John Meister established a vision called “One
Panera.” One Panera placed responsibility for data in the hands of
business users as well as partners. Instead of putting data behind
walls, One Panera democratized it. Today, Panera Bread’s Menu
Master Data Space provides a single view of 135,000 prices for items,
including price tiers and categories, that its own business users
curated.
This was possible because of the data tools Panera put into place that
did the following:
• Enabled knowledge workers to curate and monitor business
data in a self-service way
• Easily extracted metadata from databases
• Incorporated AI models created by data scientists
• Helped knowledge workers identify and fix quality issues
When integrated with Panera Bread’s data fabric, these tools
effectively made the entire company a part of the data team, which
made possible the business agility that Panera so deftly demonstrated.
Best Practices When Managing and Growing
Your Data Fabric
Once you have your data fabric in place, you will want to maintain
and enhance it over time.
Rome was not built in a day. Nor is a data fabric. Start your data
fabric journey with small wins that deliver on immediate needs,
prove new architecture concepts, and deliver value. Upon those suc‐
cesses, you can grow purposefully toward your vision.
1 “Panera Cooks with Data to Deliver on Service and Satisfied Customers,” TIBCO Soft‐
ware, Inc., accessed May 25, 2021, https://oreil.ly/X7wgd.
Embrace Technology Evolution and Convergence
Your business won’t stand still. And technology never does. Not only
is technology improving, but a huge convergence is underway that
you can leverage when you modernize your technology as part of
data fabric adoption. So, look for modern, convergence-enabling
solutions that span traditional technology silos, such as:
Analytics convergence
    Self-service visualization, data science, streaming analytics, and
    reporting
Data management convergence
    Metadata management, MDM, data governance, data quality,
    data cataloging, data modeling, data integration, and data
    security
Cloud convergence
    Storage, computing, development tools, applications, and
    marketplaces
Data analytics and transactional applications convergence
    Converging technologies and teams
IT/Operational technology (OT) convergence
    Data in motion and data at rest
Make Sure Your Data Fabric Is Truly Holistic
Do you really consider data to be your most important asset? If so,
to drive the highest impact from it you need to stop treating your
transaction data, streaming data, metadata, master data, and refer‐
ence data assets as unique, unrelated domains and instead begin
managing what is common across your data assets. A holistic data
fabric enables synergies that simplify governance, security, control,
and much more. More holistic approaches are often enabled by data
management technology convergence in general, and via closer inte‐
gration within the fabric in particular. For example, choose a single
MDM platform capable of managing customer, product, employee,
location, and other key datatypes consistently, rather than using dif‐
ferent MDM platforms for different domains.
Support Today’s Distributed Data Analytics Topology
Data is anywhere and everywhere. The same can be said about your
data consumers. With so much data and so many users, it’s impossi‐
ble to support traditional data centralization paradigms such as data
warehouses and even the more modern data lakes and lake houses,
which today are just additional data sources. You need to embrace
decentralization, as it is a much better solution for today’s dis‐
tributed data analytics topologies. Look for a data fabric that can
span data from any source and deliver it to any consumer, anywhere.
Augment Your People Using AI and ML
Most organizations lack sufficient numbers of skilled IT staff mem‐
bers and are finding it difficult to recruit and retain data experts in
today’s competitive markets. AI and ML are obvious solutions.
According to a recent survey, more than 50% of companies have
completed at least one AI initiative, with 66%, on average, seeing
increased revenue across a broad range of functions (Figure 3-1).2
Figure 3-1. AI is returning high ROI (Source: McKinsey)
2 “The State of AI in 2020,” McKinsey & Company, November 17, 2020, https://oreil.ly/qPnJA.
Keep It Open and Flexible
Stay open to change and build an architecture that is meant to
evolve. Furthermore, because of the risk inherent in big-bang rear‐
chitecture initiatives, remember that much of your current state will
continue to change for a long time. So, leverage open standards and
APIs to facilitate the inevitable transitions. Finally, start your
future-state data architecture journey with small wins that prove new
architecture concepts and deliver value. Upon those successes, you can
grow purposefully toward larger architecture transformations.
Design It for Easy Decoupling and Layering
Decoupling lets you separate how you manage data from how you
consume it. As a result, you can manage each datatype optimally,
within the original source, in an on-premises data warehouse, in a
cloud data lake, on an edge device—wherever it makes the most
sense from a storage and management point of view—and keep it
independent from however you choose to provision it to your busi‐
ness users. Decoupling also helps your business users because it lets
you free them from knowing IT internals such as schemas and syn‐
tax. Instead, you can provide a consistent, secure, and governed
common data layer consisting of shareable, business-friendly data
objects that make it easy for your business users to find and use all
of your most important data.
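
A small Python sketch can show what a business-friendly data object might look like in practice. The column names, the `CUSTOMER_FIELDS` mapping, and the helper function are all hypothetical; the only point is that consumers see stable, readable names while the physical schema stays hidden behind the mapping.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from business-friendly field names to physical column names.
CUSTOMER_FIELDS: Dict[str, str] = {
    "customer_name": "CUST_NM",
    "region": "RGN_CD",
    "lifetime_value": "LTV_AMT",
}


def to_business_objects(
    physical_rows: Iterable[Dict[str, object]], fields: Dict[str, str]
) -> List[Dict[str, object]]:
    """Expose only the mapped fields, under names a business user would recognize."""
    return [
        {friendly: row[physical] for friendly, physical in fields.items()}
        for row in physical_rows
    ]


# A row as it might look in a source system, including an internal column users never need.
source_rows = [{"CUST_NM": "Acme", "RGN_CD": "WEST", "LTV_AMT": 1250.0, "ETL_BATCH_ID": 77}]
customers = to_business_objects(source_rows, CUSTOMER_FIELDS)
```

If the source schema changes, only the mapping needs to be updated; anything built on `customer_name`, `region`, and `lifetime_value` keeps working.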
Migrate Intelligently
Data virtualization lets you insulate your consumers as you migrate
your data sources and insulate your sources as you migrate your
consumers. And when you’re finished, you can continue to drive
value from these decoupled data objects.
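
The insulation described here can be illustrated with a data object whose physical source is swapped while its consumer stays untouched. This Python sketch is a simplified assumption of how such a binding could work: `DataObject`, `rebind`, and the two source functions are hypothetical names, and real migrations would also need validation that the old and new sources agree.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class DataObject:
    """A stable, consumer-facing name bound to whichever source currently holds the data."""

    def __init__(self, name: str, source: Callable[[], List[Row]]) -> None:
        self.name = name
        self._source = source

    def rebind(self, new_source: Callable[[], List[Row]]) -> None:
        # Swap the physical source; consumers keep calling read() unchanged.
        self._source = new_source

    def read(self) -> List[Row]:
        return self._source()


# Stand-ins for the source being retired and the source being migrated to.
def legacy_warehouse() -> List[Row]:
    return [{"sku": "A1", "price": 4.5}]


def cloud_lake() -> List[Row]:
    return [{"sku": "A1", "price": 4.5}]


def price_report(prices: DataObject) -> float:
    """A consumer that knows only the data object, not where the data lives."""
    return sum(row["price"] for row in prices.read())


prices = DataObject("menu_prices", legacy_warehouse)
before = price_report(prices)
prices.rebind(cloud_lake)
after = price_report(prices)
```

The report function is never edited during the migration, which is the sense in which the virtual layer insulates consumers from source changes.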
Don’t Over-Innovate
Keep privacy and governance in mind. Just because you can do
something with your data doesn’t mean you should. A major retailer
got into trouble several years ago by using its data too cleverly, in a
way that violated the privacy of its consumers.3 Always think of the
origin of the data, and make sure you don’t harm customers or
partners by using it to take action.
3 Charles Duhigg, “How Companies Learn Your Secrets,” New York Times Magazine, Feb.
16, 2012, https://oreil.ly/p7QI7.
Keep Your Processes Standard
Focus on a process layer so that you can leverage what you’ve built in
a repeatable, governed, standardized way. Just integrating systems
together is not enough. You must have processes and controls in
place to make sure the system is used consistently throughout your
organization. It’s like having a well in the center of a village, where
everyone goes to collect water. The bucket needs to stay the same.
The rope needs to stay the same. The crank needs to stay the same.
So, standardize as many of the functions and processes of accessing
and using data as possible.
Conclusion: It’s Time to Act
According to NewVantage Partners’ latest data survey, most organi‐
zations have a lot of work to do when it comes to getting value from
their data:4
• Only 48.5% drive innovation with data
• Only 41.2% compete on analytics
• Only 39.3% manage data as a business asset
• Only 30.0% have a well-articulated data strategy
• Only 24.4% forge a data culture
• Only 24.0% create a data-driven organization
Clearly, it’s time to do something about this. Building a data fabric is
the answer. A data fabric enables organizations to become truly data
driven, empowering them to meet the demands of their businesses
and gain a competitive edge.
4 “Big Data and AI Executive Survey 2021,” NewVantage Partners LLC, January 2021,
https://oreil.ly/hHRKP.
Other benefits of building a data fabric include removing data
silos, gaining control over your data, managing it consistently across
multiple environments (on premises, cloud, hybrid, and edge), and
reducing the hassles of data integration. When you have your
enterprise-wide data fabric in place, you will be able to do the
following:
• Perform data management processes on a single unified
platform
• Pull and connect or collaborate on data from disparate sources
across locations
• Manage data across all environments (multicloud, hybrid, and
on premises)
• Allow single, seamless access to and control of data across sources
and types
• Provide analytics tools and connectivity to other analytical
solutions
• Offer metadata functionality with data currency and data line‐
age capabilities
With a data fabric, companies can design, collaborate on, transform,
and manage data regardless of where it resides or is generated.
In summary, a data fabric is not a single tool or solution that you
put into place overnight. Instead, it is the culmination of all the
many sophisticated tools that have been created to manage data—
from identifying and tagging it, to cleansing it, moving it, governing
it, and analyzing it.
Plus, something happens when you bring all this together in an
architecture that is greater than the sum of its parts. You have a
vision for the data-driven future.
About the Author
Alice LaPlante is an award-winning writer, editor, and teacher of
writing, both fiction and nonfiction. A Wallace Stegner Fellow and
Jones Lecturer at Stanford University, Alice taught creative writing
at both Stanford and in San Francisco State’s MFA program for more
than 20 years. A New York Times best-selling author, Alice has pub‐
lished four novels and five nonfiction books. She has also edited
best-selling books for many other writers of fiction and nonfiction.
She regularly consults with Silicon Valley firms such as Google,
Salesforce, HP, and Cisco on their content marketing strategies.
Alice lives with her family in Palo Alto, California, and Mallorca,
Spain.