eBook
6 obstacles analysts face for self-serve analytics and how to overcome them
Strategies and tips for implementing self-service analytics for a data mesh architecture
The promise of self-service analytics in a data mesh architecture
Forward-thinking, data-driven organizations have embraced modern data architectures as a way to better manage their data and deliver value at scale. The rise of decentralized environments such as a data mesh architecture is taking this to a new level, giving organizations even greater flexibility, scalability, and seamless data movement. A data mesh is also a highly effective way for organizations to improve self-service analytics by allowing data analysts and business users faster access to the data they need without having to transport it to a data lake or warehouse, greatly reducing their reliance on IT and engineering. For any business that wants to truly benefit from this, it is critically important to remove any blockers standing in the way of the value that self-serve analytics can deliver.
In this ebook, we will cover the six most common obstacles that data leaders should be aware of when implementing self-service analytics so that you can help your team deliver high-quality data products that enable faster decision-making and greater business agility.
Raj Bains
Founder & CEO, Prophecy
Obstacle 1
Slow product delivery
Slow product delivery is one of the biggest obstacles data leaders face when it comes to
implementing self-service analytics. When data products take a long time to deliver, decision
makers may not have the information they need to make timely and informed decisions.
This can result in missed opportunities or, worse, decisions based on incomplete, inaccurate,
or outdated information. Delays in data product delivery can cause frustration and erode
confidence in the data team’s ability to deliver value to the organization.
Here are some of the top factors that contribute to slow product delivery:
Lack of visibility into available data
The inability of analysts to seamlessly view, access, and analyze data from multiple and
diverse sources can lead to significant challenges and hinder decision-making processes.
This may arise due to various reasons, such as incompatible data formats, data silos,
inadequate data integration tools, or lack of proper data governance. Such hindrances can
significantly limit the efficiency and effectiveness of data analysis and create data blind
spots. The analysts may struggle to identify correlations, trends, or patterns, and may not be
able to extract meaningful insights that could drive business outcomes. Thus, it is imperative
for organizations to invest in robust data management and integration solutions to enable
their analysts to access and analyze data from different sources without any impediments.
Tooling not built for data analysts
Most data transformation tooling is not designed with business data users in mind. As a
result, there is often an over-reliance on engineering resources to help build data pipelines
that require complex coding. When this occurs, the timeline for engineering to perform
the complicated coding work needed to build data pipelines that adequately perform the
integration and transformation work that feeds new data products can stretch from days to
weeks or even months. This causes frustration for all stakeholders and keeps engineering
from more critical tasks.
Poor data quality assurance
Low-quality data pipelines can be a significant problem for organizations, as they can lead to
a range of issues that can impact the performance and reliability of the system. When these
pipelines are built and deployed, they may not function correctly or may generate inaccurate
data. These problems can go undetected during testing and development, only to be
discovered in production, at which point the pipeline must be taken down and rebuilt.
Lack of reusability of data artifacts
When data analysts are provided with quick access to pre-existing data sources, datasets, or
pipelines, it can result in considerable savings of time and resources. Conversely, the lack of
such access can cause significant delays in delivering data products, as analysts are forced to
begin the analysis process anew every time a dataset needs evaluation.
Inability to identify data quality issues
When dealing with data analysis, it is essential to ensure that the data used is of high quality.
Poor quality data can pose a significant challenge, as it may contain errors, inconsistencies, or
even duplicates. Such data can be time-consuming and challenging to work with, and it can
significantly slow down the deployment of data pipelines. Moreover, the insights generated
from such data may not be accurate or reliable, thereby impacting the overall quality of the
analysis. This is because data pipelines rely on the accuracy and consistency of data to
produce meaningful insights. Therefore, it is crucial to invest in data cleaning and validation
processes to ensure that the data used is of high quality, thereby minimizing errors and
inaccuracies in the final output.
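To make this concrete, below is a minimal sketch of the kind of automated validation a pipeline could run before publishing data. It assumes a PySpark environment; the dataset path and the order_id, customer_id, and amount columns are purely illustrative assumptions, not a specific product's API.

# Minimal sketch of pre-publish data quality checks (illustrative names and paths).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality-checks").getOrCreate()
orders = spark.read.parquet("s3://example-bucket/orders/")  # hypothetical source

# 1. Completeness: required columns must not contain nulls
required = ["order_id", "customer_id", "amount"]
null_counts = orders.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in required]
).first().asDict()

# 2. Uniqueness: order_id should not contain duplicates
duplicate_ids = orders.groupBy("order_id").count().filter(F.col("count") > 1).count()

# 3. Validity: order amounts should never be negative
negative_amounts = orders.filter(F.col("amount") < 0).count()

issues = {**null_counts, "duplicate_order_ids": duplicate_ids, "negative_amounts": negative_amounts}
if any(v > 0 for v in issues.values()):
    # Fail fast so bad records never reach downstream data products
    raise ValueError(f"Data quality checks failed: {issues}")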
The solution
A key concept of a data mesh is the idea of data productization, which involves treating data
as a product that is designed, developed, and maintained by data teams. Ensuring that a data
team is best equipped to help enable self-service analytics and avoid slow product delivery
can be done with a data engineering layer that allows organizations to:
• Integrate more easily with modern tooling for data engineering and analytics
• Quickly build real-time data pipelines that access data from disparate source systems (see the sketch after this list)
• Enable more data team members to participate in data engineering
• More easily collaborate between different teams and departments
• Ensure data quality to improve data pipeline reliability and performance
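As referenced in the list above, here is a minimal sketch of a pipeline that reads from two disparate source systems and publishes one curated table for analysts. It assumes PySpark and is written as a simple batch job for brevity (the same idea extends to streaming reads); the JDBC connection details, storage path, and table names are hypothetical.

# Illustrative pipeline joining an operational database with event files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("disparate-sources").getOrCreate()

# Source 1: customer records from a transactional database over JDBC
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://example-host:5432/crm")  # hypothetical
    .option("dbtable", "public.customers")
    .option("user", "analyst")
    .option("password", "example-secret")
    .load()
)

# Source 2: web events landed as Parquet in object storage
events = spark.read.parquet("s3://example-bucket/web_events/")  # hypothetical

# Join the sources and publish a curated dataset analysts can query directly
enriched = events.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").saveAsTable("analytics.enriched_events")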
Obstacle 2
Lack of ownership and accountability
A lack of ownership is the inability to know who owns a particular dataset, how it has been
transformed, and who is responsible for various aspects of the process. In order to ensure
successful self-service analytics, it is important that data leaders take steps to define who
owns the data, or data product, and how it will be managed.
Without clear ownership, businesses can expect a variety of challenges to trickle down
the workflow:
Poorly defined business and technical requirements
In order for self-service analytics to be successful, it is essential that organizations develop
well-defined business and technical requirements (best practices) which are communicated
clearly to all stakeholders. These include having a clear understanding of what data sources
are available, how data is structured, as well as how often data is refreshed or new data is
added. Failing to clearly define these requirements can lead to inaccurate analyses and poor decision-making.
Disconnect between data product and value
If there is no clarity into the intended business value of a particular data product, the
trustworthiness of the data product will be low. Even if the data product itself is
performant and can provide good insights, not having this clarity or understanding leads
to a lack of usage.
Low data product quality
Delivering data that may be stale or incomplete can result in inaccurate
analysis, incorrect insights, and bad decisions which can do far more harm
to the business than good. Additionally, once a data product is viewed as unreliable, it will go unused, wasting the time and resources consumed in its development.
The solution
These challenges often occur when an organization has a fragmented, decentralized data architecture, for which a data mesh is a better strategic fit than a data warehouse architecture that requires more centralized ownership of data. This decentralized
architecture is best served by a data engineering platform that can:
• Automatically ensure data quality
• Offer an intuitive low-code environment that allows teams to iterate on data products until requirements are better understood and defined
• Build and modify data pipelines quickly and easily, without having to write code, reducing the time and resources required to create and modify pipelines as requirements become more defined or evolve
Obstacle 3
Lack of data visibility
Data visibility is how well an organization can monitor, display, and analyze data from disparate
sources. With clear data visibility, businesses are able to make better-informed decisions
quickly. Organizational success relies heavily on data visibility because the absence of it can
lead to inefficient operations, missed opportunities, increased risk, and poor decision-making.
Put simply, having data visibility is critical.
Here are some things that affect data visibility and how it can impact self-serve analytics:
Understanding where data originates
Self-serve analytics are often based on data that is collected from multiple and disparate
sources. Not having clarity about these sources and the quality of that data can lead to
inaccurate results or incomplete insights.
Controlling data change
When a data source is modified or updated, the information it contains changes. For example,
a sales database may have new entries added or old entries modified when new transactions
occur or customer information is updated. These changes affect the raw data and can have an
impact on how that data is interpreted and used for analytics.
Rising data volumes
The introduction of new data to an organization’s data ecosystem leads to an overall increase
in data volumes, which can significantly impact the performance of data pipelines responsible
for moving data between various stages of the analysis process. As the data volume increases,
processing time and latency of the pipelines also increase, leading to potential pipeline failures
and data loss. Furthermore, the introduction of new data can influence the insights derived
from the data, potentially highlighting previously unidentified trends or contradicting existing
data, necessitating changes in analytical models or the creation of new ones.
Unclear data labels and definitions
Labeling data is a critical aspect of data analysis as it provides context and meaning
to information. By adding labels, it becomes simpler to identify, organize, and analyze
information. When data lacks proper labeling, it becomes challenging to perform accurate
analysis and draw reliable conclusions. The analysis of unlabeled data may lead to inaccurate
insights, which can have significant consequences for organizations.
The solution
In a data mesh architecture, data visibility is important because it enables data teams
and other stakeholders to understand and effectively manage the flow of data across the
organization.
This can be done with tooling that allows the data team to:
• Build and visualize data flows, making it easy to understand data lineage and see how data is transformed and where it comes from (a minimal lineage sketch follows this list)
• Easily catalog data, including metadata and information about data sources, tables, and columns, helping users understand data structure and context
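As a rough illustration of the lineage idea referenced above, the sketch below records a small lineage entry for each dataset a pipeline produces so anyone can trace where it came from. The LineageRecord structure and field names are hypothetical, not tied to any specific catalog or product API.

# Illustrative lineage logging for data produced by a pipeline run.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class LineageRecord:
    output_dataset: str
    input_datasets: list
    transformation: str
    owner: str
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def record_lineage(record: LineageRecord, path: str = "lineage_log.jsonl") -> None:
    # Append one JSON line per produced dataset so lineage can be queried later
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

record_lineage(LineageRecord(
    output_dataset="analytics.enriched_events",
    input_datasets=["crm.customers", "raw.web_events"],
    transformation="left join on customer_id; drop unused columns",
    owner="analytics-team",
))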
Obstacle 4
Working in silos
Siloed work is one of the more common issues that data teams are faced with, whether their
organization is in growth mode or at enterprise-level. This occurs when different teams or
team members take ownership of certain data sets or data products, work in isolation, and
do not share information, knowledge, or resources with others in the organization. Working in
silos greatly limits collaboration and reduces efficiency within an organization as teams can be
unaware of the work being done by others.
Here’s what that often looks like:
Ineffective communication with data engineers
Limited or poor communication with data engineering can severely impact the effectiveness
of self-serve analytics. Data engineers are responsible for enabling data analysts and other
business data users to create the self-service analytics needed to make informed decisions. With limited communication, decision-making is delayed at best, or poor guidance is shared with the organization at worst.
Duplicate efforts across teams
In organizations with siloed data structures, it is common for different teams to tackle similar
challenges without realizing that their efforts may be overlapping or duplicative. This can lead
to the inefficient use of resources, including time, budget, and manpower.
Broken transformations
When data engineers and data analysts work in a siloed structure, engineering is not aware of the data pipeline work done by analysts or whether these pipelines are performing correctly. Without visibility into whether data transformations are being done correctly, the resulting intelligence shared with decision-makers can lead to poor decisions.
Poor data hygiene
Within a siloed data organization, it can be difficult to ensure that data is being properly
maintained, updated, and secured. This can lead to errors, inconsistencies, or other issues that
can compromise the integrity of the data being used.
Disconnect with business needs
If data engineering doesn’t have a clear understanding of the downstream use cases for their
data, it makes it impossible to ensure that the data being used is properly structured and
optimized for specific use cases.
Lack of data governance and lineage
When siloed teams are working with their own data sets, there is very little visibility into how
that data has been transformed and used over time and if there are any potential issues or
errors with the data. This impacts efforts to enable self-serve analytics because data analysts or business users are unable to ensure their data is reliable and accurate.
As data evolves, quality issues can arise without the proper mechanisms to track modifications, additions, or deletions, undermining the quality and completeness of the data and eroding trust in it.
The solution
In a data mesh architecture, collaboration is essential for integrating data products, enabling
data sharing, and building data pipelines across different domains. This allows data teams to
unlock the true potential of their data assets and drive data-driven decision-making and value
creation.
Collaboration can be ensured with a data engineering platform that offers:
• A centralized workspace where data team members can work together in a shared environment, eliminating the need for disjointed workflows and ensuring everyone is on the same page
• Visual and low-code tooling that not only simplifies complex data engineering processes, but also enables better communication between data team members as they can all easily understand and contribute to the data pipelines and transformations
• Robust version control so that data team members can work simultaneously on data assets, track changes, and merge their work without conflicts
• Data lineage capabilities that trace the origin and transformation of data throughout data pipelines, promoting transparency and improved understanding by providing clear insights into data dependencies and transformations
• Notifications and alerting features so that team members can receive updates on changes made by others as they happen, opening up communication opportunities
Obstacle 5
Insufficient Service Level Agreements
Service Level Agreements (SLAs) are critical for ensuring that business users have reliable
access to the data they need when they need it. Without SLAs, users experience inconsistent
and unreliable access to data, resulting in delayed response times, or errors in their analysis.
They may also have difficulty in trusting the data they are using, which can result in inaccurate
insights and incorrect decision-making.
In practice, this may look like:
Performance does not meet business needs
Without SLAs, business users are not able to access the data they need when they need
it, data quality can be poor, and the accuracy of the data to support decision making can
be questionable. Additionally, without SLAs in place, there's no clear accountability when performance issues arise, which can cause friction between engineering and business users.
Inability to observe data accuracy and quality
When no SLAs specific to data quality are in place, neither data engineering nor data analysts and business users can be sure that their business intelligence is accurate. Poor quality
data leads to incorrect analyses and insights, and overall poor decision-making. “Garbage in,
garbage out” applies to this situation.
Unclear timelines
Without SLAs in place that enforce overall data availability and quality, there is no guarantee that data products will be delivered on time or that they will be of high quality. When there is a lack of clarity into data product delivery, this causes significant delays in data analysis, and can
even lead to failed delivery of data products.
Further, a lack of data freshness, or lack of SLAs that ensure data is current/up to date, can
lead to delays in decision-making and missed opportunities for growth or cost savings.
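One lightweight way to make a freshness SLA enforceable is to check the age of the newest record before downstream consumers use the data. The sketch below assumes PySpark; the table name, the ingested_at column, and the 6-hour threshold are illustrative assumptions agreed with business users, not prescribed values.

# Illustrative data-freshness SLA check for a published table.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("freshness-sla").getOrCreate()

FRESHNESS_SLA = timedelta(hours=6)  # hypothetical agreement with business users

latest = (
    spark.table("analytics.enriched_events")       # hypothetical table
    .agg(F.max("ingested_at").alias("latest"))     # hypothetical timestamp column
    .first()["latest"]
)

if latest is None:
    raise RuntimeError("Freshness SLA breached: the table contains no data")

age = datetime.utcnow() - latest  # Spark returns naive datetimes here
if age > FRESHNESS_SLA:
    # Surface the breach instead of silently serving stale data
    raise RuntimeError(f"Freshness SLA breached: data is {age} old (limit {FRESHNESS_SLA})")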
The solution
The right data engineering platform can provide data teams with powerful capabilities that can
ultimately lead to improved SLA compliance such as:
• Automation of many of the tedious and repetitive tasks involved in data engineering (workflow automation), which helps reduce the chances of errors and delays that can impact SLA compliance
• Low-code data transformation tooling that enables a much larger segment of the data team to perform data transformations without having to be coding experts, boosting productivity while lowering the burden on data engineering and removing a potential productivity bottleneck, greatly improving SLA compliance
• Strong collaboration capabilities that ensure everyone is on the same page and working towards the same goals, which can help improve SLA compliance
Obstacle 6
Code quality and complexity
Code quality and complexity each have massive impacts on the success of self-serve analytics
but often go hand in hand. Code quality is an assessment of how well the code is written. If
the code underlying data pipelines or driving business intelligence is not easy to read and
understand by engineering or other business users, then the reliability and overall performance
of the code can be highly questionable. Code complexity (which is also an indicator of code
quality) refers to the amount of code used to build and deliver the reporting and intelligence
used to drive business decision making.
Here’s how they impact self-serve analytics:
Code is incomprehensible
Code that’s hard to understand is an indicator of poor code quality. When engineers write code that doesn’t follow best practices, or write it in a way that other users can’t easily follow, it can lead to issues that prevent self-service analytics
from working correctly. This can result in data products that go unused at best, or data
products that are unreliable and deliver inaccurate outputs that negatively impact decision
making at worst.
Breaks between development and production
When there is uncertainty about whether something working in development will break something in production, it can be difficult to trust the data and insights generated by self-service analytics.
This uncertainty can lead to hesitation when it comes to making decisions based on the data,
as well as a lack of confidence in the accuracy of the results.
Inability to perform unit tests
Unit tests are intended to confirm that individual units of code are functioning correctly and as
expected. Without unit testing, the risk of deploying low-quality code that causes data loss, unreliable outputs, and inaccurate business intelligence increases significantly.
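As a rough illustration, the sketch below unit-tests a single transformation step with pytest and a local SparkSession. The add_order_total function and its column names are hypothetical examples, not part of any specific pipeline.

# Illustrative unit test for one pipeline transformation (pytest + PySpark).
import pytest
from pyspark.sql import SparkSession, functions as F

def add_order_total(df):
    # Transformation under test: total = quantity * unit_price
    return df.withColumn("total", F.col("quantity") * F.col("unit_price"))

@pytest.fixture(scope="module")
def spark():
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_add_order_total(spark):
    input_df = spark.createDataFrame(
        [(1, 3, 2.5), (2, 0, 10.0)],
        ["order_id", "quantity", "unit_price"],
    )
    result = {r["order_id"]: r["total"] for r in add_order_total(input_df).collect()}
    assert result == {1: 7.5, 2: 0.0}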
Pipelines are difficult to share
When data pipelines can’t easily be shared, it negatively impacts the ability of data team
members to collaborate effectively on these pipelines, especially for QA purposes. This can
lead to a higher risk of inaccuracies in any analysis driven by these pipelines, or pipelines that
perform poorly. From a reusability standpoint, when pipelines can’t be shared easily, there is
a greater risk of duplication of work where multiple pipelines may get developed that do the
same work, which is a waste of time and resources.
The solution
Ensuring that code quality is maintained while reducing the impact of code complexity on self-service analytics can be tied to the capabilities provided for data engineering. These include:
• Visual tooling for building data pipelines, which simplifies the process of designing and managing complex data workflows, helps eliminate the need for complex code, and reduces the risk of errors
• Low-code development, which can help reduce the amount of manual coding required and improve the overall quality of the code
• Built-in data quality capabilities such as data validation, error handling, and remediation suggestions
• Strong collaborative capabilities like commenting and sharing, code reusability, and version control
Delivering on the value of data mesh with self-serve transformations
Prophecy is a low-code data engineering platform that combines the power of a visual ETL tool
with the flexibility and performance of custom-written code. Using drag and drop options, users
of varying skill levels can automatically generate high-quality code that can be easily reviewed,
optimized, and debugged on your cloud data platform of choice with no vendor lock-in.
Furthermore, Prophecy enables all data users to build data pipelines that adhere to the
principles of data mesh, such as domain-driven design and event-driven architecture, and can
be easily shared and reused across different teams or domains.
Prophecy empowers data users to implement data mesh with a self-serve, low-code platform, enabling
all data users to visually transform and ship trusted data with software engineering best practices.
Customer spotlight: Waterfall Asset Management
As a leader in managing complex financial
assets such as asset-backed credit, whole
loans, real assets, and private equity,
Waterfall Asset Management understands
how critical data and analytics are in order to
discover compelling investment opportunities.
Implementing their data mesh architecture
enabled them to greatly improve the
productivity of their data operations as well as
investment performance for their clients.
Business impact of Prophecy
• 14x improvement in data operations productivity
• 4x faster time-to-insight for trade desk analysts
• 3 hours to complete Prophecy POC
The Challenge
Manual processes hurt time-to-value
Waterfall Asset Management, a global investment management firm, has been on a mission
to leverage the power of data insights to improve investment performance and mitigate risk
for clients. However, their data operations teams were hampered by manual processes and
a legacy ETL system that buckled under the velocity and variety of their data at scale. Hiring
more engineers was the band-aid fix, slowing workflows and impacting data quality.
The Past
Prior to implementing their data mesh architecture, data delivery was slow: their legacy ETL system required manual interventions by data engineering and was unable to scale to support the needs of the business.
Missing investment opportunities
Less tech-savvy business users were either completely reliant on data engineering to access
and transform data or forced to perform manual data work rather than creating business
value. Their inability to deliver data-driven insights in a timely and accurate manner impacted
client services and portfolio performance.
The Solution
Empowering business users to do more with data
Waterfall chose Prophecy’s low-code data engineering platform on the Databricks Lakehouse
— giving business users an intuitive, self-serve tool to visually onboard and transform data
without needing to code or depend on engineering. Putting the power of data directly into
the hands of the users dramatically increased team productivity and customer satisfaction.
Open-source code and data governance built-in
As business users develop pipelines, Waterfall’s engineering team is able to access 100% open-source Spark code — automatically generated by the platform. This enables data engineers to apply software development best practices for collaboration and to ensure data products are performant and reliable.
The Future
Prophecy’s low-code data platform serves as the backbone for their data mesh architecture, enabling broader participation in data product delivery and providing massive performance improvements.
The Results
Data that moves at the speed of the market
Equipped with a low-code data platform that all data users can use, Waterfall has fast-tracked data engineering workflows, reducing the time to prepare and transform data from up to 2 weeks to about ½ a day — a 14x performance gain.
The benefits of faster data engineering have been felt downstream, as business users (trade desk) are now receiving actionable data within 2 hours, compared to a day when they were dependent on the engineering team — a 4x improvement.
And with repeatable frameworks in place, data pipelines can be standardized and easily
shared across teams, reducing overall effort and operational errors.
“Prophecy removes the engineering
barriers that were blocking our ability to
help our clients drive investment value
with data. Now anyone on our team can go
from data to insights much faster, without
having to speak to a team of engineers.”
Shehzad Nabi
Chief Technology Officer
Streamlining the path to better investments
With the tools in place for all data users to accelerate data onboarding, the data operations
team has been able to shift their focus toward more strategic tasks like data governance in
an effort to address data quality problems before they reach the end user. With Prophecy in
place, it’s clear that Waterfall’s focus on leveraging data to improve investment decisions is
paying off and will continue to drive their business into the future.
Try Prophecy and realize self-service success
Prophecy helps organizations avoid the common obstacles standing in the way of implementing
self-service analytics. Through a low-code platform, both technical and business data users are
able to easily build high-quality data products in a collaborative and transparent way, enabling faster decision-making and greater business agility.
Get started with a free trial
You can create a free account and get full access to all platform features for 14 days.
Want more of a guided experience?
Request a demo and we’ll walk you through how Prophecy can empower your entire data
team with low-code ETL today.
Prophecy is a low-code data transformation platform that offers an easy-to-use visual interface to build, deploy, and manage data
pipelines with software engineering best practices. Prophecy is trusted by enterprises including multiple companies in the Fortune
50 where hundreds of engineers run thousands of ETL workloads every day. Prophecy is backed by some of the top VCs including
Insight Partners and SignalFire. Learn how Prophecy can help your data engineering in the cloud at www.prophecy.io.