
Materialize Overview
Materialize: the fastest and simplest way to build with real-time data
Data is the foundation for decision-making in the modern era. Whether it is machine learning models, fraud detection algorithms, cart abandonment alerts or customer 360 dashboards - businesses need to harness real-time data effectively to create actionable intelligence. Batch processes that run at scheduled intervals cannot generate correct and timely insights, and developers need a fresh approach to build event-driven applications.

Challenges while building with real-time data

While building with real-time data, developers face a pick-two-of-three compromise between speed (also known as latency), features and cost.

- Sacrificing Speed: if developers choose OLAP databases (or data warehouses) that allow them to manage complex queries in SQL, thereby prioritizing features and cost, they tend to sacrifice latency, as these solutions are designed to work with batch data.

- Forgoing Features: if developers choose analytical or indexing databases that provide fast results for frequently executed queries, they are forced to sacrifice features such as support for standard SQL or JOINs (these analytical databases are optimized for wide tables, with columnar indexes, and often require denormalized data for low-latency results).

- Compromising Cost: if developers choose to build exactly what’s needed, thereby choosing speed and features, they compromise on cost, as custom microservices development on top of Kafka pipelines requires heavy investment in engineering hours and resources - not to forget the opportunity cost of building a custom solution instead of investing in revenue-generating activities.

Developers need a better way to build applications with real-time data: a simpler and faster alternative that doesn’t involve these tradeoffs. The Materialize platform was created to solve these problems.
What is Materialize?
Materialize is a source-available streaming database, written in Rust, that maintains the results of a SQL query (a materialized view) in memory and provides correct answers even as the underlying data changes. Materialize uses SQL as its interface and was built with an emphasis on comprehensive SQL surface area, correctness, and efficiency. Materialize supports a variety of data sources and sinks, making it easy to integrate into an existing data ecosystem.

"Materialize is a new kind of database that gives you the power of PostgreSQL, the stream processing capabilities of Apache Flink and the speed of Redis."
Why use a materialized view?
Traditional approaches to working with data in motion involve (a) collecting the real-time event feed as a Kafka topic, (b) using a microservice to write these events into a database table and (c) using a dashboard in a BI tool to query the database table. The problem with this approach is that the database needs to perform expensive aggregations at query time (e.g. GROUP BY, ORDER BY, JOINs). As more records are added, query latency tends to increase.

One alternative to overcome this query latency problem is a materialized view, where query results are precomputed and stored for fast read access. However, materialized views also need to be periodically refreshed when the underlying data changes; otherwise query results will be inaccurate because they are based on stale data.

Materialize solves this problem by allowing users to define a materialized view on top of incoming data and then incrementally updating it as new data arrives. This is a key distinction: rather than recalculating the answer each time the materialized view is queried, Materialize incrementally maintains the view and gives you the correct answer stored in memory. What’s more, Materialize can do this in the presence of complex JOINs and arbitrary inserts, updates or deletes in the underlying data.

The end result: your applications can query Materialize and get blazing fast results, often in the millisecond latency range. With these incrementally updated materialized views, users can get real-time insights while maintaining relevant historical context.
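To make this concrete, here is a minimal sketch using Materialize’s 2022-era SQL; the Kafka broker, topic, schema registry address and column names are all hypothetical:

    -- Hypothetical Kafka source of page-view events
    CREATE SOURCE page_views
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'page_views'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';

    -- An incrementally maintained count of views per page
    CREATE MATERIALIZED VIEW page_view_counts AS
    SELECT page_id, COUNT(*) AS views
    FROM page_views
    GROUP BY page_id;

    -- Reads return the already-computed answer held in memory
    SELECT * FROM page_view_counts WHERE views > 1000;

Because the view is maintained incrementally as events arrive, the final SELECT performs no aggregation work at read time.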
How does Materialize work?
Materialize is built on top of two open source projects, Timely Dataflow and Differential Dataflow, which were created by Materialize co-founder Frank McSherry. Timely Dataflow is a horizontally scalable, low-latency dataflow engine, on which Differential Dataflow builds relational dataflow operators.

When you create a materialized view in Materialize, the system creates a dataflow. A dataflow is a topology of ongoing transformations that tells Materialize what the final query output should be. The sequence or order isn’t as important, as long as the dataflow maintains the correctness of the final state. Essentially, every SQL view or query gets converted into a data-parallel dataflow. Once executed, the dataflow computes and stores the result of the SQL query in memory, polls the source for updates, and incrementally updates the query results when new data arrives.

Each materialized view contains at least one index that maintains the embedded query’s result in memory; these continually updated indexes are known as "arrangements". Arrangements let Materialize perform sophisticated operations like JOINs more quickly. When reading from a materialized view, Materialize simply returns the dataflow’s current result set.

Sinks allow users to stream data out of Materialize and represent a connection to an external stream, such as Kafka. When a user defines a sink over a materialized view or source, Materialize automatically generates the required schema and writes out the stream of changes to that view or source. In effect, Materialize sinks act as change data capture (CDC) producers for the given source or view and allow users to stream the output of a materialized view to a Kafka topic. Sinks can be used to power an event-driven application or an alerting/notification service.
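For instance, streaming the hypothetical page_view_counts view from the earlier sketch back out to Kafka might look like this (again 2022-era syntax, with placeholder broker and topic names):

    -- Publish the view's change stream to a Kafka topic as a CDC feed
    CREATE SINK page_view_counts_sink
    FROM page_view_counts
    INTO KAFKA BROKER 'kafka:9092' TOPIC 'page-view-counts'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';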
Why is Materialize uniquely suited for building with real-time data?
Benefit #1 - Standard SQL

Materialize conforms to the ANSI SQL-92 standard. SQL is an industry standard for managing data and is a comprehensive and mature language. As a declarative language, it allows developers to ask questions of their data and explore it in all its fullness and complexity. SQL skills are abundant in the market, and with a low barrier to entry, SQL is the fastest and easiest way to build applications on data.

"Working with Materialize has been an incredibly seamless process as we can continue to write real-time SQL, exactly the same way as we already are in Snowflake with batch, so it was a much lower barrier to entry."
- Emily Hawkins, Data Infrastructure Lead at Drizly

Benefit #2 - Complex JOINs

Materialize supports all manner of JOINs - Inner, Left Outer, Right Outer, Full Outer, Cross, Lateral and N-way joins. This means that developers do not have to pre-process their data and denormalize it into wide tables; they can use standard SQL JOINs in Materialize to combine data in motion (streaming) and data at rest (batch) with no restriction (see the sketch after Benefit #3 below). Materialize can share and reuse state effectively, which means users can maintain joins using fewer resources. With Materialize, users get all the power of a full stream processor like Apache Flink, with the added flexibility of joins based on standard SQL, like any PostgreSQL database.

Benefit #3 - High Performance

Materialize is built on top of Timely and Differential Dataflow, open source projects with close to a decade of development that have already been deployed by Fortune 500 businesses at global scale. When users create materialized views and issue queries, Materialize creates execution plans that are referred to as "dataflows". On execution, a dataflow computes and incrementally maintains the result of the SQL query, as defined in the materialized view, in an in-memory result set. Dataflows run continually and poll the source for changes. Computations are triggered by the arrival of new data and do work proportional only to the necessary changes rather than to the total data volume. This provides the scalability and throughput of a stream processor like Apache Flink, the resource efficiency of a PostgreSQL relational database and the performance of an in-memory database like Redis.
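Returning to the joins described in Benefit #2, here is a hedged sketch of enriching a streaming orders relation with reference customer data; all table and column names are invented for the example:

    -- A standard SQL join across streaming and reference data,
    -- incrementally maintained by Materialize
    CREATE MATERIALIZED VIEW enriched_orders AS
    SELECT o.order_id, o.amount, c.name, c.segment
    FROM orders o
    JOIN customers c ON o.customer_id = c.id;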
Benefit #4 - Correctness over Eventual Consistency

In Materialize, all updates bear a logical timestamp, an unambiguous indication of when the update "takes place". These logical timestamps allow Materialize to provide deterministic, always consistent and correct query results, without requiring fine-grained coordination between workers or isolation between their work items.

All operators preserve this logical timestamp in their output and thereby maintain a consistent view of the results. Query results are always correct with respect to this timestamp, and never out of sync with one another, even though their execution is asynchronous and spread across multiple parallel workers.

Benefit #5 - Efficient Resource Consumption

Materialize allows you to sustain a small hardware footprint by maintaining indexed summaries of relevant data history and then updating the summaries incrementally as new data is streamed in. By contrast, legacy systems store data at the lowest level of granularity, requiring increasing investments in hardware as the data volume grows.

Materialize is written in Rust, which is well suited to the performance and correctness needs of data-intensive computing and does not have to contend with garbage collection.

Benefit #6 - Interoperability

Materialize is compatible with PostgreSQL, so you have access to the entire suite of software products and services available in the Postgres ecosystem.

Materialize connects to:

- Ingress sources* such as Kafka and Kinesis; it can connect directly to PostgreSQL or to other databases via CDC (change data capture), and it can ingest batch data from Amazon S3
- Egress can be pushed out as sinks* to Kafka, with streaming connections to all standard PostgreSQL libraries and drivers
- Egress can be pulled (queried) out into existing data tools that have a Postgres-compatible connector

*For a full list of sources and sinks, see this documentation link.
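As a rough illustration of that interoperability: from psql, or any tool with a standard Postgres driver, you can query a view directly or stream its changes with the 2022-era TAIL command. The view name below comes from the earlier hypothetical join sketch:

    -- An ordinary Postgres-style query against Materialize
    SELECT * FROM enriched_orders;

    -- Stream the view's changes as they happen (psql-friendly form)
    COPY (TAIL enriched_orders) TO STDOUT;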
“With Materialize, everything speaks
Postgres. Everything has a Postgres
database adapter, and all it cares
about is that you have some stream
of data. It’s super easy to just drop
into pretty much any architecture. I
think for us that was actually the
biggest benefit of Materialize, that it
speaks Kafka on all these different
streams but it also speaks
Postgres.”
- Michael Francis, Software
Engineer at SproutFi
Materialize Use Cases
Application infrastructure
A materialized view, sometimes called a "materialized cache," is an approach to precomputing the results of a query and storing them for fast read access. In contrast with a regular database query, which does all of its work at read time, a materialized view does nearly all of its work at write time. This is why materialized views can offer highly performant reads.

A standard way of building a materialized cache is to capture the changelog of a database and process it as a stream of events. This creates multiple distributed materializations, each suited to a particular application's query patterns, and it means running multiple systems: the relational database, Kafka clusters, connectors, a stream processor like Apache Flink, and another key-value data store like Redis.
Materialize simplifies this architecture, while providing the power of standard SQL, flexible JOINs and
incremental computation to solve the problem of cache invalidation.
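A minimal sketch of that simplification, assuming a hypothetical cart_events stream: a single incrementally maintained view stands in for the separate stream processor and key-value cache, and no invalidation logic is needed:

    -- A "materialized cache" of per-user cart totals
    CREATE MATERIALIZED VIEW cart_totals AS
    SELECT user_id, SUM(price * quantity) AS cart_total
    FROM cart_events
    GROUP BY user_id;

    -- The application reads the cache with an ordinary point lookup
    SELECT cart_total FROM cart_totals WHERE user_id = 42;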
Stream processing, exploration and enrichment
A streaming ETL pipeline, sometimes called a "streaming data pipeline," is a set of software services that ingests events, transforms them, and loads them into destination storage systems. Streaming ETL involves not only processing data in motion but often also changing the data as it is transferred (e.g., stripping personally identifiable information, or something more complex like enriching events by joining them with data from another system).
Materialize enables stream processing, exploration and enrichment as it simplifies connecting to arbitrary
sources and sinks, manipulating in-flight data with standard SQL and combining real-time datasets with
streaming JOINs.
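Both kinds of in-flight changes mentioned above can be written as plain SQL views; the raw_events and geo_lookup relations here are hypothetical:

    -- Strip PII in flight by projecting only non-sensitive columns
    CREATE MATERIALIZED VIEW clean_events AS
    SELECT event_id, event_type, ip_prefix, created_at
    FROM raw_events;

    -- Enrich events by joining them with data from another system
    CREATE MATERIALIZED VIEW enriched_events AS
    SELECT e.event_id, e.event_type, g.region, e.created_at
    FROM clean_events e
    JOIN geo_lookup g ON e.ip_prefix = g.ip_prefix;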
Real-time analytics
Modern businesses frequently use real-time analytics such as customer 360 views, gaming
leaderboards, supply chain route optimizations, inventory tracking alerts, fraud detection, real-time
dashboards, and more. Analytics engineers are tasked with creating reliable data pipelines that support
the analysis, and Data Analysts are tasked with building the reports and gathering insights from this
data.
Materialize enables real-time analytics with its dbt adapter (which allows analytics engineers to create materialized views and execute queries from their existing dbt projects), integrations with visualization tools such as Metabase, Hex*, Looker* and Tableau* (*expected later in 2022) that facilitate the creation of reports and dashboards, and the ability to send data from Materialize to an external sink that can power event-driven analytics.
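With the dbt adapter, a model becomes an incrementally maintained view just by setting the materialization type. The model below is a hypothetical sketch; the order_items model it refs is assumed to exist:

    -- models/top_products.sql (dbt model using the Materialize adapter)
    {{ config(materialized='materializedview') }}

    SELECT product_id, SUM(quantity) AS total_sold
    FROM {{ ref('order_items') }}
    GROUP BY product_id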
Deploying Materialize: Cloud or Self-hosted
Materialize is multi-threaded and executes as a single, source-available binary process called materialized. While it is officially supported only on Linux x86-64 platforms, you can build and install Materialize on macOS, FreeBSD, and most other operating systems.
Developers can also choose to leverage
Materialize Cloud, a fully hosted SaaS solution
that automates administrative tasks like
hardware provisioning, database setup,
upgrades, and backups.
“With Materialize, we can take the
same analytics that used to be
embedded in our reports, and use
them to let people know as soon as
something becomes an issue, rather
than them needing to find any
report or a dashboard. This allows
us to move from manual to
automated decision making.”
- Josh Ahrenberg, Director of
Engineering at Datalot
Materialize can be deployed on a single computer or scaled out across multiple computers* (*expected later in 2022), because it is designed as a parallel data computing platform that connects low-latency operations into a larger reactive computation as needed.
Want to learn more?
Materialize blends the scalability of PostgreSQL and the features of stream processors such as Apache Flink with the speed of in-memory databases like Redis - all with a familiar SQL interface and a strong consistency model that works the way you would expect.
We encourage you to download and try Materialize or sign up for a free trial of Materialize
Cloud.
Join the discussion at http://materialize.com/chat