Materialize Overview

Materialize: the fastest and simplest way to build with real-time data

Data is the foundation for decision-making in the modern era. Whether it is machine learning models, fraud detection algorithms, cart abandonment alerts, or customer 360 dashboards, businesses need to harness real-time data effectively to create actionable intelligence. Batch processes that run at scheduled intervals cannot generate correct and timely insights, and developers need a fresh approach to building event-driven applications.

Challenges while building with real-time data

While building with real-time data, developers face a pick-two-of-three compromise between speed (also known as latency), features, and cost.

- Sacrificing Speed: if developers choose OLAP databases (or data warehouses) that allow them to manage complex queries in SQL, thereby prioritizing features and cost, they tend to sacrifice latency, as these solutions are designed to work with batch data.

- Forgoing Features: if developers choose analytical or indexing databases that provide fast results for frequently executed queries, they are forced to sacrifice features such as support for standard SQL or JOINs, as these analytical databases are optimized for wide tables with columnar indexes and often require denormalized data for low-latency results.

- Compromising Cost: if developers choose to build exactly what's needed, thereby choosing speed and features, they compromise on cost, as custom microservice development on top of Kafka pipelines requires heavy investment in engineering hours and resources - not to mention the opportunity cost of building a custom solution instead of investing in revenue-generating activities.

Developers need a better way to build applications with real-time data: a simpler and faster alternative that doesn't involve these tradeoffs. The Materialize platform was created to solve these problems.

What is Materialize?

Materialize is a source-available streaming database, written in Rust, that maintains the results of a SQL query (a materialized view) in memory and provides correct answers even as the underlying data changes. Materialize uses SQL as its interface and was built with an emphasis on comprehensive SQL surface area, correctness, and efficiency. Materialize supports a variety of data sources and sinks, making it easy to integrate into an existing data ecosystem.

Materialize is a new kind of database that gives you the power of PostgreSQL, the stream processing capabilities of Apache Flink, and the speed of Redis.

Why use a materialized view?

Traditional approaches to working with data in motion involve (a) collecting the real-time event feed as a Kafka topic, (b) using a microservice to write these events into a database table, and (c) using a dashboard in a BI tool to query the database table. The problem with this approach is that the database needs to perform expensive aggregations at query time (e.g., GROUP BY, ORDER BY, JOINs). As more records are added, query latency tends to increase.

One alternative for overcoming this query latency problem is a materialized view, where query results are precomputed and stored for fast read access. However, materialized views also need to be periodically refreshed when the underlying data changes; otherwise, query results will be inaccurate because they are based on stale data.

Materialize solves this problem by allowing users to define a materialized view on top of incoming data and then incrementally updating it as new data arrives. This is a key distinction: rather than recalculating the answer each time the materialized view is queried, Materialize incrementally maintains the view and gives you the correct answer stored in memory. What's more, Materialize can do this in the presence of complex JOINs and arbitrary inserts, updates, or deletes in the underlying data. The end result: your applications can query Materialize and get blazing-fast results, often in the millisecond latency range. With these incrementally updated materialized views, users can get real-time insights while maintaining relevant historical context.
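As an illustrative sketch, here is what this pattern can look like in Materialize's SQL dialect. The broker address, topic, and field names below are hypothetical, and the exact source-creation syntax varies across Materialize versions:

    -- Ingest a hypothetical Kafka topic of page-view events.
    CREATE SOURCE page_views
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'page_views'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';

    -- Define an incrementally maintained aggregate over that stream.
    CREATE MATERIALIZED VIEW views_by_page AS
    SELECT page_id, COUNT(*) AS view_count
    FROM page_views
    GROUP BY page_id;

    -- Reads simply return the current result set, already computed.
    SELECT * FROM views_by_page WHERE page_id = 42;

The SELECT at the end does no aggregation work of its own; the GROUP BY is maintained continuously as events arrive, which is what keeps read latency low.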
How does Materialize work?

Materialize is built on top of two open source projects, Timely Dataflow and Differential Dataflow, which were created by Materialize co-founder Frank McSherry. Timely Dataflow is a horizontally scalable, low-latency dataflow engine, on which Differential Dataflow builds relational dataflow operators.

When you create a materialized view in Materialize, the system creates a dataflow. A dataflow is a topology of ongoing transformations that tells Materialize how the final query output should be computed. The sequence or order of operations isn't as important as long as the dataflow maintains the correctness of the final state. Essentially, every SQL view or query gets converted into a data-parallel dataflow. Once executed, the dataflow computes and stores the result of the SQL query in memory, polls the source for updates, and incrementally updates the query results when new data arrives.

Each materialized view contains at least one index that maintains the embedded query's result in memory; these continually updated indexes are known as "arrangements". Arrangements let Materialize perform more sophisticated operations like JOINs more quickly. When reading from a materialized view, Materialize simply returns the dataflow's current result set.

Sinks allow users to stream data out of Materialize and represent a connection to an external stream, such as Kafka. When a user defines a sink over a materialized view or source, Materialize automatically generates the required schema and writes out the stream of changes to that view or source. In effect, Materialize sinks act as change data capture (CDC) producers for the given source or view and allow users to stream the output of a materialized view to a Kafka topic. Sinks can be used to power an event-driven application or an alerting/notification service.
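As a sketch, a sink over the views_by_page view defined earlier might be declared as follows. The broker and topic names are hypothetical, and sink syntax also varies across Materialize versions:

    -- Stream every change to the view out to a hypothetical Kafka topic,
    -- effectively acting as a CDC producer for the view.
    CREATE SINK views_by_page_sink
    FROM views_by_page
    INTO KAFKA BROKER 'kafka:9092' TOPIC 'views-by-page-updates'
    FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';

Downstream consumers of that topic then see an ordered stream of updates to the view rather than raw input events.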
Why is Materialize uniquely suited for building with real-time data?

Benefit #1 - Standard SQL

Materialize conforms to the ANSI SQL-92 standard. SQL is an industry standard for managing data and is a comprehensive and mature language. As a declarative language, it allows developers to ask questions of their data and explore it in all its fullness and complexity. SQL skills are abundant in the market, and with a low barrier to entry, SQL is the fastest and easiest way to build applications on data.

"Working with Materialize has been an incredibly seamless process as we can continue to write real-time SQL, exactly the same way as we already are in Snowflake with batch, so it was a much lower barrier to entry." - Emily Hawkins, Data Infrastructure Lead at Drizly

Benefit #2 - Complex JOINs

Materialize supports all manner of JOINs - inner, left outer, right outer, full outer, cross, lateral, and N-way joins. This means that developers do not have to pre-process their data and denormalize it into wide tables; they can use standard SQL JOINs in Materialize to combine data in motion (streaming) and data at rest (batch) with no restrictions. Materialize can share and reuse state effectively, which means users can maintain joins using fewer resources. With Materialize, users get all the power of a full stream processor like Apache Flink, with the added flexibility of joins based on standard SQL, like any PostgreSQL database.
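For illustration, a multi-way streaming join might look like the following sketch, assuming hypothetical orders, customers, and products sources have already been defined:

    -- Enrich a stream of orders by joining it against two other collections,
    -- with no upfront denormalization into wide tables.
    CREATE MATERIALIZED VIEW enriched_orders AS
    SELECT
        o.order_id,
        c.customer_name,
        p.product_name,
        o.quantity * p.unit_price AS order_total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products p ON o.product_id = p.product_id;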
Benefit #3 - High Performance

Materialize is built on top of Timely and Differential Dataflow, open source projects with close to a decade of development behind them that have already been deployed by Fortune 500 businesses at global scale. When users create materialized views and issue queries, Materialize creates execution plans referred to as "dataflows". On execution, a dataflow computes and incrementally maintains the result of the SQL query, as defined in the materialized view, in an in-memory result set. Dataflows run continually and poll the source for changes. Computations are triggered by the arrival of new data and do work proportional only to the necessary changes rather than to the total data volume. This provides the scalability and throughput of a stream processor like Apache Flink, the resource efficiency of a PostgreSQL relational database, and the performance of an in-memory database like Redis.

Benefit #4 - Correctness over Eventual Consistency

In Materialize, all updates bear a logical timestamp, an unambiguous indication of when the update "takes place". These logical timestamps allow Materialize to provide deterministic, always consistent, and correct query results, without requiring fine-grained coordination between workers or isolation between their work items. All operators preserve this logical timestamp in their output and thereby maintain a consistent view of the results. Query results are always correct with respect to this timestamp, and never out of sync with one another, even though their execution is asynchronous and spread across multiple parallel workers.

Benefit #5 - Efficient Resource Consumption

Materialize allows you to sustain a small hardware footprint by maintaining indexed summaries of relevant data history and updating those summaries incrementally as new data streams in. By contrast, legacy systems store data at the lowest level of granularity, requiring increasing investments in hardware as data volume grows. Materialize is written in Rust, which is well suited to the performance and correctness needs of data-intensive computing and does not have to contend with garbage collection.

Benefit #6 - Interoperability

Materialize is compatible with PostgreSQL, so you have access to the entire suite of software products and services available in the Postgres ecosystem. Materialize connects to:

– Ingress: sources* such as Kafka and Kinesis; Materialize can also connect directly to PostgreSQL or to other databases via CDC (change data capture), and it can ingest batch data from Amazon S3.
– Egress (push): data can be pushed out as sinks* to Kafka, or streamed to all standard PostgreSQL libraries and drivers.
– Egress (pull): data can be queried out by any existing data tool that has a Postgres-compatible connector.

*For a full list of sources and sinks, see the documentation.

"With Materialize, everything speaks Postgres. Everything has a Postgres database adapter, and all it cares about is that you have some stream of data. It's super easy to just drop into pretty much any architecture. I think for us that was actually the biggest benefit of Materialize, that it speaks Kafka on all these different streams but it also speaks Postgres." - Michael Francis, Software Engineer at SproutFi

Materialize Use Cases

Application infrastructure

A materialized view, sometimes called a "materialized cache," is an approach to precomputing the results of a query and storing them for fast read access. In contrast with a regular database query, which does all of its work at read time, a materialized view does nearly all of its work at write time. This is why materialized views can offer highly performant reads. A standard way of building a materialized cache is to capture the changelog of a database and process it as a stream of events. This creates multiple distributed materializations, each suited to a particular application's query patterns, but it results in running multiple systems - the relational database, Kafka clusters, connectors, a stream processor like Apache Flink, and another key-value data store like Redis. Materialize simplifies this architecture while providing the power of standard SQL, flexible JOINs, and incremental computation to solve the problem of cache invalidation.

Stream processing, exploration and enrichment

A streaming ETL pipeline, sometimes called a "streaming data pipeline," is a set of software services that ingests events, transforms them, and loads them into destination storage systems. Streaming ETL involves not only processing data in motion but often also changing the data as it is transferred (e.g., stripping personally identifiable information, or something more complex like enriching events by joining them with data from another system). Materialize enables stream processing, exploration, and enrichment by simplifying connections to arbitrary sources and sinks, manipulating in-flight data with standard SQL, and combining real-time datasets with streaming JOINs.

Real-time analytics

Modern businesses frequently use real-time analytics such as customer 360 views, gaming leaderboards, supply chain route optimizations, inventory tracking alerts, fraud detection, real-time dashboards, and more. Analytics engineers are tasked with creating reliable data pipelines that support this analysis, and data analysts are tasked with building reports and gathering insights from the data. Materialize enables real-time analytics with the dbt adapter (which allows analytics engineers to create materialized views and execute queries from their existing dbt projects, as sketched below), integrations with visualization tools such as Metabase, Hex*, Looker*, and Tableau* (*expected later in 2022) that facilitate the creation of reports and dashboards, and the ability to send data from Materialize to an external sink that can power event-driven analytics.
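As an illustration, a dbt model using the dbt-materialize adapter can declare a Materialize materialized view directly in SQL. The model, table, and column names below are hypothetical, and the adapter's materialization names may differ across versions:

    -- models/fraud_alerts.sql (a hypothetical dbt model)
    {{ config(materialized='materializedview') }}

    -- Maintained incrementally by Materialize rather than rebuilt on each dbt run.
    SELECT
        account_id,
        COUNT(*) AS suspicious_events
    FROM {{ ref('transactions') }}
    WHERE status = 'flagged'
    GROUP BY account_id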
Deploying Materialize: Cloud or Self-hosted

Materialize is multi-threaded and executes as a single, source-available binary process called materialized. While it is only officially supported on Linux x86-64 platforms, you can build and install Materialize on macOS, FreeBSD, and most other operating systems. Developers can also choose to leverage Materialize Cloud, a fully hosted SaaS solution that automates administrative tasks like hardware provisioning, database setup, upgrades, and backups.

Materialize can be deployed on a single computer or scaled out across multiple computers* (*expected later in 2022), because it is designed as a parallel data computing platform that connects low-latency operations into a larger reactive computation as needed.

"With Materialize, we can take the same analytics that used to be embedded in our reports and use them to let people know as soon as something becomes an issue, rather than them needing to find a report or a dashboard. This allows us to move from manual to automated decision making." - Josh Ahrenberg, Director of Engineering at Datalot

Want to learn more?

Materialize blends the power of PostgreSQL and the capabilities of stream processors such as Apache Flink with the speed of in-memory databases like Redis - all behind a familiar SQL interface and a strong consistency model that works the way you would expect. We encourage you to download and try Materialize or sign up for a free trial of Materialize Cloud. Join the discussion at http://materialize.com/chat