1. Introduction
Streaming SQL means querying a rapidly changing flow of data using the same declarative SQL used to write database queries.
That’s useful because:

- Data is often more valuable when you can act on it quickly.
- Existing tools for gaining real-time insight from data streams are too complex.
The “declarative” nature of SQL plays an important role in addressing the second point, because it allows users to focus on what they want while the underlying engine worries about how to get there.
In the real world, streaming SQL is used to:

- Enable new internal and customer-facing insights, automation, and applications
- Enhance the value of business intelligence data by providing a single, up-to-date source of truth for key metrics
- Simplify microservices by replacing code that coordinates and transforms data
2. What is streaming SQL?
Let’s first specify what we mean by stream processing and SQL.
Streams (event streams)
Streams here refer to message brokers such as Kafka, Kinesis, or Pulsar that handle data as a continuous flow of events or messages.
Event streams carry everything from transactions to user actions on websites or mobile applications, IoT sensor data, server metrics, and even activity in traditional databases captured through change data capture (CDC).
SQL
In the context of streams, SQL gives users a declarative language for:

- Creating views that join, filter, and group raw stream data into more useful outputs (CREATE MATERIALIZED VIEW)
- Querying data from sources and views (SELECT)
Note: The CREATE MATERIALIZED VIEW command is the core concept of streaming SQL. It comes from databases, where it is used to precompute a view ahead of time in case the underlying data changes. In streaming, the data is always changing, so continuously maintained materialized views are often more useful than one-off queries.
Other common SQL verbs such as INSERT, UPDATE, and DELETE also play a role in streaming SQL, but in this article we will focus on the core concepts: reading streams, joining/filtering/transforming them, and making the output queryable or writing it to a new stream.
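Before any of that, the engine needs to know where the stream lives. As a sketch, Materialize attaches a Kafka topic with a CREATE SOURCE statement; the broker address, topic name, and schema registry URL below are placeholders, and the exact syntax varies by version:

```sql
-- Hypothetical source definition: attach the 'invoices' Kafka topic
-- as a SQL-queryable stream. All addresses here are placeholders.
CREATE SOURCE invoices
FROM KAFKA BROKER 'kafka:9092' TOPIC 'invoices'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```

Once a source is defined, it can be referenced in SELECT statements and materialized views like any table.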
3. The difference between SQL on a stream and a database
Once you try using SQL on streams, some key differences become apparent.
Point-in-time queries vs. continuous queries
Running an SQL query on a traditional database returns a static set of results from a single point in time.
Take this query, for example:
SELECT SUM(amount) as total_amount FROM invoices;
When you run this, the database engine scans all invoices that exist at the time of the query and returns the sum of their amounts.
With streaming SQL, you can run the exact query above and get a point-in-time answer. But you’re querying a rapidly changing data stream, and by the time you get the results, they’re probably already out of date. In many cases, a continuously updated query (a materialized view) is more useful, for reasons we describe below.
To turn the query above into a materialized view, you write:
CREATE MATERIALIZED VIEW invoice_summary AS
SELECT SUM(amount) as total_amount FROM invoices;
When you first create it, the SQL engine processes the entire history of invoice events it has access to, up to now, and then continues to update the view as new invoice events arrive.
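Once the view exists, reading the continuously maintained result is an ordinary point-in-time query. Assuming the invoice_summary view above, this returns the up-to-date total without rescanning the stream’s history:

```sql
-- Reads the current value of the maintained aggregate.
SELECT total_amount FROM invoice_summary;
```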
4. Response time and lag
Traditional databases have the concept of query response time: you run a query, some time passes while the engine computes the results, and then you get a response.
In stream processing, response time really only comes into play once, when you first materialize a view. After that, if input events suddenly surge, the penalty shows up differently: as time lag, a measure of how far the output is behind the input.
Just like response time in traditional databases, most end users don’t need to think about a streaming system’s time lag, but knowing it exists helps you write and use streaming SQL in ways that avoid problems.
Different actions create work for the underlying engine
On the read side, a traditional database engine sits idle until it receives a query; it then plans and optimizes it and starts working to produce the results. Once it returns them, it sits idle again until the next query arrives. Queries create work for the engine.
In the materialized view above, by contrast, new data from the stream creates work for the engine. In Materialize, this is achieved through incremental computation: the work done to update the view is proportional to the incoming data, not to the complexity of the query. There is no need to rescan all the data to update the results.
This paradigm shift makes streaming SQL best suited for queries that ask the same question repeatedly (such as dashboards, reports, automation, and most application code) rather than ad hoc queries.
Why is streaming SQL useful?
1. Data is often most valuable when it first appears
There are two reasons for this, one obvious and one not so obvious.
Faster data = faster decisions – the stock market is a clear example of this idea taken to an extreme.
But it also applies to SaaS businesses; to verticals like marketing, travel, and events that require quick decisions on pricing and inventory; to retail and logistics, where quick decisions can reduce inefficiencies; and so on.
The closer the data is to its source, the less chance it has of being misunderstood: every step between where data is created and where it is used adds an opportunity for error, where the end user (human or machine) believes the data represents something it doesn’t. Time plays a role here too, forcing people to coordinate around the order of operations and the consistency of work. In that case, the reason to switch to streaming data isn’t speed; it’s that you no longer have to worry about timing.
2. SQL is a great way to gain insight from streaming data
Here is another example of a materialized view over streaming events:
CREATE MATERIALIZED VIEW experiments AS
SELECT
  experiment_views.name,
  experiment_views.variant,
  COUNT(DISTINCT(experiment_views.user_id)) as unique_users,
  COUNT(DISTINCT(conversions.user_id)) as unique_conversions
FROM experiment_views
LEFT JOIN conversions
  ON conversions.user_id = experiment_views.user_id
  AND conversions.created_at > experiment_views.created_at
GROUP BY experiment_views.name, experiment_views.variant;
- SQL isn’t unique to stream processing: the meaning of the data doesn’t change when it moves from a stream to a database, so neither should the way we query it.
- Its declarative nature increases productivity: the developer has few optimization decisions to make, especially compared with doing the same work in application code.
SQL has the added benefit of being a mature language, more than 30 years old, with an ecosystem of tools and education built around it. That means more developers can work with streaming data and easily integrate it into the rest of their stack.
Use cases for streaming SQL
Today, anyone already using a message broker like Kafka can start using streaming SQL with little effort. In the future, as CDC software matures, that bar will drop to “anyone with a database.” Here are some examples of streaming SQL in use.
Business intelligence and analytics
When deciding how best to empower an internal team to make intelligent decisions from data, streaming SQL is an option worth considering; its trade-offs make it a better fit for some cases than others.
In many cases, a materialized view over primary source data, maintained with streaming SQL, makes for a simpler data pipeline. Beyond the benefits of real-time data, companies using this approach sidestep the following issues:
- Coordinating time windows and the order of operations in batch processing
- Extended downtime caused by an error that can’t be fixed or tested until the next batch run
- Slow-loading dashboards
- Inconsistencies caused by caching and denormalization
Microservices
Streaming SQL can replace code that does complex data coordination and transformation in microservices.
Event streams like Kafka are often first-class citizens in a microservices architecture. Engineers often find themselves building and maintaining complex applications that consume from Kafka, for example, services that read event logs to produce insights and usage metrics for a SaaS API.
Any component of a microservice that looks like a query can potentially be replaced by streaming SQL.
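As a sketch of the API-metering example above, a consumer service that counts requests per account could be replaced by a single view (the api_events source and its columns are hypothetical):

```sql
-- Hypothetical api_events stream with account_id and endpoint columns.
-- The view replaces a bespoke Kafka consumer that tallied usage.
CREATE MATERIALIZED VIEW api_usage AS
SELECT account_id, endpoint, COUNT(*) AS request_count
FROM api_events
GROUP BY account_id, endpoint;
```

Downstream services then query api_usage like an ordinary table instead of maintaining their own consumer and state.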
Real-time applications
If the value of your application depends on your ability to deliver updates and data in real time, streaming SQL may be an alternative to building an expensive or complex multi-component stack.
New capabilities
Real-time analytics for users: previously, only tech giants like LinkedIn and Google had the scale and engineering teams to build user-facing real-time analytics (such as LinkedIn’s “Who viewed your profile” page or Google Analytics’ real-time dashboard). By reducing complexity, streaming SQL opens the magic of real-time user analytics to more companies.
Business automation: once you have streaming SQL powering real-time dashboards, a natural next step is to start making automated decisions on the same data (for example, if your e-commerce site sees a spike in traffic from a particular source, add a promotion to the home page).
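That kind of automation can start from a view like the following (the pageviews source and its referrer_source column are hypothetical; detecting the spike and reacting would happen in the application that polls this view):

```sql
-- Hypothetical pageviews stream; the view keeps a live count per source
-- that an automation job can poll to detect traffic spikes.
CREATE MATERIALIZED VIEW traffic_by_source AS
SELECT referrer_source, COUNT(*) AS pageview_count
FROM pageviews
GROUP BY referrer_source;
```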
6. Summary
Materialize provides a streaming SQL implementation that is unique in two important respects:
In Materialize, you can write queries in Postgres-compatible SQL. We think the extra effort to build such a system is worth it, because only with this level of SQL compatibility can you reap the benefits of integrating with existing tools and relieve users of the burden of advanced stream-processing concepts.
The query engine uses Differential Dataflow to efficiently maintain materialized views as new data comes in.