This article is based on dbaplus community online sharing No. 167.

 




In 2018, an interesting thing happened in the database world: Microsoft added support for Spark and Hadoop in the SQL Server 2019 preview.

Reference links:

https://cloudblogs.microsoft.com/sqlserver/2018/09/24/sql-server-2019-preview-combines-sql-server-and-apache-spark-to-create-a-unified-data-platform/

 

We know SQL Server is a technically and commercially successful product, so it is surprising that Microsoft has chosen to embrace Spark’s big data ecosystem this time. Domestic products are not lagging behind either: Alibaba Cloud DRDS and Tencent Cloud TDSQL have also launched their own Spark integrations.

 

Today we are going to talk about how Spark can bring new value to databases, that age-old topic.

 

1. The deficiencies of traditional databases

 

Needless to say, MySQL is the most widely used database in Internet companies. But MySQL focuses on OLTP capabilities and is not good at complex analytical queries. Why is that?

 

This comes down to MySQL’s overall system design: from the very beginning, each request is handled by a single thread. The rationale is that OLTP queries are mostly simple ones; SELECTs are largely point lookups, which a single thread can easily handle.

 

In later improvements, MySQL added thread pools, high and low priorities, and so on, but none of this changed the essence: one query per thread.

 

 

With such an architecture, even upgrading the hardware does not significantly improve OLAP query performance, because a single query cannot take advantage of multi-core parallelism.

 

There are many distributed solutions built on top of MySQL that support horizontal sharding, distributing data evenly across multiple nodes to achieve scale-out. Taking Alibaba Cloud DRDS (a distributed relational database) as an example, DRDS supports distributed transactions, smooth scaling, read/write separation, permission management and other features. The architecture is shown in the following figure:

 

By choosing the sharding key properly, this architecture can handle queries such as aggregations and joins reasonably well. However, when the data volume is large, or for ad-hoc queries that are hard to index, this architecture is still inadequate.

 

2. Solving the OLAP problem

 

There are two general approaches to OLAP. The most common one is to build a data warehouse by copying data out of the business database. Most data warehouses adopt a distributed MPP (Massively Parallel Processing) architecture, which uses multiple nodes to speed up complex queries. Data warehouses mostly use a shared-nothing architecture, with the data partitioned and stored across the nodes:

 

 

The above idea is sometimes referred to as ROLAP. An alternative is to go beyond the query itself: MOLAP speeds up queries by pre-modeling the data. For example, we typically take the time dimension, add several business dimensions, and pre-aggregate their various combinations.

 

 

Here’s an example:

 

 

From the three dimensions (year, item and city), eight combinations are enumerated, and the aggregates for all eight are computed in advance; this is sometimes vividly called a Data Cube.

If a user query matches the pre-built model, it only needs to compute on top of the pre-aggregated results, which obviously reduces the cost of the query. A good example of MOLAP is Apache Kylin.
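To make the pre-aggregation idea concrete, here is a minimal sketch in Spark (Scala, run in spark-shell, where `spark` is predefined); the sales table and its columns are made up for illustration:

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

// Hypothetical fact table: one row per sale.
val sales = Seq(
  (2018, "phone",  "Hangzhou", 1200.0),
  (2018, "laptop", "Beijing",  8000.0),
  (2019, "phone",  "Beijing",  1500.0)
).toDF("year", "item", "city", "amount")

// CUBE enumerates all 2^3 = 8 combinations of the three dimensions
// and computes the aggregate for each one in a single pass, i.e. a Data Cube.
val cube = sales
  .cube($"year", $"item", $"city")
  .agg(sum($"amount").as("total"))

cube.show()
```

A system like Kylin would persist such pre-aggregated results so that a matching query only needs to read them back.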

 

But clearly, the queries MOLAP can answer are strictly limited by how the data was modeled, so the bar for using it is much higher. Overall, it is not as widely used as ROLAP.

 

3. Spark and Spark SQL

 

Spark is the most popular big data framework. It offers both programmatic and declarative (SQL) interfaces, and supports both batch and streaming computation. It is written in Scala and provides programming interfaces for Java, Scala, Python and other languages.

 

Although we often refer to Spark as part of the Hadoop ecosystem, Spark does not strictly depend on Hadoop: Spark itself contains no storage layer, and it merely happens to be deployed together with HDFS most of the time.

 

Spark’s design is more advanced than MapReduce’s. Why is that?

 

To review the MapReduce model: MapReduce breaks a complex task into multiple MR stages; each stage writes its intermediate results to HDFS, which are then read by the next stage. The point of persisting intermediate results is worker-level fault tolerance, but the cost is too high.

 

The Spark RDD model adopts a different fault-tolerance scheme.

 

Spark assumes that all computations are deterministic, so a lost partition can be recovered by recomputation. This avoids writing to HDFS and keeps data in memory, which greatly improves performance.

 

RDD stands for Resilient Distributed Dataset. “Distributed” means each RDD consists of multiple partitions spread across multiple nodes. “Resilient” means each partition can easily be recovered: as long as the partitions it depends on still exist, it can be recomputed.
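As a minimal sketch of these two properties (run in spark-shell, where `sc` is predefined; the numbers are arbitrary):

```scala
// "Distributed": the dataset is split into 4 partitions spread across the cluster.
val numbers = sc.parallelize(1 to 1000, 4)

// "Resilient": squared is defined only by its lineage (parallelize -> map).
// If a partition is lost, Spark recomputes just that partition from its parent,
// because the map function is deterministic.
val squared = numbers.map(x => x.toLong * x)

// Print the lineage (dependency chain) Spark would use for recovery.
println(squared.toDebugString)

println(squared.sum())
```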

 

In a Spark execution plan, operators such as filter and map can be executed in pipeline mode within a partition. Other operators, such as Join, usually need to shuffle the data on both sides by the join key first, so the Join must wait for all of the upstream data to be produced before it can proceed. Such a point is called a pipeline-breaker. Any execution plan inevitably contains several pipeline-breakers, which is why it is split into multiple stages, similar to MR.
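A small sketch of this behavior (again in spark-shell; the data is made up): the filter and map below are pipelined within each partition, while the join shuffles both sides by key and therefore acts as a pipeline-breaker that starts new stages.

```scala
// (userId, amount) and (userId, name) pair RDDs.
val orders = sc.parallelize(Seq((1, 100.0), (2, 250.0), (3, 80.0)))
val users  = sc.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Carol")))

// Narrow dependencies: filter and map run back-to-back inside each partition.
val bigOrders = orders
  .filter { case (_, amount) => amount > 90.0 }
  .map    { case (userId, amount) => (userId, amount * 1.1) }

// Wide dependency: both sides are shuffled by the join key first, so this is
// a pipeline-breaker and the plan is split into multiple stages, as in MR.
val joined = bigOrders.join(users)

joined.collect().foreach(println)
```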

 

When the query executes, stages run according to their dependencies. In the figure above, for example, Stage 1 and Stage 2 (which can run in parallel) execute first, followed by Stage 3. Within each stage, execution proceeds in pipeline mode.

 

Spark SQL is a further layer of encapsulation on top of the Spark RDD API. The following is an example using the native API:
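This is only a minimal sketch of such a native (DataFrame) query; the access-log table and its columns are hypothetical, and the snippet assumes spark-shell, where `spark` is predefined.

```scala
import spark.implicits._

// Hypothetical access-log data.
val logs = Seq(
  ("2019-01-01", "/index",  200),
  ("2019-01-01", "/search", 500),
  ("2019-01-02", "/index",  200)
).toDF("day", "url", "status")

// Native API: the query is expressed as method calls on the DataFrame.
val errorsPerDay = logs
  .filter($"status" >= 500)
  .groupBy($"day")
  .count()

errorsPerDay.show()
```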

 

 

You can also write SQL directly; the only difference between the two is the extra Parser layer.
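For comparison, here is the same hypothetical query written as SQL against a temporary view, reusing the `logs` DataFrame from the sketch above; the extra work is only the Parser step in front of the same optimizer and execution engine.

```scala
// Register the DataFrame from the previous sketch as a temporary view.
logs.createOrReplaceTempView("logs")

// Same query expressed in SQL; it is parsed and then goes through
// the same optimizer, producing an equivalent plan.
val errorsPerDaySql = spark.sql(
  "SELECT day, COUNT(*) AS cnt FROM logs WHERE status >= 500 GROUP BY day")

errorsPerDaySql.show()
```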

4. Why Spark?

 

As mentioned earlier, an architecture like MySQL’s is not enough to meet our need for complex queries. Looking past the surface to the essence, let’s take a fresh look at how the various DBMSes and big data frameworks execute queries:

 

  • Standalone execution: represented by MySQL; obviously not enough;

  • Single-machine parallelism (SMP): represented by PostgreSQL; it lacks scale-out capability, so it can also be excluded;

  • Distributed parallelism (MPP): represented by data warehouses and the various big data frameworks.

 

Based on whether they provide worker-level fault tolerance, distributed parallel systems can be further divided into two types: MPP and batch processing. The performance difference between the two is small; the choice depends more on other aspects, such as whether the business needs ad-hoc queries, how complete the ecosystem is, and so on.

 

 

As you can see from the table above, Spark SQL meets our requirements. As for why we did not choose Hive, there were various technical and non-technical considerations that are not absolute, and I will not start a flame war over it here.

 

5. DRDS HTAP practice

 

The Alibaba Cloud DRDS read-only instance introduces the Fireworks engine, our customized distributed MPP engine built on Apache Spark.

 

At the interface level, the computation is transparent to the user: based on the SQL the user submits, the execution engine automatically decides whether distributed execution is needed. If it is, the execution plan is sent to the Fireworks cluster, and the user simply waits for the result.

 

Instead of introducing new storage, DRDS HTAP leverages the database’s own standby replicas. Unlike ETL-style data pipelines, binlog synchronization between the master and standby databases is very fast, usually at the millisecond level, which is enough to meet the real-time requirements of most queries. Reading from the standby also ensures that no extra load is placed on the master, so online services are not affected.

 

 

Q: If the optimizer layer receives a complex query and needs to forward it to Spark for computation, is it translated into a Spark logical plan? How do you connect to Spark?

 

A: We currently use the Spark SQL interface directly, so there is no need to hack Spark internals. One consequence is that Spark’s optimizer will optimize the plan again; however, our execution plans are quite specific, and Spark optimizes them as expected.

 

 

https://m.qlchat.com/topic/details-listening?topicId=2000002309168191&tracePage=liveCenter