Ali open source distributed sub-database sub-table middleware MyCat from entry to abandonment

1. Query non-fragment fields

Routing results in Mycat are determined using sharding fields and sharding methods. For example, a Mycat sub-library scheme is shown below:

Fragment according to the ID field of tt_waybill table
The sharding method is a module with id value of 3. According to the module value, a shard in DB1, DB2, and DB3 is determined

If there is an ID field in the query criteria, the query will fall to a specific shard. Such as:

mysql>select * from tt_waybill where id = 12330;

At this point, Mycat computes the route result

12330% 3 = 0 — > DB1

And route the request to DB1 for execution. If there is no fragment field condition in the query condition, for example:

mysql>select * from tt_waybill where waybill_no =88661;

At this point, Mycat cannot calculate the route and sends it to all nodes to execute:

DB1 — > select * from tt_waybill where waybill_no =88661; DB2 — > select * from TT_waybill where waybill_no =88661; DB3 — > select * from tt_waybill where waybill_no =88661;

If the fragment field is highly selective, it is also a common query dimension of services. Generally, only one or a very few DB nodes are matched (the result set is returned). There are only 3 DB nodes in the example, but the actual number of DB nodes in the application is much larger than this. If there are 50 DB nodes, then one query in the front end will become 50 queries in the MySQL database, which will greatly consume the resources of Mycat and MySQL database.

If you are designing to use Mycat with non-sharded field queries, consider giving up!

2. Paging sort

Let’s take a look at how Mycat handles paging operations. Suppose we have a Mycat partition scheme: a table with 30 data distributed across 3 shard DB

,1,2,3,4,10,11,12,13,14 DB1: [0] DB2:,6,7,8,9,16,17,18,19 [5] DB3:,21,22,23,24,25,26,27,28,29 [20]

(There is no query condition in the scenario of this example, so all queries are full sharding, so there is no assumed sharding field and sharding method of the table.)

When the application performs the following paging query

mysql>select * from table limit 2;

Mycat distributes the SQL request to each DB node for execution and receives the results from each DB node

DB1: [0,1] DB2: [5,6] DB3: [20,21]

But the result set Mycat returns to the application depends on which DB node returns results to Mycat first. If Mycat receives the DB1 node result set first, Mycat returns the result set of [0,1] to the application, and if Mycat receives the DB2 node result set first, Mycat returns the result set of [5,6] to the application. That is, the same SQL will return different results when executed on Mycat in the same case.

When paging is performed in Mycat, a sorting condition must be added to ensure that the result is correct. Let’s take a look at Mycat’s sorting logic. If you add a sorting condition to the previous paging query (if the column name of the table data is ID)

mysql>select * from table order by id limit 2;

The processing logic of Mycat is shown below:

In the case of sorting conditions, Mycat receives the return results of each DB node and performs minimum heap calculation on them, and calculates the two smallest records [0,1] in all result sets and returns them to the application.

However, when there is an offset in the sort page, the processing logic is different. SQL = “query”;

Mysql >select * from table order by id limit 5,2;

If the sorting paging logic is followed, the result is as follows:

Mycat returns the data [10,11], [16,17] and [20,21] of each DB node to the application after minimal heap calculation. The result set is [10,11]. If limit 5,2 returns the result set [5,6]. If limit 5,2 returns the result set [10,11], the processing logic is incorrect.

So Mycat is a different set of logic when dealing with sort pages with offsets — rewriting SQL. The diagram below:

Mycat rewrites SQL statements with limit m and n to limit 0, m+n to ensure logical correctness of query results. So, the SQL statement Mycat sends to the back-end DB is

Mysql >select * from table order by id limit 0,7;

The result set returned by each DB to Mycat is

,1,2,3,4,10,11 DB1: [0] DB2:,6,7,8,9,16,17 [5] DB3:,21,22,23,24,25,26 [20]

The minimum heap calculation yields the minimum sequence [0,1,2,3,4,5,6], and then returns two results [5,6] with an offset of 5.

Although Mycat returns the correct result, close examination reveals that the processing logic for this type of operation is extremely resource consuming (wasteful). The application requires 2 result sets, and the number of results to be processed in Mycat is 21. In other words, for full shard limit m, n operation with T DB nodes, Mycat needs to process (m+n)* T data. For example, if there are 50 DB nodes and you want to execute limit 1000 and 10, then the amount of data processed by Mycat is 50500, and the result set is 10. When the offset is larger, the consumption of memory and CPU resources increases tens of times.

If you are designing for paging sorting with Mycat, consider abandoning it!

3. JOIN any table

Let’s take a look at the scenario in a JOIN in a single library. Suppose you have two tables player and TEAM in a single library, and the team_id field in the Player table is associated with the ID field in the Team table. The operation scenario is as follows:

The SQL for the JOIN operation is as follows

mysql>select p_name,t_name from player p, team t where p.no = 3 and p.team_id = t.id;

The result can be queried at this point

p_name	t_name
Wade	Heat

If the data of the two tables are separated into databases, the associated data may be distributed on different DB nodes, as shown in the following figure:

This SQL does not result in the individual shard DB, which means that Mycat cannot query the correct result set.

If you want to use Mycat to JOIN the table, make sure that the associated fields of the two tables have the same data distribution, otherwise please consider not!

Distributed transactions

Mycat does not implement XA transactions according to the two-phase commit protocol, but only ensures the consistency of data in the prepare phase. The implementation process is as follows:

Mycat identifies the connection as non-automatic when the application starts a transaction, such as front-end execution

mysql>begin;

Mycat does not immediately send the command to the DB node. When the SQL is delivered later, Mycat obtains non-automatically committed connections from the connection pool and executes them.

Mycat waits for the result of each node. If the operation succeeds, Mycat identifies the connection as Prepare Ready. If the operation fails on one node, Mycat identifies the connection as Rollback.

After the execution is complete, Mycat waits for the front-end to send commit or rollback commands. When sending the commit command, Mycat checks whether the current connection is in the Prepare Ready state. If so, it sends the commit command to each DB node.

However, consistency is not guaranteed at this stage. If one DB node fails at commit time and the other DB nodes succeed, Mycat will wait for the failed DB node to return the result. Only after receiving the successful execution results of all DB nodes, Mycat will return the successful package to the front end. In this case, Mycat can only wait until TIMEOUT, resulting in transaction consistency destruction.

If there is a distributed transaction in Mycat design, we should first see if we need to ensure strong transaction consistency, otherwise please consider giving up!

Stay tuned for more sharing

Ali open source distributed sub-database sub-table middleware MyCat from entry to abandonment

1. Query non-fragment fields

2. Paging sort

3. JOIN any table

Distributed transactions

Related Posts

Java heap, stack, in the constant pool concepts such as parsing | Java Debug notes

JDK11 | article 7: ZGC garbage collector

The underlying principles of concurrent programming and locking are worth a look