Preface

In short: when just starting out, use MySQL. With ample funds and moderate business volume, use Oracle for core business and a cloud database for non-core business. With limited funds and large business volume, use MySQL plus an open source middle layer. At very large business volume, in-house development is the only option.

Big data systems and relational databases serve different purposes. Big data systems are mainly used to store other types of data: orders, transactions and similar data generally do not go into big data stores, which are better suited to logs, browsing records and the like.

Best Practices – Technology selection

1. Jingdong // has become an Apache project

2. Mycat

Dangdang, Sharding-JDBC

Supports

1. Database and table sharding. Early versions supported only database sharding; table sharding was added later.

2. Read/write separation. Not supported in early versions; supported since 1.5.


Implementation principle: it works at the JDBC layer and ships as a lightweight JAR that is deployed together with the application code.


Code: baijifeilong.github.io/2018/11/26/…


Reference: juejin.cn/post/684490… www.infoq.cn/article/201…

Jingdong, Apache ShardingSphere

Supports

1. Database and table sharding

2. Read/write separation


The implementation has two parts:

1. One part provides the same functionality as Dangdang's Sharding-JDBC.

2. The other part is proxy middleware.


Jingdong Practice www.infoq.cn/article/1Qv…


1. Initially, the project was led by its owner at Dangdang. 2. In the end, it was open sourced by JD.

zhuanlan.zhihu.com/p/50335657


Reference: shardingsphere.apache.org/document/cu… // Official Chinese documentation

github.com/apache/incu… // GitHub repository

Mycat

Implementation principle: proxy middleware

Note: an open source project maintained by individuals.


Based on Cobar, which was open sourced by Alibaba.

Sharding

Usage steps

1. Call the open source library and set the shard explicitly, in pseudocode: setShard(3); // Suppose there are two nodes: a remainder of 0 means node 0, a remainder of 1 means node 1.

2. 3 % 2 = 1, so the data goes to node 1.

The above sets the node explicitly, that is, it hardcodes it. To route to a node automatically, use the configuration file approach.

<sharding:inline-strategy id="databaseStrategy" sharding-column="user_id" algorithm-expression="ds$->{user_id % 2}"/>

// Two nodes in total, using the remainder algorithm: 1. A remainder of 0 means node 0. 2. A remainder of 1 means node 1.

// A row is read from the same node it was written to.

// The same holds when sharding by a data range or a date range (which is essentially a data range): the node chosen at write time is the node used at read time.
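The remainder rule in the configuration above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Sharding-JDBC is implemented; the data source names `ds0` and `ds1` and the function name `route_by_modulo` are assumptions made for this example.

```python
# Hypothetical data source names, matching what "ds$->{user_id % 2}" would produce.
DATA_SOURCES = ["ds0", "ds1"]


def route_by_modulo(user_id: int) -> str:
    """Pick a data source from the remainder of user_id divided by the node count."""
    return DATA_SOURCES[user_id % len(DATA_SOURCES)]


# The same user_id always maps to the same node, for writes and for reads.
print(route_by_modulo(3))
```

Because the function depends only on `user_id`, a write and a later read for the same user land on the same node.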

How is a table uniquely located? By the combination of database name and table name.

The same table can have the same name or different names in different databases.

Routing algorithm

By range

Range routing suits data that grows sequentially, for example: 1. Integer data such as userId. 2. Time-type data such as a date field.


If a single database holds 10 million rows, then every 10 million IDs map to one database.
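A minimal sketch of range routing under that rule, assuming integer IDs that start at 0 and grow sequentially. The names `ROWS_PER_DATABASE` and `route_by_range` are hypothetical.

```python
ROWS_PER_DATABASE = 10_000_000  # capacity of a single database, as in the text


def route_by_range(user_id: int) -> int:
    """Return the index of the database that holds this sequential ID."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # IDs 0..9,999,999 go to database 0, the next 10 million to database 1, and so on.
    return user_id // ROWS_PER_DATABASE
```

New databases can be appended as the IDs grow, without moving existing rows. The trade-off is that the newest database tends to receive most of the traffic.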


Querying by the sharding dimension field

1. How do we know which database holds the currently logged-in userId? The last four digits of the userId are used for sharding.

2. How do we know which database holds a given orderId? The orderId contains the last four digits of the userId, so the shard is known.

3. How do we know which database holds a given time period? Fields other than userId and orderId do not contain the value of the shard field, so the whole-cluster redundancy solution must be used.

4. Merchant ID dimension, agent ID dimension // Same as 3

Hash

Hash the key, then take the remainder


What is the key? A unique identifier field such as userId.
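A minimal sketch of hash-then-remainder routing. It uses `zlib.crc32` as the hash because Python's built-in `hash()` for strings changes between processes; the choice of CRC32, the node count of 4 and the function name are assumptions for illustration only.

```python
import zlib

NODE_COUNT = 4  # hypothetical number of database nodes


def route_by_hash(key: str, node_count: int = NODE_COUNT) -> int:
    """Hash a unique key (for example a userId) and take the remainder."""
    digest = zlib.crc32(key.encode("utf-8"))  # stable across processes and machines
    return digest % node_count
```

The result is always in the range 0 to `node_count - 1`, and the same key always yields the same node.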

Problems after sharding – multi-dimensional queries

Queries by the sharding dimension use the field chosen for sharding. What if you need to query by other fields? This is very common, as in the scenarios below.

Solution – Combine fields

1. Shard field: the last four digits of the user ID.

2. Composition of the orderId field: orderId + the last four digits of the userId. The orderId part is used for indexing, and the last four digits of the userId are used for load balancing, that is, for routing the order to the same database server as its user.

www.jianshu.com/p/df1d9dd1d…

3. How do we know which database a date, merchant ID or agent ID is sharded to? These other dimensions can only be served through whole-cluster redundancy.
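The combined-field idea can be sketched as follows. This is an illustration under assumptions, not the scheme of any particular company: the shard count of 8, the string format of the order ID, and all function names are hypothetical.

```python
SHARD_COUNT = 8     # hypothetical number of shards
SUFFIX_DIGITS = 4   # the last four digits of the userId, as in the text


def user_suffix(user_id: int) -> str:
    """Return the last four digits of the userId, zero-padded."""
    return f"{user_id % 10 ** SUFFIX_DIGITS:0{SUFFIX_DIGITS}d}"


def make_order_id(sequence: int, user_id: int) -> str:
    """Build an order ID by appending the user's suffix to an order sequence number."""
    return f"{sequence}{user_suffix(user_id)}"


def shard_for_user(user_id: int) -> int:
    return int(user_suffix(user_id)) % SHARD_COUNT


def shard_for_order(order_id: str) -> int:
    # The suffix embedded in the order ID is enough to find the shard.
    return int(order_id[-SUFFIX_DIGITS:]) % SHARD_COUNT
```

With this layout, a query by userId and a query by orderId both resolve to the same shard without a lookup table.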

Solution – Entire cluster redundancy

Redundancy // The entire cluster is made redundant, one cluster per dimension // Disadvantages: 1. Data consistency 2. Wasted disk space

Data synchronization // Synchronization is based on the binlog. Data consistency issues are inevitable and have to be weighed. // The redundant copy suits reads and writes without strict real-time requirements


How is this done in practice? Vertical splitting relieved the pressure on the original single cluster, but it was still stretched thin during flash sales. The original order model could no longer meet business needs, so we designed a new unified order model. To serve C-end users, B-end merchants, customer service and operations at the same time, we sharded the data by user ID and by merchant ID respectively, and synced it to an operations database via PUMA, our internally developed real-time MySQL binlog parsing service.

tech.meituan.com/2016/11/18/… // Each additional query dimension requires one more redundant cluster.


Using a table in place of an index: m.blog.itpub.net/29254281/vi…


Cross-database queries: query each database separately, then combine the results.


1. User table: mobile phone number, login query

2. Order table: ID, date, merchant ID, agent ID


Routing / load balancing: ID or key (unique identifier) -> hashCode -> remainder = machine node

The core steps are exactly the same!


Reference: developer.51cto.com/art/201812/… tech.meituan.com/2016/11/18/…

Problems arising from database sharding

SQL query and operation problems

1. JOIN

2. COUNT(*) (row counts)

3. …

The simplest solution is to read the data in several queries and then combine the results for further processing. The advantage is that it is simple to implement; the disadvantage is somewhat lower performance. Such scenarios are uncommon, however, so the simpler the solution, the better.
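The read-several-times-then-combine approach can be sketched with in-memory lists standing in for shards. The sample rows, the `SHARDS` variable and the function names are made up for illustration; in a real system each inner list would be the result of a query against one database.

```python
import heapq

# Hypothetical shards: each holds (order_id, amount) rows already sorted by order_id.
SHARDS = [
    [(1, 30), (4, 10)],
    [(2, 25), (3, 5), (5, 40)],
]


def count_all(shards):
    """Equivalent of COUNT(*): count on every shard, then add up the partial counts."""
    return sum(len(rows) for rows in shards)


def merged_sorted(shards):
    """Equivalent of ORDER BY order_id: merge the per-shard sorted results."""
    return list(heapq.merge(*shards))
```

Each shard does its own part of the work and the application only combines the partial results, which keeps the implementation simple at the cost of extra round trips.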

Cross-database / distributed transaction issues

— Read and write separation —

Background

Most Internet services are read-heavy and write-light, so database reads are usually the first performance bottleneck.

1. Reads // Read/write separation exists to solve the read performance problem. 2. …

Reading and writing data // Read/write separation improves read/write capacity

Storing data // Database and table sharding improves storage capacity, because the amount of data a single machine can store is limited

Implementation approach

The essence is to distinguish writes from reads and then route each to the write machines or the read machines.

Solutions

Client-side implementation

2. Open source software // Existing open source solutions

Server-side implementation

Router is officially recommended. Proxy is the implementation that is officially provided.

Open source software

1. Sharding middleware: the sharding software Sharding-JDBC also includes read/write separation, that is, it parses the SQL to determine whether it is a write or a read.

2. Alibaba

3. 360 (the company)

Use at work

Read/write separation is not used.

Concrete implementation steps

The middleware automatically identifies whether a SQL statement is a write or a read and routes it to the corresponding server node. Once middleware is introduced, this is transparent to the application layer.

Conclusion

Usage steps: 1. Call the open source library's read/write method to set the mode to read or write. 2. Perform the business operation, a read or a write.

The above sets read or write explicitly. What if reads and writes should be recognized automatically? When parsing SQL, the sharding middleware identifies whether a statement is a read or a write from its keywords. In general, these solutions are essentially transparent to the application layer, which makes them very easy to use. Whether you implement them yourself or use open source software, you only need to set the current read/write mode in configuration or code, or rely on fully automatic recognition that needs no configuration at all. With middleware the separation is fully transparent: the middleware recognizes reads and writes automatically while parsing the SQL.
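Keyword-based recognition can be sketched as follows. This is a deliberately naive illustration, not how any particular middleware parses SQL: it only inspects the first keyword, and the keyword set, the node labels `master` and `slave`, and the function names are assumptions.

```python
# Statements starting with these keywords are treated as reads (an assumed, minimal set).
READ_KEYWORDS = {"SELECT", "SHOW"}


def is_read(sql: str) -> bool:
    """Classify a statement as a read based on its first keyword."""
    stripped = sql.strip()
    if not stripped:
        raise ValueError("empty SQL statement")
    first_word = stripped.split(None, 1)[0].upper()
    return first_word in READ_KEYWORDS


def route(sql: str) -> str:
    """Send reads to a slave node and everything else to the master."""
    return "slave" if is_read(sql) else "master"
```

A real parser has to handle more cases than this sketch, for example a `SELECT ... FOR UPDATE` or reads inside a transaction, which usually need to go to the master.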

What if there are multiple slave/read nodes? How do we know which slave node to use? All read nodes hold the same data, so a read can be routed to any of them.
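Since any read node will do, a simple round-robin choice is one way to spread the load. The sketch below assumes round-robin as the policy (the text does not prescribe one), and the class name and node names are hypothetical.

```python
import itertools


class ReplicaPicker:
    """Hand out read nodes in turn; every node holds the same data."""

    def __init__(self, replicas):
        replicas = list(replicas)
        if not replicas:
            raise ValueError("at least one read node is required")
        self._cycle = itertools.cycle(replicas)

    def next_replica(self) -> str:
        return next(self._cycle)


picker = ReplicaPicker(["slave-1", "slave-2"])
first_three = [picker.next_replica() for _ in range(3)]
```

This version is not thread-safe; a shared picker in a multi-threaded application would need a lock around `next_replica`.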

Read/write separation vs. caching solutions

Summary

1. Read/write separation solves the database read performance bottleneck.

2. Horizontal sharding solves the database data volume problem.

3. For front-end, user-facing Internet scenarios with large data volumes, high concurrency, and high availability and consistency requirements, a microservice cache architecture may be more suitable than a database read/write separation architecture. // A cache solves the read-heavy problem better than read/write separation, because: 1. A cache is faster than reading the database. 2. A read/write-separated database is less highly available than a cache. 3. The database connection pool must also distinguish between the read and write databases, which means that any read/write separation solution, whether built in-house or open source, requires the application layer to distinguish between read and write databases in its database-related code.


Dear teacher, my personal idea is to add a cache. Take logging in right after registration as an example: on registration, write to both the database and the cache; on login, check the cache first and then the database table. For instance, store the record in Redis with an expiration time of ten minutes. On login, check Redis first and then the database table. If the record is not in Redis, it has expired, and in that case a database lookup is certain to find it.
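The idea in this comment can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: `TtlCache` is a hypothetical substitute for a Redis client, a plain dict stands in for the database table, and refilling the cache after a database hit is an added assumption that the comment does not mention.

```python
import time


class TtlCache:
    """Tiny in-memory cache with per-entry expiry, standing in for Redis."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def set(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: behave as if the key was never set
            return None
        return value


def register(cache, database, user, record):
    database[user] = record  # write to the database (a dict in this sketch)
    cache.set(user, record)  # and to the cache, so an immediate login finds it


def find_user(cache, database, user):
    record = cache.get(user)  # check the cache first
    if record is None:
        record = database.get(user)  # cache miss or expiry: fall back to the database
        if record is not None:
            cache.set(user, record)  # assumption: refill the cache after a miss
    return record
```

A login that happens within the expiry window is served from the cache, so it does not depend on how quickly the written row becomes readable from the database.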

I agree with you: read/write separation should not be introduced at the first sign of a performance problem. Optimize first, for example by fixing slow queries, adjusting unreasonable business logic, and introducing caching. Only after confirming that the system has no room left for optimization should read/write separation or clustering be considered. Read/write separation is the last option, because it brings the most change and impact, even though it delivers the largest performance gain.

Note: a single machine can support 100,000 users.

Problems with read/write separation

Replication delay // Data read during the delay may be inconsistent

Delay time // Usually more than 1 s; with a large amount of data it can reach 1 minute

Solution // Core business reads and writes go to the master; non-core business uses read/write separation

References

juejin.cn/post/684490…

time.geekbang.org/column/arti… // Contains database architecture articles for reference, which are basically industry best practices

database.51cto.com/art/201801/…