Hello, I’m Rich
Before, there were many fans who just got into the hole of Java, and wanted to systematically learn the related technology of database and table, but I have not made up my mind to do it. Now, I catch up with the company’s project of using Sharding-JDBC to transform the existing MySQL architecture into database and table, so I take this opportunity to publish a series of articles about the implementation of database and table. It is also a summary of my architecture learning.
I am on the net land and land also saw a few have about the article of cent library cent table, can discover the data of homogenization on the net is a bit much, and knowledge point compares fragmentary again, still have no detailed actual combat case. In order to further study, I bought some paid courses on some platforms. After watching several courses, I found that it was ok for people with some experience, but it was quite difficult for beginners to learn.
In order to let the novice also can understand, some knowledge points I may use more space to describe, I hope you do not think I am wordy, and so on the end of the sub-database sub-table series of articles, I will make it into a PDF document open source out, can help calculate a! If you find any mistakes or inaccuracies in the article, we welcome you to exchange and correct them.
Specific practice sub – library sub – table before wordy a few words, back to review the basic concept of sub – library sub – table.
What is sub-database sub-table
In fact, branch library and branch table are two concepts, but usually the operation of branch library and branch table will be carried out at the same time, so that we used to call them together as branch library and branch table.
Dividing database into tables is to solve the problem that database performance continues to decline due to the large amount of database and table data. According to certain rules, the original large amount of data database is divided into multiple separate databases, the original large amount of data table is divided into several data tables, so that the single library, the performance of the table to achieve the optimal effect (response speed), so as to improve the overall database performance.
How to divide database and table
The core idea of database and table is to Sharding data, and how to quickly locate data and integrate query results after Sharding. Both sub-database and sub-table can be divided into two latitudes: vertical (vertical) and horizontal (horizontal).
Let’s take the order-related business as an example to see how to do vertical and horizontal shards of libraries and tables.
Vertical segmentation
Vertical segmentation has vertical sub-library and vertical sub-table.
1. Vertical branch library
Vertical branch library is relatively easy to understand, the core concept of four words: dedicated library dedicated.
Tables are classified by business type, and corresponding tables such as orders, payments, coupons, points, etc. are placed in the corresponding database. Developers cannot directly connect to other business databases across libraries. If they want other business data, the corresponding business can provide API interfaces, which is the initial form of microservices.
The vertical division depends largely on the division of business, but sometimes the division of business is not so clear, for example: the division of order data should take into account the association between other businesses, rather than directly put the table related to the order in a database so simple.
To a certain extent, vertical partitioning seems to improve some database performance, but in fact, it does not solve the performance problem caused by the large amount of single table data, so it needs to cooperate with horizontal partitioning to solve the problem.
2, vertical table
Vertical table is based on the data table column (field) as the segmentation, is a large table divided into small table mode.
Such as: Table work_EXTEND (work_EXTEND, work_extend); table work_extend (work_extend, work_extend); Then split the table into different libraries.
We know that the database loads data into memory in behavioral units, so after splitting the core table, most of the fields are accessed frequently, and the field length is short, so we can load more data into memory, to increase the hit ratio of the query, reduce disk IO, in order to improve the database performance.
Advantages of vertical segmentation:
-
Services are decoupled from each other. Data of different services is maintained, monitored, and expanded independently.
-
In high concurrency scenarios, the database pressure is relieved to a certain extent.
Disadvantages of vertical segmentation:
-
The complexity of development is improved. Due to the isolation of services, many tables cannot be accessed directly. Data must be aggregated through interfaces.
-
Distributed transaction management becomes more difficult.
-
The problem of large amount of single table data still exists in the database, which has not been fundamentally solved and needs to cooperate with horizontal segmentation.
The level of segmentation
As mentioned above, vertical sharding still has the problem of large amount of single library and table data. When our application is no longer able to perform fine-grained vertical sharding, there are still single library read/write and storage performance bottlenecks. At this time, horizontal sharding should be combined with horizontal sharding, which can greatly improve database performance.
1. Horizontal branch library
Horizontal branch is to split the same table into different databases according to certain rules, and each library can be located on different servers to achieve horizontal expansion, which is a common way to improve database performance.
This solution often solves single inventory and performance bottlenecks, but because the same table is allocated to different databases, data access requires additional routing work, which increases the complexity of the system.
For example, in the following figure, there are three databases of order DB_1, order DB_1 and order DB_3 with identical table ORDER. When accessing an order, we can specify which database the order should be operated in by modulating the order number of the order, order number mod 3 (number of database instances).
2. Level table
Horizontal partition table is in the same database, a large amount of data table according to certain rules, cut into multiple identical structure of the table, and each table only stores part of the original table data.
For example, an order table with 9 million data can be split horizontally into three tables, ORDER_1, ORDER_2, ORDER_3, each holding 3 million data, and so on.
Although the horizontal table split table, but the sub-table is still in the same database instance, only to solve the problem of large amount of data in a single table, and did not split the table dispersed to different machines, but also competing with the same physical machine CPU, memory, network IO, etc.. To further improve performance, split tables need to be distributed across different databases.
Advantages of horizontal segmentation:
-
Solve the problem of large amount of single database data in high concurrency, improve system stability and load capacity.
-
The workload of business system transformation is not very large.
Disadvantages of horizontal segmentation:
-
Transaction consistency across shards is difficult to guarantee.
-
Cross-library join associated query performance is poor.
-
The difficulty of expansion and maintenance is large, (split into thousands of sub-tables to think of terror).
What is the rule of certainty
We’ve talked a lot about certain rules, which are essentially routing algorithms that determine which tables in which databases a piece of data should be stored in.
The common ones are the mode-taking algorithm and the range limiting algorithm
1. Modulus taking algorithm
Modulo by field (taking the remainder of the hash result (hash() mod N, where N is the number of database instances or child tables) is the most common method of sharding.
Take the order table as an example. First, number the database from 0 to n-1, and mold the order number field of WORK_NO in the ORDER table to get the remainder I. I =0 is stored in the first library, I =1 is stored in the second library, and I =2 is stored in the third library…. And so on.
In this way, the data of the same order will be stored in the same library and table. The same rules are used in the query, and the work_NO order number is used as the query condition, so that the data can be quickly located.
Advantages:
- Data fragmentation is relatively uniform, and it is not easy for requests to hit the same library.
Disadvantages:
- There are some problems with this algorithm. When a machine goes down, the requests that should have been placed in the database cannot be properly processed, and the crashed instances are kicked out of the cluster. The algorithm changes to Hash (userId) mod N-1, and the user information may not be in the same database anymore.
2. Scope limiting algorithm
Split by time interval or ID interval, for example: We can define only 10000 items of data in the User table of each database. The first database only stores the data with userId from 1 to 9999. The second database only stores the data with userId from 10000 to 20000. The third inventory userids are 20001-30000…… And so on over time.
Advantages:
-
The amount of single table data is controllable
-
Horizontal scaling is as simple as adding nodes without migrating data from other shards
-
Can quickly locate the data to query in which library
Disadvantages:
- Continuous sharding may have data hotspots, for example, sharding by time field. In a certain period of time, orders may increase rapidly and data may be read and written frequently. However, historical data stored in some sharding is rarely queried.
The difficulty of sub – library sub – table
1. Distributed transactions
Since tables are distributed in different libraries, cross-library transaction issues are inevitable. The “three-phase commit” and “two-phase commit” processes are commonly used, but this approach has poor performance and requires a lot of code development. The usual approach is to achieve the final consistency of the scheme, if not demanding real-time consistency of the system, as long as the final consistency in the allowed time period can be achieved, using transaction compensation.
Here I apply Ali’s distributed transaction framework Seata to do the management of distributed transactions, which will be combined with actual cases later.
2, paging, sorting, cross-library joint query
Paging, sorting and joint query are frequently used in the development of the functions, but after the database and table, these seemingly common operations are a very headache. Query the data of tables scattered in different libraries, and then summarize all the results and provide them to users.
3. Distributed primary keys
Since we cannot rely on the auto-added primary key on a single database instance to achieve globally unique primary key between different databases, a system that can generate globally unique ID is very necessary, so this globally unique ID is called distributed ID.
4, read and write separation
It is not difficult to find that most of the mainstream relational databases provide master-slave architecture of high availability solutions, and we need to achieve read and write separation + separate database and table, read and write library to do separate database and table processing, there will be specific practical cases.
5. Data desensitization
Data desensitization refers to the data conversion of some sensitive information through desensitization rules, so as to achieve reliable protection of sensitive privacy data, such as id number, mobile phone number, card number, account password and other personal information, generally these need to do desensitization processing.
Subdatabase and subtable tools
I still say that we should try not to build our own wheels, because the wheels made by ourselves may not be so round. There are many mature sub-database and sub-table middleware in the industry, so we choose according to our own business needs and pay more attention to business implementation.
sharding-jdbc
(dangdang)TSharding
(Mushroom Street)Atlas
(Qihoo 360)Cobar
(Alibaba)MyCAT
(Based on Cobar)Oceanus
(58.com)Vitess
(Google)
Why sharding-JDBC
Sharding-jdbc is a lightweight Java framework that provides services in the form of JAR packages. It is a client product that does not require additional deployment. It is equivalent to an enhanced VERSION of JDBC driver. In contrast, server-side products such as Mycat, which require separate deployment services, are slightly more complex. Besides, I want to focus more on the implementation of the business and don’t want to do extra operations.
sharding-jdbc
Compatibility is also very powerful and applies to any basedJDBC
的ORM
Frameworks such as:JPA
.Hibernate
.Mybatis
.Spring JDBC Template
Or directly usedJDBC
.- Perfect compatibility with any third party database connection pool, such as:
DBCP
.C3P0
.BoneCP
.Druid
.HikariCP
Etc., almost all relational databases are supported.
It is not difficult to find that it is indeed a relatively powerful tool, and it is very small on the project, almost do not need to do any code layer modification, also do not need to modify SQL statements, just configure the database table to be divided into tables.
conclusion
A brief review of the basic knowledge of the sub-database sub-table, the following article will introduce the sharding- JDBC in the sub-database sub-table with the actual project.
I sorted out hundreds of technical e-books and gave them to my friends. Pay attention to the public number reply [666] to get yourself. I set up a technology exchange group with some friends to discuss technology and share technical information, aiming to learn and progress together. If you are interested, please join us!