preface
In the high concurrency system, sub-database sub-table is one of the indispensable technical means, but also BAT and other large factories interview, often take the popular test questions.
Do you know why we have to do the inventory table?
The problem starts with two lines: vertical and horizontal.
1 Vertical Direction
The vertical direction is mainly aimed at the business, the following talk about the development of business and sub-database sub-table has any relationship.
1.1 single library
In the early stage of the system, the business functions are relatively simple and the system modules are few.
In order to quickly meet the requirements of iteration, reduce some unnecessary dependencies. More importantly, to reduce the complexity of the system and ensure the speed of development, we usually use a single library to store data.
The initial database architecture of the system is as follows:In this case, the database scheme is as follows: a database contains multiple service tables. User read data request and write data request, are operating on the same database.
Table 1.2
After the system went online, new functions were added continuously with the development of business. As a result, there are more and more fields in a single table, and it’s starting to get a little harder to maintain.
A user table contains dozens or even hundreds of fields, which can be confusing to manage.
What to do at this point?
Answer: separate table.
Split the user table into basic user information table and user extension table.
The basic user information table stores the most important user information, such as user name, password, alias, mobile phone number, email, age, gender and other core data.
This information is closely related to the user, the frequency of the query is very high.
The user extension table stores the user’s extension information, such as: the employer, the location of the account, the city, etc., which is not core data.
This information needs to be queried only in specific business scenarios, and most business scenarios do not.
Therefore, by dividing the core data and non-core data, the structure of the table is clearer, the responsibility is more single, and it is easier to maintain.
In addition to dividing the table according to the actual business, we also have a common principle of dividing the table: put the high frequency of calls in one table, and the low frequency of calls in another table.
A classic example is the order table and the order details table.
1.3 depots
Unconsciously, the system has been online for more than a year. After N iterations of requirements development, the functionality is complete.
The system function is perfect, means the system all sorts of correlation relation, is intricate.
At this point, if you do not quickly comb the business logic, the back will bring a lot of hidden problems, will pit their own death.
So you have to separate areas by business function. Put tables from the same domain in the same database, and tables from different domains in a different database.
The specific separation process is as follows:
The user, product, logistics, order related tables, from the original database, split into separate user database, product database, logistics database and order database, a total of four databases.
I’ve drawn only one table for each library here to make it more intuitive. The actual scene might have multiple tables.
This split by domain allows each domain to focus only on its related tables, making it more accountable and suddenly more maintainable.
1.4 Database and table
Sometimes according to the business, only separate library, or only separate table is not enough. For example, some financial systems need to aggregate all users’ funds by month and year.
This needs to do: sub – library sub – table.
There is a separate database every year, and in each database, there are 12 tables, and each table stores a month’s worth of user money data.
In this way, after the database and table, it can be very efficient to query a user’s monthly or annual funds.
In addition, there are some special requirements, such as the need to separate the database by region, such as: central China, North China, South China, etc., each region has a separate database.
Even some game platforms, according to the access to the game manufacturers to do sub-library sub-table.
2 Horizontal direction
Water direction is mainly aimed at the data, the following talk about data and sub-database sub-table and what is the relationship.
2.1 single library
At the beginning of the system, the amount of concurrency is small because there are very few users. And the amount of data that exists in the table is very small.
The database schema is as follows:At this point, the same database scheme is used: a master database contains multiple business tables.
Both user read and write requests operate on the same database. This scheme is suitable for business scenarios with low concurrency.
2.2 Primary/Secondary Read/Write Separation
After the system went live for a while, the number of users increased.
In this case, you will find that most of the user’s requests are read data requests, but only a small proportion are actually written data requests.
As we all know, database connections are limited and are a valuable resource. Each database read or write request requires at least one database connection.
If the database connection required by the data write request is occupied by the data read request, the data cannot be written.
This is a serious problem.
To solve this problem, we need to separate the reading and writing libraries.
Hence the master-slave read-write separation architecture:Considering that the number of users is not that large at the beginning, we choose a master and slave architecture, which is often said to be one master and one slave.
All write requests are directed to the master library. Data is asynchronously synchronized to the slave library as soon as it is written to the master library. In this way, all read requests can be retrieved from the library in a timely manner (unless there is network latency).
The read/write separation scheme can solve the single-node problem mentioned above, and can guarantee the stability of the system better than the single-library scheme.
Because if the primary fails, you can upgrade the secondary to the primary, redirect all read and write requests to the new primary, and the system works again.
Read/write separation scheme is actually a branch library, which has become the first scheme in the early stage of the system compared with backing up data.
But there is a problem here: if the user is really a bit large, if the master dies, upgrade the slave to master, and redirect all read and write requests to the new master.
But what if the new master simply can’t handle all the read and write requests?
This requires a master and slave architecture:If the master dies, you can choose to upgrade from library 1 or library 2 to the new master. If we upgrade from library 1 to the new master here, the original slave from library 2 becomes the slave of the new master.
The architecture diagram after adjustment is as follows:
That would solve the problem above.
In addition, if the volume of query requests increases, we can also upgrade the architecture to one master, three slaves, one master, four slaves… One principal N sub.
2.3 depots
The above read/write separation scheme can solve the problem that the master node cannot handle when the number of read requests exceeds the number of write requests. But if there is a domain, for example: user libraries. If the volume of requests from registered users is very high, that is, the volume of write requests itself is very high, and a master library cannot handle such a large amount of pressure.
What should I do?
A: Create multiple user libraries.
The user library split process is as follows:
In this case, I have split the user library into three libraries (this is not necessarily the case in real life), and the table structure of each library is exactly the same, except that the data stored is different.
Table 2.4
User requests come up, which is bound to increase the cost of data volume. Even if it is done separately, it is possible for a single library, such as the user library, to have 50 million data.
As a rule of thumb, the amount of data in a single table should be controlled within 10 million to achieve optimal performance. If there are tens of millions of levels of data, with a single table to store, the performance will become very poor.
If the amount of data is too large, the index to be built will be too large. Retrieving data from small to large will be time-consuming and consume CPU resources.
What should I do?
A: Separate tables, so you can control the amount of data in each table, and index size.
Table split process is as follows:
Here I split the user table in the user library into four tables (this is not necessarily the case in real life). The table structure of each table is exactly the same, but the data stored is different.
If the amount of user data becomes larger and larger in the future, only a few more user tables can be divided.
2.5 Sub-database sub-table
When the system develops to a certain stage, the amount of concurrent users is large, and the amount of data needs to be stored is also large. What should I do?
Answer: need to do sub-database sub-table.
As shown below:
The figure splits the user library into three libraries, each containing four user tables.
If there is a user request, route to one of the user libraries according to the user ID, and then locate to a table.
There are many routing algorithms:
- For example, if ID =7 and there are 4 tables, 7%4=3 and the module is 3, route to user table 3.
- For example, if the value of ID is 0-100,000, the data is stored in user table 0; if the value of ID is 100,000-200,000, the data is stored in user table 1.
- Consistent hash algorithm
There will be a later article devoted to these routing algorithms.
3 Real Cases
Next, without further ado, I will share with you the experience of three sub-database sub-table projects that I have participated in, to give a reference to friends in need.
3.1 depots
I once worked in a company. Our team was engaged in game operation. Our company provided a platform for game manufacturers to access our platform and promote their games.
The game player can log in through our platform and jump to the designated game page of the game manufacturer after success. Then the player can play the game normally and recharge the game currency.
It is necessary to establish the mapping relationship between our account system and the account of the game manufacturer. After logging in the game account of our platform, the game player can successfully convert it into the account of the game manufacturer’s own platform.
There are two questions:
- Each game vendor may have different access methods and account system mappings.
- Users log in from our platform, and then jump to the game manufacturer’s game page. At that time, there were N game manufacturers to access, more active game players, the concurrent amount of login interface can not be underestimated.
In order to solve these two problems, we adopted a plan: separate libraries. A separate database is built for each game, and the table structure in the database allows for differences.
We didn’t break it down further because at the time we were considering the number of users for each game, it wasn’t too big. It’s not like a phenomenon like King of Glory, with hundreds of millions of players.
One key aspect is that the login interface needs to pass in the game ID field so that the system knows which library to operate on because the library name contains the game ID.
Table 3.2
In the same game platform company, we have another business: golden Diamond membership.
To put it bluntly, it has created a set of membership system related to the game. In order to keep users active, the opened members have many benefits, such as: game currency, discount for recharge, points exchange, lottery, exclusive customer service and so on.
In this membership system, there is a very important function is: points.
There are many ways for users to earn points, such as: check-in, recharge, play games, draw prizes, promote, participate in events and so on.
What are the integrals for?
- Return or exchange in-kind gifts
- Exchange game currency
- Lucky draw
Having said so much, it is just to say that a user may acquire or consume points many times a day, so a user may generate dozens of records a day.
If you have a lot of users, the amount of data associated with points is actually quite amazing.
We considered that the volume of data in the horizontal direction might be large, but the amount of user concurrency was not large, as was the case with the login interface.
So the scheme adopted is: sub-table.
It was enough to use a points database, but there were 128 tables. Then divide the hash by 128 based on the user ID.
Special attention should be paid to the fact that the number of sub-tables should be the power of 2 to facilitate expansion in the future.
3.3 Sub-database sub-table
Then I went to work for a company that developed catering software. One of the characteristics of this company is that there is a large number of concurrent users during the peak hours of lunch and dinner every day.
Users need to order food through our system before eating, and then place an order, and then check out. There was a lot of simultaneous ordering and ordering.
There may be many people in the restaurant, and each person may place multiple orders. This results in high concurrency for users and a large amount of data.
Therefore, after comprehensive consideration, the technical scheme we adopted at that time is: sub-database sub-table.
After investigation, I felt that I used dangdang open source middleware framework based on JDBC: Sharding – JDBC.
There were four libraries, each with 32 tables.
4 summarizes
Above mainly from: vertical and horizontal, two directions introduced why our system should be divided into database and table.
To be honest, the vertical direction (i.e. the business direction) is easier.
In the horizontal direction (that is, the direction of data), the role of sub-database and sub-table, in fact, is different, can not be confused.
- Branch library: to solve the problem of insufficient database connection resources, and disk IO performance bottleneck.
- Split table: To solve the problem that the amount of data in a single table is too large. When SQL statements query data, it is time-consuming even if the index is used. In addition, it can solve the problem of consuming CPU resources.
- Database and table division: Solves problems such as insufficient database connection resources, disk I/O performance bottleneck, data retrieval time, and CPU resource consumption.
In some service scenarios where a large number of concurrent users need to save a small amount of data, you can separate databases instead of tables.
In some business scenarios where the number of concurrent users is small but the number of stores to be saved is large, only tables can be separated from libraries.
In some service scenarios where a large number of concurrent users need to be saved, you can divide databases into tables.
Well, that’s all for today.
Isn’t that a little too much?
It doesn’t matter, in fact, there are a lot of content related to the sub-database sub-table, this article as the first play of the sub-database sub-table series, as an appetizer, to share with you.
Add a few questions at the end of the article:
- What are the specific implementation schemes of sub-database sub-table?
- How to smoothly expand the database and table?
- What are the problems brought by the separate database and table?
- How to implement the function of sub-database and sub-table in the project?