Welcome to follow the public account “JAVA Front” for more articles, covering source code analysis, practical application, architecture thinking, workplace sharing, product thinking, and more. You are also welcome to add my personal WeChat “JAVA_front” so we can communicate and learn together.


1. Theoretical knowledge

1.1 Is database and table sharding necessary?

Sharding databases and tables can solve the problem of a single table holding too much data, but it is not the first choice, because it introduces at least three prominent problems that must be solved.

The first is the complexity of the scheme itself. The second is that local transactions stop working: within a single database, business logic can rely on local transactions for strong consistency, but once the data is split across databases those transactions no longer apply. The third is that aggregate queries become difficult: after sharding, query conditions generally must include the shardingKey, so many query scenarios are limited.

In a previous article, “The interviewer asked whether a large single table must be split into databases and tables”, we introduced how to deal with a large single-table data volume: handle the problem in the order of delete, swap, split, disassemble, heterogenize, and heat, rather than sharding from the very beginning.

Delete means deleting and archiving historical data. Swap means moving some data to alternative storage media instead of relying only on database resources. Split means read/write separation: adding read replicas to cope with the read-heavy, write-light Internet workload. Disassemble means splitting the database into different tables to relieve the pressure. Heterogenize means keeping multiple copies of the same data, indexed differently according to different business requirements. Heat refers to handling hot data, a problem that deserves particular attention.


1.2 Two dimensions of data splitting

Suppose you have an e-commerce database that holds three business tables: order, goods, and payment. As the business volume increases, the three tables grow larger and query performance deteriorates significantly. Data splitting can then be performed along two dimensions: vertical and horizontal.


1.2.1 Vertical split

Vertical splitting means splitting the e-commerce database into three databases: an order database, a goods database, and a payment database. The order table goes into the order database, the goods table into the goods database, and the payment table into the payment database. Each database then stores only its own business data, and the physical isolation prevents them from affecting each other.


1.2.2 Horizontal split

After the vertical split, the three databases run smoothly for a while. But as the business grows, the data volume of each single database and single table keeps increasing and gradually reaches a bottleneck.

At this point we need to split the tables horizontally. A horizontal split scatters the data of a single database and single table across multiple databases and multiple tables according to some rule, reducing the pressure on any single database or table.

There are many horizontal sharding strategies. The most important decision is the choice of ShardingKey, that is, which column to shard by, and that depends on how we access the data.


(1) Range sharding

If we choose the order creation time as the ShardingKey, the sharding strategy could be to split into four databases that each store one quarter of data, with each database containing three tables that each store one month of data:
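A minimal sketch of that mapping, assuming the naming scheme order_db_N.t_order_M purely for illustration:

import java.time.LocalDateTime;

public class RangeRouter {
    // Map an order's creation time to its quarter library and monthly table
    public static String route(LocalDateTime createTime) {
        int month = createTime.getMonthValue();  // 1..12
        int dbIndex = (month - 1) / 3;           // which quarter library: 0..3
        int tableIndex = (month - 1) % 3;        // which monthly table inside it: 0..2
        return "order_db_" + dbIndex + ".t_order_" + tableIndex;
    }
}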

The advantage of this scheme is that it is friendly to range queries: to aggregate first-quarter data, for example, we can put the time range directly into the query conditions. The problem is that it tends to produce hot data. On Double 11, for instance, the order volume is particularly large, so the November table will hold far more data than the others and come under heavy access pressure.


(2) Look-up table sharding

The look-up table method determines which shard a ShardingKey is routed to according to a routing table. On each routing, the sharding information is first looked up in the routing table, and then the data is fetched from that shard. Let us analyze a practical case of the look-up table method.

Redis 3.0 introduced Redis Cluster, which brought in the concept of hash slots. A cluster has a fixed number of 16384 slots, which are evenly allocated to the cluster nodes during initialization. The formula that decides which slot a key request lands in is fixed:

SLOT = CRC16(key) mod 16384
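As an illustration, here is a minimal Java sketch of this calculation, assuming the XModem variant of CRC16 (polynomial 0x1021, initial value 0) that Redis Cluster uses; hash tags such as {userId}, which Redis also honors when computing the slot, are omitted:

import java.nio.charset.StandardCharsets;

public class SlotCalculator {
    private static final int SLOT_COUNT = 16384;

    // CRC16-CCITT (XModem): poly 0x1021, init 0x0000, MSB first
    static int crc16(byte[] bytes) {
        int crc = 0x0000;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                if ((crc & 0x8000) != 0) {
                    crc = ((crc << 1) ^ 0x1021) & 0xFFFF;
                } else {
                    crc = (crc << 1) & 0xFFFF;
                }
            }
        }
        return crc;
    }

    public static int slotFor(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }
}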

So how does a client know which Redis node to send a key request to? This is where the look-up table idea comes in:

(1) The client connects to any Redis node, say node A.
(2) Node A calculates the slot value from the key.
(3) Every node maintains a mapping table between slots and nodes.
(4) Node A looks up the mapping table; if the slot is on node A, it serves the request directly.
(5) If the slot is not on node A, node A returns a redirection command telling the client which node holds the data for this key.
(6) The client sends a connection request to the correct node.

The advantage of the look-up table method is that routing policies can be adjusted flexibly: if some shards become hot spots, the routing policy can simply be modified. The disadvantages are that every request spends extra time querying the routing table, and the routing table itself can become a single point of failure.
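To make the idea concrete, here is a minimal sketch of look-up routing in Java. It assumes the routing table is a simple in-memory map (in practice it might live in a configuration service or its own database), and all names are illustrative:

import java.util.HashMap;
import java.util.Map;

public class LookupRouter {
    private static final int SLOT_COUNT = 8;            // fixed number of slots
    private final Map<Integer, String> slotToDb = new HashMap<>();

    public LookupRouter() {
        // initial even assignment; editing these entries re-routes hot slots
        for (int slot = 0; slot < SLOT_COUNT; slot++) {
            slotToDb.put(slot, "order_db_" + (slot % 4));
        }
    }

    public String route(long shardingKey) {
        int slot = (int) (shardingKey % SLOT_COUNT);    // which slot the key falls in
        return slotToDb.get(slot);                      // which database holds that slot
    }
}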


(3) Hash sharding

The more popular sharding method nowadays is hash sharding. Compared with range sharding, hash sharding distributes data evenly across the databases. Suppose we split the order database into 4 databases numbered [0,3], each containing 3 tables numbered [0,2], as shown in the figure below:

We choose orderId as the ShardingKey, so which table will the order with orderId=100 be stored in? Because we shard by both database and table, the first step is to determine which database to route to; the modulo result is the database number:

db_index = 100 % 4 = 0

The second step is to determine which table to route to:

table_index = 100 % 3 = 1

The third step routes the data to table 1 of database 0:
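Putting the steps together, a minimal Java sketch of this two-step modulo routing, where the class and naming scheme are assumptions for illustration:

public class OrderRouter {
    private static final int DB_COUNT = 4;
    private static final int TABLE_COUNT = 3;

    // Returns the physical location of an order, e.g. "order_db_0.t_order_1"
    public static String route(long orderId) {
        long dbIndex = orderId % DB_COUNT;       // 100 % 4 = 0
        long tableIndex = orderId % TABLE_COUNT; // 100 % 3 = 1
        return "order_db_" + dbIndex + ".t_order_" + tableIndex;
    }

    public static void main(String[] args) {
        System.out.println(route(100L));         // prints order_db_0.t_order_1
    }
}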

In real development the routing logic does not need to be implemented by hand; many open source frameworks, such as ShardingSphere and TDDL, implement the routing through configuration.


2. Preparation for database and table sharding

2.1 Estimating the number of databases and tables

Before starting work, we must answer the question of how many databases and how many tables to split into. First look at the advice of the Alibaba development manual: sharding is recommended when a single table exceeds 5 million rows or 2 GB of capacity; if the data volume is not expected to reach that level within three years, do not shard when creating the table.

We take the two key figures from this advice, 5 million rows and 3 years, as the baseline for the estimate. Assuming the business data grows by 600,000 rows per day, how should we estimate the number of databases and tables we need?

With a daily increment of 600,000 rows, the total after 3 years is:

Total data in three years = 600,000 × 365 × 3 = 657 million rows

As the business develops, the daily increment will exceed 600,000, so we add redundancy to the total. The redundancy factor depends on the business situation; this article uses a factor of three:

Total with triple redundancy = 657 million × 3 = 1.971 billion rows

Dividing by 5 million rows per table and rounding up to a power of 2 gives the number of tables:

Number of tables = 1.971 billion ÷ 5 million = 394.2, rounded up to the next power of 2 = 512

It is not appropriate to put all 512 tables into one database, because as data volume and concurrent access grow, a single database instance will struggle to keep up. This article places 32 tables in each database instance and calculates the number of databases accordingly:

Number of databases = 512 ÷ 32 = 16
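The whole estimate can be written down in a few lines of Java; every input is one of the assumptions stated above (600,000 rows per day, 3 years, triple redundancy, 5 million rows per table, 32 tables per instance):

public class CapacityEstimator {
    public static void main(String[] args) {
        long dailyIncrement = 600_000L;
        long threeYearTotal = dailyIncrement * 365 * 3;  // 657,000,000
        long withRedundancy = threeYearTotal * 3;        // 1,971,000,000
        double rawTables = withRedundancy / 5_000_000.0; // 394.2
        int tableCount = 1;
        while (tableCount < rawTables) {
            tableCount <<= 1;                            // round up to a power of two: 512
        }
        int dbCount = tableCount / 32;                   // 16
        System.out.printf("tables=%d, databases=%d%n", tableCount, dbCount);
    }
}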


2.2 Choosing the shardingKey

Determining the shardingKey is critical, because once the data is split across multiple databases and tables, the proxy layer can only route requests by the shardingKey. Suppose we choose userId as the shardingKey; subsequent DML operations must then include the userId field. But suppose a scenario where only orderId is available as the query condition: how do we handle it?

The first solution is to design orderId so that it contains a userId-related feature. Then, even when only the order number is available as the query condition, the userId feature can be extracted from it for sharding:

Order number = millisecond timestamp + version number + last six digits of userId + global sequence number
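A sketch of how such an order number could be generated and parsed, assuming the field widths shown here (13-digit millisecond timestamp, 1-digit version, 6-digit userId suffix, 4-digit sequence); this is an illustration, not a production-grade ID scheme:

import java.util.concurrent.atomic.AtomicLong;

public class OrderIdCodec {
    private static final AtomicLong SEQUENCE = new AtomicLong();

    public static String newOrderId(long userId) {
        String userPart = String.format("%06d", userId % 1_000_000);  // last six digits of userId
        String seqPart = String.format("%04d", SEQUENCE.incrementAndGet() % 10_000);
        return System.currentTimeMillis() + "1" + userPart + seqPart; // "1" is the version number
    }

    // Recover the userId feature from an order number for sharding
    public static String userPartOf(String orderId) {
        int end = orderId.length() - 4;                               // strip the sequence part
        return orderId.substring(end - 6, end);
    }
}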

The second solution is data heterogeneity. The core idea is to trade space for time: store a piece of data in multiple data media, organized by different dimensions.

Heterogeneous data in MySQL: we can store another copy of the data in a separate database instance with orderId as the shardingKey, so that orderId can be used there as a query condition.

Heterogeneous data in ES: creating a new database instance for every dimension is unrealistic, so we can synchronize data to Elasticsearch to satisfy multidimensional query requirements.

Heterogeneous data in Hive: MySQL and ES satisfy real-time query requirements, while Hive satisfies offline analysis requirements. Data analysis work such as reports can then use Hive instead of the main database.

A new problem now arises: the business cannot write to multiple data sources in one operation, as that would cause performance and data consistency problems. A pipeline is therefore needed to synchronize data between the data sources.


3. Database and table sharding in practice

Once the preparation is done, we can start splitting the databases and tables. There are many ways to do it, but they all ultimately deal with two kinds of data: stock and increment. Stock data already exists in the old tables; incremental data is the new or modified data that does not yet exist there. Based on these two kinds of data, sharding methods fall into stop-service splits and non-stop-service splits.


3.1 Stop-service split

Stopping service means taking the system offline so that it no longer receives new business data. The old data is therefore static during the split, and all of it becomes stock data. A stop-service split generally has three stages.

In the first stage, we write the proxy layer and the new DAO. The proxy layer decides via a switch whether to access the old tables or the new tables; at this point all traffic still goes to the old tables:

In the second stage, the service is stopped, the application receives no traffic, and the old table data is static. A script then migrates the stock data from the old tables to the new tables:
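A sketch of what such a migration script might look like; everything here is an illustrative assumption: plain JDBC, an auto-increment id column to batch on, and the earlier 4-database / 3-table hash routing rule. Connection setup and error handling are omitted:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StockMigrator {

    public static void migrate(Connection oldDb, Connection[] newDbs) throws SQLException {
        long lastId = 0L;
        while (true) {
            boolean empty = true;
            try (PreparedStatement ps = oldDb.prepareStatement(
                    "SELECT id, order_id, price FROM t_order WHERE id > ? ORDER BY id LIMIT 1000")) {
                ps.setLong(1, lastId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        empty = false;
                        lastId = rs.getLong("id");
                        long orderId = rs.getLong("order_id");
                        int dbIndex = (int) (orderId % newDbs.length);  // route to the target library
                        long tableIndex = orderId % 3;                  // and the target table
                        try (PreparedStatement ins = newDbs[dbIndex].prepareStatement(
                                "INSERT INTO t_order_" + tableIndex + " (id, order_id, price) VALUES (?, ?, ?)")) {
                            ins.setLong(1, lastId);
                            ins.setLong(2, orderId);
                            ins.setLong(3, rs.getLong("price"));
                            ins.executeUpdate();
                        }
                    }
                }
            }
            if (empty) {
                return; // all stock data has been copied
            }
        }
    }
}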

In the third stage, traffic accesses the new tables through the proxy layer. If anything goes wrong, the service can be stopped again to troubleshoot:


3.2 Non-stop service split

The stop-service scheme is relatively simple, but having no business traffic during the split hurts the business. Therefore we generally adopt a non-stop-service split: the databases and tables are split while traffic is still being served.

In the first stage, as before, we write the proxy layer and the new DAO. The proxy layer decides via a switch whether to access the old tables or the new tables; at this point all traffic still goes to the old tables:

In the second stage, double write is enabled: incremental data is inserted and updated not only in the old tables but also in the new tables, and a log or temporary table records the starting ID value of the new tables (section 3.3 shows a proxy that can carry this double write).

In the third stage, the stock data is migrated: a script writes the stock data into the new tables:

In the fourth stage, reads are switched from the old tables to the new tables, which now carry all read and write traffic. Do not stop writing to the old tables immediately.

There are two reasons to keep writing to the old tables. First, if a problem appears when reading the new tables, the read traffic can be switched back to the old tables. Second, the data can be verified: for example, data from both the old and new tables can be synchronized to Hive and checked against each other for a few days to confirm the accuracy of the synchronization.

In the fifth stage, after the new tables have served reads and writes for some time without business problems, writing to the old tables can be stopped:


3.3 Proxy layer implementation

The proxy layer, which switches between the old and new data sources, should intrude on the business-layer code as little as possible, and the adapter pattern is an effective way to reduce that intrusion. Let's start with the old data access object and business service:

// Order data object
public class OrderDO {
    private String orderId;
    private Long price;

    public String getOrderId() {
        return orderId;
    }

    public void setOrderId(String orderId) {
        this.orderId = orderId;
    }

    public Long getPrice() {
        return price;
    }

    public void setPrice(Long price) {
        this.price = price;
    }
}

// The old DAO
public interface OrderDAO {
    public void insert(OrderDO orderDO);
}

// Business service
public class OrderServiceImpl implements OrderService {

    @Resource
    private OrderDAO orderDAO;

    @Override
    public String createOrder(Long price) {
        String orderId = "orderId_123";
        OrderDO orderDO = new OrderDO();
        orderDO.setOrderId(orderId);
        orderDO.setPrice(price);
        orderDAO.insert(orderDO);
        return orderId;
    }
}

Next we introduce the new data source access objects:

// New data object
public class OrderNewDO {
    private String orderId;
    private Long price;

    // getters and setters take the same shape as in OrderDO and are omitted here
}

// New DAO
public interface OrderNewDAO {
    public void insert(OrderNewDO orderNewDO);
}

The adapter pattern reduces the intrusion into business code:

// Proxy layer
public class OrderDAOProxy implements OrderDAO {
    private OrderDAO orderDAO;
    private OrderNewDAO orderNewDAO;

    public OrderDAOProxy(OrderDAO orderDAO, OrderNewDAO orderNewDAO) {
        this.orderDAO = orderDAO;
        this.orderNewDAO = orderNewDAO;
    }

    @Override
    public void insert(OrderDO orderDO) {
        if (ApolloConfig.routeNewDB) {
            OrderNewDO orderNewDO = new OrderNewDO();
            orderNewDO.setPrice(orderDO.getPrice());
            orderNewDO.setOrderId(orderDO.getOrderId());
            orderNewDAO.insert(orderNewDO);
        } else {
            orderDAO.insert(orderDO);
        }
    }
}

// Business service
public class OrderServiceImpl implements OrderService {

    @Resource
    private OrderDAO orderDAO;
    @Resource
    private OrderNewDAO orderNewDAO;

    @Override
    public String createOrder(Long price) {
        String orderId = "orderId_123";
        OrderDO orderDO = new OrderDO();
        orderDO.setOrderId(orderId);
        orderDO.setPrice(price);
        new OrderDAOProxy(orderDAO, orderNewDAO).insert(orderDO);
        return orderId;
    }
}
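For the second stage of the non-stop-service split, the same proxy can also carry the double write. A sketch of how the insert method might be extended, where the doubleWriteNewDB switch is a hypothetical configuration flag rather than part of the original code:

@Override
public void insert(OrderDO orderDO) {
    orderDAO.insert(orderDO);                 // always write the old table
    if (ApolloConfig.doubleWriteNewDB) {      // hypothetical double-write switch
        OrderNewDO orderNewDO = new OrderNewDO();
        orderNewDO.setOrderId(orderDO.getOrderId());
        orderNewDO.setPrice(orderDO.getPrice());
        orderNewDAO.insert(orderNewDO);       // also write the new table
    }
}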


4. Article summary

Database and table sharding forces us to face three problems: the complexity of the scheme itself, the loss of local transactions, and the difficulty of aggregate queries. Sharding is therefore not the first choice for solving the problem of massive data.

If sharding is truly necessary, first estimate the capacity and choose an appropriate shardingKey, then choose a stop-service or non-stop-service scheme according to the actual business. If the non-stop-service scheme is chosen, remember to keep writing to both the new and old tables for a period of time in order to verify the accuracy of the data. I hope this article is helpful to you.


5. Read more

A simple and practical distributed transaction scheme

The interviewer asked whether a large single table must be split into databases and tables


Welcome to follow the public account “JAVA Front” for more articles, covering source code analysis, practical application, architecture thinking, workplace sharing, product thinking, and more. You are also welcome to add my personal WeChat “JAVA_front” so we can communicate and learn together.