Words are bound to go wrong, and if they don’t, thank God!

Preface

In my last article, "Recalling a Unicorn's Dual-System Data Migration and Merge Solution," I described a dual-system data migration and merge solution from a previous company. I was very pleased to see that many readers liked it and left comments. However, some readers felt that my solution was too troublesome and complicated, and that a double-write scheme would have been better. Double-write is indeed a very common and popular solution for data migration. Just a while ago, one of our company's business tables needed horizontal expansion, and we used a double-write scheme for it. However, there are many things to be aware of with double-writing, and it may not be as casual as people think. So I am sharing the experience here; wherever I am wrong, I hope you will criticize and correct me.

Background

A business table in one of the company's business databases was only split into 32 sub-tables at the time, because such a large data volume was not foreseen. Due to the rapid growth of the business, a single sub-table now holds more than 8 million rows, and the largest has reached 10 million. As a direct result, the related business queries keep getting slower (setting aside query patterns, indexes, disks and other factors; personally, I think any database tuning trick counts for little once the absolute data volume is large enough). Therefore, these sub-tables need to be expanded horizontally.

Analysis

When it comes to horizontal expansion of database tables, the simplest approach has to be stop-the-service expansion: take the service completely offline for a night, run the data migration in the background, and once the migration is complete, switch the sub-table routing, bring the service back online, and read the logs on Kibana while drinking coffee. Admittedly, this simple, no-brainer approach works, but the downside is also obvious: what if some users badly need the service during that window? The developers can only shrug, "the system is being upgraded, please try again later," and who wouldn't be upset by that? "I paid for this, I am the customer!" So scaling as smoothly as possible is what we developers, or rather service providers, want. To summarize, the effect we want is: in a single-database, multi-table setup, expand horizontally as smoothly as possible. I define three service roles:

  1. Data business service
  2. Data Migration Service
  3. Data inspection service

Double write

For the horizontal expansion of database tables, many people adopt a double-write scheme. I chose double-write here because the business context only involves a simple single-database, multi-table model, which makes double-writing somewhat easier. Compared with the scenario in the previous article, the biggest factor is whether you cross databases. Once you cross databases, MySQL's transaction mechanism can no longer guarantee consistency and you have to rely on a distributed transaction component. People who casually recommend double-writing either already have distributed transaction infrastructure in place or simply have not thought about it. Moreover, introducing a distributed transaction component such as Seata inevitably increases the complexity and maintenance burden of the system, so careful readers should think these problems through. With double-write, if distributed transactions are not used, you can only rely on a data check program when the data becomes inconsistent, and that cannot guarantee real-time consistency. Beyond that, there are many other factors to consider, which is exactly what this article wants to share with you!

The experiment

Those of you who have read my previous articles know that I am always careful to protect the company's business and code, so here I can only simulate with a simple model; you can also run it on your own machine. Since this article focuses on the horizontal expansion scheme, I will not introduce ShardingSphere, MyCat or other sharding middleware, but simply use modulo routing. In addition, 1,000 users are simulated, with user IDs in the range [0, 999].
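
To make the modulo routing concrete before we look at the schema, here is a tiny sketch of the routing rule assumed throughout this article (routeTable is a hypothetical helper for illustration only; the actual snippets below inline the same logic, for example the (userIdInt & 1) check in the insert code):

// Route a user's orders to a sub-table by user ID modulo the table count.
// Before expansion tableCount is 2 (order_0, order_1); after expansion it is 4.
public static String routeTable(int userId, int tableCount) {
    return "order_" + (userId % tableCount);
}

For example, user 7 lands in order_1 with two tables but in order_3 with four, which is exactly why existing rows have to be migrated when the table count changes.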

Library table

We need a business table, so I’ll take the order table we often see. Let’s assume that there were only two tables order_0 and order_1 in the system, and now we want to double that number to four sub-tables.

CREATE TABLE `order_x` (
  `id` bigint(32) NOT NULL AUTO_INCREMENT COMMENT 'primary key ID',
  `user_id` int(11) DEFAULT NULL COMMENT 'user ID',
  `item` varchar(100) DEFAULT NULL COMMENT 'item name',
  `count` int(11) DEFAULT NULL COMMENT 'item quantity',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The order table contains only four simple fields, as you can see from the comments. Since this is a single database, we only need to consider table splitting; our routing strategy is to route by user ID modulo the number of tables, as sketched above. Now let's insert some data into order_0 and order_1.

@RequestMapping(value = "/insertData", method = RequestMethod.GET)
public Object insertData(a) {
    final String sql0 = Constants.insertMap.get(0);
    final String sql1 = Constants.insertMap.get(1);
    List<Object[]> params0 = Lists.newArrayListWithExpectedSize(1000);
    List<Object[]> params1 = Lists.newArrayListWithExpectedSize(1000);
    int index = 1;
    for (int i = 0; i < 100000; i++) {
        long id = idWorker.nextId();
        String userId = chooseUser();
        int userIdInt = Integer.parseInt(userId);
        String item = RandomStringUtils.random(2.true.false);
        Object[] param = new Object[]{id, userIdInt, item, 1};
        if ((userIdInt & 1) = =0) {
            params0.add(param);
        } else {
            params1.add(param);
        }
        if (i % 2000= =0) {
            jdbcTemplate.batchUpdate(sql0, params0);
            jdbcTemplate.batchUpdate(sql1, params1);
            params0.clear();
            params1.clear();
            log.info("The first" + index++ + "Secondary data insertion"); }}return "SUCCESS";
}

The IDs generated here use the snowflake algorithm. (Surprisingly, some people still do not know that MySQL auto-increment primary keys are not recommended for sharded tables!)

Data inspection service

Now that the data above has been generated, let's start with the simplest piece of work. There really is not much to say about it. As I mentioned in the last article, because this depends on each business's own situation, how to check the data is decided case by case. For example, a banking system should care whether users' balances are correct, and a warehouse system should care whether the inventory of goods is correct; if nothing else works, compare row by row, field by field. In this case, I query how many orders each user placed and the total number of items involved as the check.

/**
 * Check work: verifies the results of two phases
 * 1. Data differences caused by the migration work (the migration work should keep this part error-free)
 * 2. Data differences caused by double-writing normal traffic after the migration work
 * <p>
 * Note that the check work is not a one-off task; it needs to run periodically.
 * For simplicity, the number of orders placed by each user and the total item count are used as the check.
 * Each scenario needs its own in-depth judgment, depending on the reader's specific business.
 */
private void doCheckWork() {
    Map<Integer, String> map1 = Maps.newHashMap();
    map1.put(0, "select count(*) from order_0 where user_id = ?");
    map1.put(1, "select count(*) from order_1 where user_id = ?");
    map1.put(2, "select count(*) from order_2 where user_id = ?");
    map1.put(3, "select count(*) from order_3 where user_id = ?");
    Map<Integer, String> map2 = Maps.newHashMap();
    map2.put(0, "select sum(count) from order_0 where user_id = ?");
    map2.put(1, "select sum(count) from order_1 where user_id = ?");
    map2.put(2, "select sum(count) from order_2 where user_id = ?");
    map2.put(3, "select sum(count) from order_3 where user_id = ?");
    // Since the test data simulates users 0 to 999, treat them as all users
    int pass = 0;
    int fail = 0;
    for (int i = 0; i < 1000; i++) {
        int from = i % 2;   // old routing (2 tables)
        int to = i % 4;     // new routing (4 tables)
        if (from == to) {   // this user's data stays in the same table, nothing to compare
            pass++;
            continue;
        }
        Integer fromCount = jdbcTemplate.queryForObject(map1.get(from), new Object[]{i}, Integer.class);
        Integer toCount = jdbcTemplate.queryForObject(map1.get(to), new Object[]{i}, Integer.class);
        Integer fromSum = jdbcTemplate.queryForObject(map2.get(from), new Object[]{i}, Integer.class);
        Integer toSum = jdbcTemplate.queryForObject(map2.get(to), new Object[]{i}, Integer.class);
        if (Objects.equals(fromCount, toCount) && Objects.equals(fromSum, toSum)) {
            log.info("User ID " + i + " check passed, old table count: " + fromCount + ", new table count: " + toCount
                    + ", old table total items: " + fromSum + ", new table total items: " + toSum);
            pass++;
        } else {
            log.error("User ID " + i + " check failed, old table count: " + fromCount + ", new table count: " + toCount
                    + ", old table total items: " + fromSum + ", new table total items: " + toSum);
            fail++;
        }
    }
    System.out.println("Check passed: " + pass + ", check failed: " + fail);
}

The purpose of the check work is to verify the data in two phases: one is to check the data after the migration work finishes, and the other is to check the data produced during the double-write process, so this work is continuous rather than one-off.

The flow chart

Well, I think I should draw a simple flow chart, otherwise it does not look very professional, and... (I know you all just want to look at pictures rather than read words, haha, same here ^_^)

The first is the flow chart of the data business service

Next is the flow chart of the data migration service

Looking at these two diagrams, and comparing them with the two diagrams in the previous article, you can see that the double-write scheme here is still a bit of a hassle.

Data business service

For the data business service, since the service must stay available while the data migration is in progress, new SQL can arrive at any time during the migration, that is, the four kinds of operations we often talk about: insert, delete, update and select. Let's examine each of them:

  1. Query operations: Since query operations do not change the data of the database table, they can be ignored.
  2. Insert operation: a new order record is double-written, so a record is produced in both the old table and the new table. Note that when the new record is inserted into the new table, the data migration service is in one of two states: 1) the current user's data has already been migrated, in which case the data on both ends stays consistent under the protection of a transaction; 2) the current user's data has not been migrated or is being migrated, in which case the record is inserted into the new table first, and a primary key conflict will occur later during migration. Note that this kind of conflict log is the database's primary key integrity constraint ensuring data consistency on both ends, so it is expected.
  3. Delete operation: here five states need to be considered: 1) the current user's data has already been migrated, and under the protection of a transaction the data on both ends is consistent, so the number of affected rows returned by the delete must be greater than 0; 2) the current user's data has not been migrated yet, so the delete takes effect on the old table, while the new table is unaffected because it has no data; 3) the current user's data is being migrated and the row in question has already been inserted, which is essentially the same as the first case; 4) the current user's data is being migrated, the row has not been inserted yet and has not been loaded into memory either, which is essentially the same as the second case; 5) the current user's data is being migrated, the row has not been inserted yet, but it has already been loaded into memory. This last case needs attention: with plain double-writing, the row deleted from the old table would nevertheless end up inserted into the new table (the delete on the new table affects 0 rows, because the migration service inserts the row later than this operation), leaving dirty data behind. So we need somewhere to store this kind of SQL, that is, the data migration service must be told that it still has to execute this SQL.
  4. Update operation: many readers may feel that delete and update are basically the same, so why is there no extra handling beyond double-write in the flow chart? At first I thought so too: following the delete operation, wouldn't double-write plus appended SQL be enough? But as I was writing, the word "ABA" suddenly popped into my head. Yes, the update operation differs from the delete operation. A delete targets the record as a whole; the record has only two states, deleted or not deleted, so a delete is by nature a final-state operation. An update, however, targets the fields inside the record and is order-sensitive. For example, a business request changes A to B; the data migration service has already loaded the row, but before it writes to the new table, another business request changes B to C. Because the row has not been written to the new table yet, the update on the new table affects 0 rows. The migration service then inserts the stale row it loaded earlier into the new table, and the old table and the new table end up inconsistent! This is shown in the diagram below, and the sketch right after this list shows how double-write plus appended SQL handles it.
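
To make the analysis of the insert and update cases concrete, here is a minimal sketch of the two double-write paths, in the same Spring/JdbcTemplate style as the rest of the article. This is my own illustration under the article's assumptions, not the project's code: isUserMigrated and appendSql are hypothetical stand-ins for the Redis flag check and the per-user append queue shown later, and the per-user distributed lock around the append is omitted for brevity.

// Insert path: write the old table, then the new table; a duplicate key on the new
// table just means the migration service got there first, so it is safe to ignore.
@Transactional
public void insertOrder(long id, int userId, String item, int count) {
    int oldIdx = userId % 2;   // routing before expansion
    int newIdx = userId % 4;   // routing after expansion
    String insertSql = "insert into %s (id, user_id, item, count) values (?, ?, ?, ?)";
    jdbcTemplate.update(String.format(insertSql, "order_" + oldIdx), id, userId, item, count);
    if (oldIdx != newIdx) {    // users whose routing does not change need no second write
        try {
            jdbcTemplate.update(String.format(insertSql, "order_" + newIdx), id, userId, item, count);
        } catch (DuplicateKeyException e) {
            // The migration service already copied this row; the primary key constraint
            // is what keeps the two ends consistent, so the conflict can be ignored.
        }
    }
}

// Update path: if the new table was not affected and the user has not finished
// migrating, remember the SQL so the migration service replays it after its insert,
// instead of losing the update (the "ABA" ordering case described above).
@Transactional
public void updateOrderCount(long orderId, int userId, int newCount) {
    int oldIdx = userId % 2;
    int newIdx = userId % 4;
    String updateSql = "update %s set count = ? where id = ?";
    jdbcTemplate.update(String.format(updateSql, "order_" + oldIdx), newCount, orderId);
    if (oldIdx != newIdx) {
        int affected = jdbcTemplate.update(String.format(updateSql, "order_" + newIdx), newCount, orderId);
        if (affected == 0 && !isUserMigrated(userId)) {
            appendSql(userId, String.format(updateSql, "order_" + newIdx), new Object[]{newCount, orderId});
        }
    }
}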

Because of these special (and, admittedly, fairly extreme) cases, naive double-writing definitely deserves extra consideration. In this implementation, I use a ConcurrentHashMap to store, per user, the SQL that still needs to be executed, with each user's SQL held in a ConcurrentLinkedQueue. These two data structures ensure concurrency safety and ordering respectively.

Of course, I am not sure whether you noticed, but I use a distributed lock when obtaining this in-memory queue. Let's look at the implementation:

public static Map<Integer, ConcurrentLinkedQueue<Tuple2<String, Object[]>>> appendSQLs = new ConcurrentHashMap<>();

RLock rLock = redissonClient.getLock(Constants.D_KEY + userId);
ConcurrentLinkedQueue<Tuple2<String, Object[]>> queue = null;
try {
    rLock.lock();
    queue = Constants.appendSQLs.get(userId);
    if (queue == null) {
        queue = new ConcurrentLinkedQueue<>();
        Constants.appendSQLs.put(userId, queue);
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    rLock.unlock();
}

Although ConcurrentHashMap and ConcurrentLinkedQueue are each thread-safe on their own, a get-then-put sequence is no longer atomic. In a concurrent environment, this is where the distributed lock comes in.
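
As an aside, within a single JVM the same get-or-create step could be written atomically with ConcurrentHashMap's computeIfAbsent, as in the sketch below; presumably the distributed lock above is also meant to coordinate with the migration service draining the queue, which a single local map call cannot cover, so treat this only as a local simplification.

// Atomic get-or-create on the in-memory map; only equivalent within one JVM.
ConcurrentLinkedQueue<Tuple2<String, Object[]>> queue =
        Constants.appendSQLs.computeIfAbsent(userId, k -> new ConcurrentLinkedQueue<>());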

Data Migration Service

In fact, once you know the overall process, especially after the data business service described above, this step becomes clearer. The basic principle is to query the relevant records per user; since the data volume may be fairly large, fetch it page by page and insert it into the new table. After the insert completes, check whether there is any appended SQL left, and if so keep executing it until the queue is drained. Now, does this place need a lock? I'll give you a minute to think about it... OK, I know you weren't really thinking, hahaha. I bring this up because I thought about it while writing. Although ConcurrentLinkedQueue is thread-safe, appended SQL may be enqueued while the migration service is dequeuing. Why worry about that? Because there could be a situation where the migration service has consumed all the appended SQL and is about to move on to the next user; if new SQL is appended at that moment, that in-memory queue will never be consumed. For inserts and deletes this does not matter, but updates need extra care. Therefore each user needs a flag indicating whether their migration is complete. Then, during an update, we only need to check whether the user has finished migrating: if so, double-write inside the transaction; if not, append the SQL. So the flags need to be set up in advance:

/**
 * Initializes the user migration flags in Redis ("0" means not migrated yet)
 */
@RequestMapping(value = "/initRedisUser", method = RequestMethod.GET)
public Object initRedisUser() {
    for (int i = 0; i < 1000; i++) {
        RBucket<String> redisUser = redissonClient.getBucket(Constants.USER_KEY + i);
        redisUser.set("0");
    }
    return "SUCCESS";
}
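
With the flags in place, here is a minimal sketch of how the data migration service might process one user, following the flow described above. This is my own illustration rather than the project's code: loadPageFromOldTable, insertIntoNewTable and PAGE_SIZE are hypothetical helpers, the Tuple2 accessors (getT1/getT2) are assumed, and taking the per-user lock around the final drain and the flag flip is one way to close the race discussed above.

// Sketch of the data migration service for one user: copy the user's rows page by page
// into the new table, then, under the same per-user distributed lock the business
// service uses, drain the appended SQL and mark the user as migrated.
public void migrateUser(int userId) {
    int page = 0;
    List<Object[]> rows;
    while (!(rows = loadPageFromOldTable(userId, page++, PAGE_SIZE)).isEmpty()) {
        for (Object[] row : rows) {
            try {
                insertIntoNewTable(userId, row);              // "order_" + (userId % 4)
            } catch (DuplicateKeyException e) {
                // Already written by the business service's double-write; skip it.
            }
        }
    }
    RLock rLock = redissonClient.getLock(Constants.D_KEY + userId);
    rLock.lock();
    try {
        // Replay SQL appended by the business service while this user was migrating.
        ConcurrentLinkedQueue<Tuple2<String, Object[]>> queue = Constants.appendSQLs.get(userId);
        if (queue != null) {
            Tuple2<String, Object[]> sql;
            while ((sql = queue.poll()) != null) {
                jdbcTemplate.update(sql.getT1(), sql.getT2());
            }
        }
        // Flip the flag so the business service switches to plain double-write for this user.
        redissonClient.<String>getBucket(Constants.USER_KEY + userId).set("1");
    } finally {
        rLock.unlock();
    }
}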

Memory queue

This is where the appended SQL is stored, in an in-memory queue. You can compare this with the technology choice in the previous article, where the storage medium was MySQL. You may wonder why MySQL is not used this time: what if the machine crashes? And with multiple machines, wouldn't each machine maintain its own queue? If you had these concerns, you are being admirably careful. In actual operation I did use MySQL as storage, precisely because of its transaction mechanism. But since this is a single-machine experiment, the in-memory queue is simpler and more efficient (OK, I admit I was too lazy to write the extra code). So if we really do use the in-memory queue as the storage medium on a single machine, then when the service crashes, the appended SQL in memory is lost. There are two ways to handle this:

  1. Truncate the new tables and restart the data migration (of course, this is not recommended).
  2. Based on the user flags written in Redis, we can determine which users have not been fully migrated, and simply start migrating those users again. What needs attention, however, is that when the service crashed, part of such a user's data may already have been inserted into the new table; that partial data has to be deleted before restarting, because the appended SQL no longer exists in the in-memory queue and the state of that data cannot be made consistent with the old table, unless you compare it field by field against the old table by primary key. A sketch of this recovery follows the list.
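
A minimal sketch of the second recovery option, under the same assumptions as the earlier sketches (a per-user Redis flag, and a hypothetical migrateUser helper like the one sketched in the migration section); this is illustrative only, not the project's code:

// Recovery after a crash: for every user whose Redis flag is still "0" and whose
// routing actually changes, wipe the partially copied rows in the new table and
// migrate that user again from scratch.
public void recoverAfterCrash() {
    for (int userId = 0; userId < 1000; userId++) {
        if (userId % 2 == userId % 4) {
            continue;   // this user's rows never move to a different table
        }
        RBucket<String> flag = redissonClient.getBucket(Constants.USER_KEY + userId);
        if ("0".equals(flag.get())) {
            // Partial rows may have been copied before the crash; delete them first,
            // because the appended SQL that would have reconciled them is lost.
            jdbcTemplate.update("delete from order_" + (userId % 4) + " where user_id = ?", userId);
            migrateUser(userId);   // re-run the per-user migration sketched earlier
        }
    }
}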

Specific operation

Next, let me summarize the concrete steps of the experiment:

  1. /initRedisUser: call this interface to initialize the user flags in Redis
  2. /migrationWork: call this interface to perform the data migration
  3. /doubleWrite: call this interface to simulate user business traffic
  4. /checkWork: call this interface to check whether the data on both ends is consistent

You can give it a run locally and then check the logs. I use IDEA; remember to redirect the console logs to a file, which makes them much easier to inspect.
The complete code has also been uploaded to GitHub.

Extension

So at this point, it may look like yet another padded-out article (but I really did put effort into writing it, heh). Let's see whether this is worth thinking about further. Even though the company's expanded tables have since withstood a wave of real business traffic, I kept wondering: what if it were a cross-database scenario? Would this kind of double-write still work? This is exactly what I said at the beginning: because this is a single-database, multi-table operation, MySQL transactions can be used. In a multi-database setup, however, we can only rely on distributed transaction components such as Alibaba's Seata, and without that kind of infrastructure already in place, rashly adopting such a technology will definitely backfire. In fact, much of the process above borrows ideas from the previous article, but the two end states differ: one relies on transactions to double-write, while the other keeps both ends consistent by consuming messages from RocketMQ. One is true double-writing, and the other looks more like replication from a primary database. Having said all this, my point is to give you a train of thought: choose the technology and approach that fits your own business scenario, rather than parroting others, and keep your own thinking. I believe that only by learning to think a little will we make a little progress!

Finally

If this article helped you, please give it a thumbs-up. Also attached here is my GitHub address: github.com/showyool/ju…

Welcome to follow my WeChat official account: [a code with temperature]