Last week, Songgo republished an article on batch data insertion, which discussed how to make batch inserts fast.

After reading the article, a friend put forward a different opinion:

Songgo had a serious chat with this friend and basically understood what he meant, so I wrote a test case and reorganized today’s article. I hope to discuss this problem with everyone, and better solutions are welcome.

1. Analysis of ideas

For batch inserts via JDBC, there are really two approaches:

  1. Use a for loop to insert records one by one (this requires batch processing to be enabled).
  2. Generate a single insert SQL statement, something like insert into user(username,address) values('aa','bb'),('cc','dd')....
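For illustration, here is a minimal sketch of how the second approach's SQL could be assembled on the application side (the class and helper names are made up for this example, and literal values are inlined only to show the statement shape — real code must escape or parameterize values):

```java
import java.util.List;

public class MultiValuesInsertDemo {
    // Assemble one "insert ... values (...),(...)" statement from a list of
    // (username, address) rows. Illustration only: values are inlined as
    // literals, which is unsafe for untrusted input.
    static String buildMultiValuesInsert(List<String[]> rows) {
        StringBuilder sql = new StringBuilder("insert into user(username,address) values");
        for (int i = 0; i < rows.size(); i++) {
            String[] r = rows.get(i);
            sql.append(i == 0 ? "" : ",")
               .append("('").append(r[0]).append("','").append(r[1]).append("')");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        String sql = buildMultiValuesInsert(List.of(
                new String[]{"aa", "bb"},
                new String[]{"cc", "dd"}));
        System.out.println(sql);
        // insert into user(username,address) values('aa','bb'),('cc','dd')
    }
}
```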

Which one is faster?

Let’s consider this in two ways:

  1. The efficiency of executing the insert SQL itself.
  2. Network I/O.

Let’s start with the first option, which uses the for loop to insert:

  • The advantage of this solution is that JDBC’s PreparedStatement supports precompilation: once a statement is precompiled it is cached, so subsequent executions of the same SQL are faster. On top of that, JDBC can enable batch execution, which is very powerful.
  • The disadvantage is that the database server and the application server are often not the same machine, so network I/O must be considered. If network I/O is costly, it may slow down the whole insert.

The second option is to generate an SQL insert:

  • The advantage of this solution is that there is only one network I/O (even if the statement has to be sharded, it is still only a few network I/Os), so very little time is spent on the network.
  • Of course, this scheme also has several disadvantages. First, the SQL is very long and may even have to be sharded and processed in batches. Second, the precompilation advantage of PreparedStatement cannot be exploited: the SQL must be parsed from scratch and cannot be reused. Third, because the generated SQL is so long, the database also needs time to parse it.

So what it comes down to is: do we spend more time on network I/O than on the SQL insert itself? That is the core question we have to consider.

2. Data testing

Now let’s do a simple test: batch insert 50,000 records.

Start with a simple test table:

CREATE TABLE `user` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(255) DEFAULT NULL,
  `address` varchar(255) DEFAULT NULL,
  `password` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Create a Spring Boot project, import the MyBatis dependency and the MySQL driver, and then configure the database connection information in application.properties:

spring.datasource.username=root
spring.datasource.password=123
spring.datasource.url=jdbc:mysql:///batch_insert?serverTimezone=Asia/Shanghai&rewriteBatchedStatements=true

Notice the extra parameter in the database connection URL, rewriteBatchedStatements — this is the key.

By default, the MySQL JDBC driver ignores the intent of executeBatch() and sends the batched SQL statements to the MySQL server one by one, so a "batch" insert is actually a series of single inserts with poor performance. Setting rewriteBatchedStatements to true makes the driver execute the SQL in true batches.
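Conceptually, the rewrite collapses N queued parameter sets for a single-row insert template into one multi-values statement, turning N network round trips into one. The sketch below simulates that idea in plain Java; it is an illustration of the concept, not the driver's actual implementation (class and method names are made up):

```java
import java.util.List;

public class RewriteBatchDemo {
    // Simulate, conceptually, what rewriteBatchedStatements=true does: take a
    // single-row insert template and N parameter sets, and produce one
    // multi-values statement. Not the real driver code.
    static String rewrite(String template, List<String[]> paramSets) {
        int valuesPos = template.toLowerCase().indexOf("values");
        // Keep everything up to and including the "values" keyword.
        String head = template.substring(0, valuesPos + "values".length());
        StringBuilder sql = new StringBuilder(head);
        for (int i = 0; i < paramSets.size(); i++) {
            sql.append(i == 0 ? " (" : ",(");
            String[] p = paramSets.get(i);
            for (int j = 0; j < p.length; j++) {
                sql.append(j == 0 ? "'" : ",'").append(p[j]).append("'");
            }
            sql.append(")");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        String sql = rewrite("insert into user (username,address,password) values (?,?,?)",
                List.of(new String[]{"u1", "a1", "p1"}, new String[]{"u2", "a2", "p2"}));
        System.out.println(sql);
        // insert into user (username,address,password) values ('u1','a1','p1'),('u2','a2','p2')
    }
}
```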

OK, that’s all we need to do.

2.1 Test of Scheme 1

First, let’s look at the test of Scheme 1: inserting records one by one (with batch processing enabled).

Create the corresponding mapper as follows:

@Mapper
public interface UserMapper {
    Integer addUserOneByOne(User user);
}

The corresponding XML file is as follows:

<insert id="addUserOneByOne">
    insert into user (username,address,password) values (#{username},#{address},#{password})
</insert>

The service is as follows:

@Service
public class UserService extends ServiceImpl<UserMapper, User> implements IUserService {
    private static final Logger logger = LoggerFactory.getLogger(UserService.class);
    @Autowired
    UserMapper userMapper;
    @Autowired
    SqlSessionFactory sqlSessionFactory;

    @Transactional(rollbackFor = Exception.class)
    public void addUserOneByOne(List<User> users) {
        SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH);
        UserMapper um = session.getMapper(UserMapper.class);
        long startTime = System.currentTimeMillis();
        for (User user : users) {
            um.addUserOneByOne(user);
        }
        session.commit();
        long endTime = System.currentTimeMillis();
        logger.info("Insert SQL one by one takes time {}", (endTime - startTime));
    }
}

Here I would like to say:

Although records are inserted one by one, we need to enable BATCH mode so that only one SqlSession is used throughout. Without BATCH mode, repeatedly obtaining and releasing a Connection consumes a lot of time and the efficiency is extremely low — so low that Songgo did not bother to test that variant.

Write a simple test endpoint to try it:

@RestController
public class HelloController {
    private static final Logger logger = LoggerFactory.getLogger(HelloController.class);
    @Autowired
    UserService userService;

    /**
     * Insert one by one.
     */
    @GetMapping("/user2")
    public void user2() {
        List<User> users = new ArrayList<>();
        for (int i = 0; i < 50000; i++) {
            User u = new User();
            u.setAddress("Guangzhou:" + i);
            u.setUsername("Zhang SAN:" + i);
            u.setPassword("123:" + i);
            users.add(u);
        }
        userService.addUserOneByOne(users);
    }
}

Write a simple unit test:

/**
 * The transaction on this unit test makes the insert roll back automatically,
 * so it does not affect the next test result.
 * Insert one by one.
 */
@Test
@Transactional
void addUserOneByOne() {
    List<User> users = new ArrayList<>();
    for (int i = 0; i < 50000; i++) {
        User u = new User();
        u.setAddress("Guangzhou:" + i);
        u.setUsername("Zhang SAN:" + i);
        u.setPassword("123:" + i);
        users.add(u);
    }
    userService.addUserOneByOne(users);
}

As you can see, it takes 901 ms: 50,000 records inserted in under 1 second.

2.2 Test of Scheme 2

The second option is to generate a single SQL statement and insert everything in one go.

The mapper is as follows:

@Mapper
public interface UserMapper {
    void addByOneSQL(@Param("users") List<User> users);
}

The SQL is as follows:

<insert id="addByOneSQL">
    insert into user (username,address,password) values
    <foreach collection="users" item="user" separator=",">
        (#{user.username},#{user.address},#{user.password})
    </foreach>
</insert>

The service is as follows:

@Service
public class UserService extends ServiceImpl<UserMapper, User> implements IUserService {
    private static final Logger logger = LoggerFactory.getLogger(UserService.class);
    @Autowired
    UserMapper userMapper;
    @Autowired
    SqlSessionFactory sqlSessionFactory;
    @Transactional(rollbackFor = Exception.class)
    public void addByOneSQL(List<User> users) {
        long startTime = System.currentTimeMillis();
        userMapper.addByOneSQL(users);
        long endTime = System.currentTimeMillis();
        logger.info("Merge into a single SQL insert takes time {}", (endTime - startTime));
    }
}

Then call this method in a unit test:

/**
 * Merge into a single SQL insert.
 */
@Test
@Transactional
void addByOneSQL() {
    List<User> users = new ArrayList<>();
    for (int i = 0; i < 50000; i++) {
        User u = new User();
        u.setAddress("Guangzhou:" + i);
        u.setUsername("Zhang SAN:" + i);
        u.setPassword("123:" + i);
        users.add(u);
    }
    userService.addByOneSQL(users);
}

You can see that it took 1805 ms to insert the 50,000 records.

As you can see, generating one big SQL statement is somewhat less efficient.

Note that Scheme 2 has another problem: when the data volume is large, the generated SQL becomes extremely long, and MySQL may not be able to handle such a large statement in one go. You then have to either modify the MySQL configuration (raise the packet size limit, max_allowed_packet) or shard the data before inserting, and both of those make the insert take even longer.
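A back-of-envelope estimate shows why. With rows like ('Zhang SAN:49999','Guangzhou:49999','123:49999'), each value tuple is roughly 45–50 bytes, so 50,000 rows produce a statement of a few megabytes. The numbers below (48 bytes per row, the 4 MB limit) are illustrative assumptions; server defaults for max_allowed_packet vary by MySQL version, with newer versions defaulting much higher.

```java
public class PacketSizeEstimate {
    // Back-of-envelope size of the single generated SQL statement.
    // bytesPerRow is an assumed average for tuples like ('Zhang SAN:49999',...).
    static long estimateSqlBytes(int rows, int bytesPerRow) {
        long header = "insert into user (username,address,password) values ".length();
        return header + (long) rows * bytesPerRow;
    }

    public static void main(String[] args) {
        long bytes = estimateSqlBytes(50_000, 48);
        System.out.printf("~%.1f MB%n", bytes / (1024.0 * 1024));
        // ~2.3 MB: this fits under a 4 MB max_allowed_packet, but a few hundred
        // thousand rows would not -- hence the need to shard the insert or
        // raise the server setting.
    }
}
```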

2.3 Comparative Analysis

Clearly, Scheme 1 has the advantage. When 100,000 or 200,000 records are inserted in batches, the advantage of Scheme 1 becomes even more obvious (Scheme 2 requires modifying the MySQL configuration or sharding the inserted data).

3. What did MP do?

MyBatis-Plus also has a batch insert method, saveBatch; let’s look at its implementation:

@Transactional(rollbackFor = Exception.class)
@Override
public boolean saveBatch(Collection<T> entityList, int batchSize) {
    String sqlStatement = getSqlStatement(SqlMethod.INSERT_ONE);
    return executeBatch(entityList, batchSize, (sqlSession, entity) -> sqlSession.insert(sqlStatement, entity));
}

As you can see, the sqlStatement obtained here is INSERT_ONE, i.e., a single-row insert.

The executeBatch method looks like this:

public static <E> boolean executeBatch(Class<?> entityClass, Log log, Collection<E> list,
        int batchSize, BiConsumer<SqlSession, E> consumer) {
    Assert.isFalse(batchSize < 1, "batchSize must not be less than one");
    return !CollectionUtils.isEmpty(list) && executeBatch(entityClass, log, sqlSession -> {
        int size = list.size();
        int i = 1;
        for (E element : list) {
            consumer.accept(sqlSession, element);
            if ((i % batchSize == 0) || i == size) {
                sqlSession.flushStatements();
            }
            i++;
        }
    });
}

Note that the third argument to executeBatch is a lambda expression, which is the core logic of MP’s batch insert. MP first shards the data (the default shard size is 1000) and then inserts the records one by one within each shard. If you dig further into executeBatch, you will find that the sqlSession here is actually a batch sqlSession, not an ordinary one.
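The flush rule in that lambda — `(i % batchSize == 0) || i == size` — flushes a full batch every batchSize statements and flushes once more at the tail if the last shard is partial. A small simulation that just counts the flush points:

```java
public class BatchFlushDemo {
    // Count how many times MP's flush rule fires for a list of `size`
    // statements with the given batchSize (mirrors the condition in
    // executeBatch: flush every batchSize statements, plus the tail).
    static int countFlushes(int size, int batchSize) {
        int flushes = 0;
        int i = 1;
        for (int n = 0; n < size; n++) {
            if ((i % batchSize == 0) || i == size) {
                flushes++;
            }
            i++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println(countFlushes(50_000, 1000)); // 50: one flush per full shard
        System.out.println(countFlushes(50_500, 1000)); // 51: one extra flush for the partial tail
    }
}
```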

To sum up, MP’s batch insert scheme is essentially the same as the batch insert approach in Section 2.1.

4. Summary

OK, after the above analysis, do you now know how to do batch inserts?

The demo project contains three unit test methods; run them directly to see the time difference between the batch insert approaches (the database script is in the resources directory).

Interested friends may want to give it a try~

Finally, thanks again to this friend for the suggestion~