1. Foreword
Recently, while working on a piece of company business, I ran into a scenario that required inserting a large amount of data into multiple tables of another database, which led to this article: notes on batch-inserting data with MyBatis, plus a simple functional-programming encapsulation of the batch insert method.
Like me, most Java backend developers inevitably use MyBatis to operate on the database in everyday CRUD work.
In a SpringBoot project, once MyBatis is introduced, you can inject a mapper and implement create, read, update, and delete.
For example, to add a single row, you can write this in the mapper.xml file:
<insert id="testInsert">
insert into massive_data_insert_test
(value1,value2)
values
(#{value1},#{value2})
</insert>
Then, in the service layer, call mapper.insertOneItem(insertItem) and you're done.
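For completeness, here is a minimal sketch of what the corresponding mapper interface might look like (InsertItem is a stand-in entity class of my own naming, not shown in the original):

@Repository
public interface TestMapper {
    // bound to the <insert id="insertOneItem"> statement in mapper.xml
    void insertOneItem(InsertItem item);
}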
If you want to insert multiple rows, you might write:
for(int i = 0; i < insertList.size(); i++){
mapper.insertOneItem(insertList.get(i));
}
Simply put the insert in a loop and call the mapper's insert method over and over. This is easy to write and easy to understand, and with small amounts of data it isn't a problem.
But when there is too much data, problems appear.
In fact, every mapper method call goes through the whole cycle of obtaining a database connection, creating a PreparedStatement, and executing the SQL.
We know the number of database connections is limited (and establishing a connection is expensive). If the for loop runs too many times, not only does performance deteriorate, but the database can get clogged and the program may even crash. Of course, we can configure a database connection pool (HikariCP, Druid, and so on) to reuse connections, but that still wastes resources on thousands of round trips.
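For reference, a minimal sketch of configuring such a pool with HikariCP; the JDBC URL and credentials are placeholders, not from the original project:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

public class DataSourceConfig {
    public static DataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/supply"); // placeholder
        config.setUsername("user");     // placeholder
        config.setPassword("password"); // placeholder
        config.setMaximumPoolSize(10);  // cap the number of physical connections to reuse
        return new HikariDataSource(config);
    }
}

In a SpringBoot project you would normally not write this by hand; setting spring.datasource.hikari.maximum-pool-size in the configuration file achieves the same thing.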
In short: if something can be done in one round trip, don't do it in many.
Most databases support inserting multiple rows with a single INSERT statement:
insert into table
(value1, value2)
values ('v1', 'v2'), ('v3', 'v4'), ('v1', 'v5')
MyBatis also provides a way to manipulate data in batches: the <foreach> tag in a dynamic SQL statement will concatenate an SQL statement like the one above for us.
We write the following in mapper.xml:
<insert id="testMassiveInsert">
insert into massive_data_insert_test
(value1,value2)
values
<foreach collection="list" item="item" separator=",">
(#{item.value1},#{item.value2})
</foreach>
</insert>
This way we only need one mapper call to achieve the effect of the for-loop code above.
The dynamic SQL splices the contents of the passed-in list into a single prepared insert statement, e.g. insert into massive_data_insert_test (value1,value2) values (?,?), (?,?), ... with the values bound as parameters.
This approach is clearly better than the initial for-loop implementation.
2. An error occurs when batch-inserting tens of thousands of rows
However, when we use the concatenated-SQL approach above for bulk inserts, the amount of data can still become a problem!
Let's experiment with what happens when we batch-insert 40,000 rows.
Start by creating the target table for the inserted data: massive_data_insert_test.
CREATE TABLE "supply"."massive_data_insert_test" (
"value1" varchar(255) COLLATE "pg_catalog"."default",
"value2" varchar(255) COLLATE "pg_catalog"."default"
);
Next, create a mapper in the SpringBoot project. Below are the mapper interface and the SQL statement in the mapper.xml file.
TestMapper.java
@Repository
public interface TestMapper {
void testMassiveInsert(List<HashMap> list);
}
TestMapper.xml
<insert id="testMassiveInsert">
insert into massive_data_insert_test
(value1,value2)
values
<foreach collection="list" item="item" separator=",">
(#{item.value1},#{item.value2})
</foreach>
</insert>
Test code:
@Service
@Slf4j
public class TestService {

    // The maximum number of rows per batch insert
    private static final int maxInsertItemNumPerTime = 500;

    private final TestMapper mapper;

    @Autowired
    public TestService(TestMapper mapper) {
        this.mapper = mapper;
    }

    public Result testMassiveInsert() {
        long startTime = System.currentTimeMillis(); // get the start time
        List<HashMap> list = new LinkedList<>();
        // build a list with a length of 500 * 80 = 40000
        for (int i = 0; i < maxInsertItemNumPerTime * 80; i++) {
            HashMap map = new HashMap();
            map.put("value1", "value1" + i);
            map.put("value2", "value2" + i);
            list.add(map);
        }
        // direct batch insert
        try {
            mapper.testMassiveInsert(list);
        } catch (RuntimeException e) {
            log.info("Direct batch insert of {} rows failed: {}", list.size(), e.getMessage());
            throw new RuntimeException(e);
        }
        long endTime = System.currentTimeMillis(); // get the end time
        return Result.ok().message("Program runtime: " + (endTime - startTime) + "ms");
    }
}
When the direct bulk insert above is executed, an I/O error is reported immediately. Why is that?
As mentioned above, the dynamic SQL concatenates the contents of the passed-in parameters into a single insert statement. The error occurs because there is so much content to concatenate that the resulting SQL becomes too long, triggering the I/O error.
And the length of the SQL depends not only on the number of rows but also on the number of insert parameters per row: it grows linearly with the product of the two (for our two-column table, 40,000 rows means 80,000 bind parameters in one statement). So the more parameters each row carries, the more tightly the batch size must be controlled.
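To make that concrete: many drivers cap the number of bind parameters in one prepared statement; the PostgreSQL wire protocol, for instance, encodes the parameter count as a 16-bit integer, so the hard cap sits in the low tens of thousands. A back-of-the-envelope helper (the limit value is driver-specific and only an assumption here):

// rough upper bound on rows per multi-row insert for a given driver parameter limit
static int maxRowsPerInsert(int columnsPerRow, int driverParamLimit) {
    return driverParamLimit / columnsPerRow;
}
// e.g. maxRowsPerInsert(2, 32767) ≈ 16383 rows for a two-column table like ours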
So what's the solution? The simplest one is batched inserts: much like the for-loop at the start, except that each pass of the loop performs a multi-row insert.
Since the data is no longer inserted in one shot, the method must be wrapped in a transaction so that if any batch fails, everything inserted so far is rolled back (but see the caveat about parallelStream in the code below).
@Transactional
public Result testMassiveInsert() {
    long startTime = System.currentTimeMillis(); // get the start time
    List<HashMap> list = new LinkedList<>();
    // build a list with a length of 500 * 80 = 40000
    for (int i = 0; i < maxInsertItemNumPerTime * 80; i++) {
        HashMap map = new HashMap();
        map.put("value1", "value1" + i);
        map.put("value2", "value2" + i);
        list.add(map);
    }
    // batched inserts
    try {
        if (list.size() > maxInsertItemNumPerTime) {
            List<List<HashMap>> all = new ArrayList<>();
            int i = 0;
            while (i < list.size()) {
                // Math.min guards the last chunk when the size is not an exact multiple
                List<HashMap> subList = list.subList(i, Math.min(list.size(), i + maxInsertItemNumPerTime));
                i = i + maxInsertItemNumPerTime;
                all.add(subList);
            }
            // caveat: parallelStream runs on ForkJoinPool worker threads, which the
            // thread-bound @Transactional does not cover; use a sequential stream
            // if strict all-or-nothing rollback is required
            all.parallelStream().forEach(o -> mapper.testMassiveInsert(o));
        } else {
            mapper.testMassiveInsert(list);
        }
    } catch (RuntimeException e) {
        log.info("Batch insert of {} rows failed: {}", list.size(), e.getMessage());
        throw new RuntimeException(e);
    }
    long endTime = System.currentTimeMillis(); // get the end time
    return Result.ok().message("Program runtime: " + (endTime - startTime) + "ms");
}
We control the length of each batch insert by setting a maxInsertItemNumPerTime.
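As an aside: if Guava already happens to be on your classpath, the hand-written splitting loop can be replaced with Lists.partition, which returns consecutive sublists of the given size. A sketch only; the original code does not use Guava:

import com.google.common.collect.Lists;
import java.util.HashMap;
import java.util.List;

// equivalent splitting with Guava; behavior matches the while-loop above
private void batchInsertWithGuava(List<HashMap> list) {
    List<List<HashMap>> batches = Lists.partition(list, maxInsertItemNumPerTime);
    batches.forEach(mapper::testMassiveInsert);
}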
3. Simple tests
Here is my simple test: insert 40,000 rows in total, but with different maxInsertItemNumPerTime values, and compare the program runtimes. I did not run it many times and average, and there may have been network jitter, so treat it as a rough test. The batch sizes tried were:
- 2000
- 1000
- 500
- 250
In the end I chose 500 as the batch size. As mentioned above, even with the connection reuse a pool provides, too many round trips to the database still degrade performance, so a larger maxInsertItemNumPerTime is not always better.
At the same time, as maxInsertItemNumPerTime grows, the per-batch SQL processing time (the concatenation) also grows, and the growth appears to be faster than linear (some sources I checked seem to confirm this guess). Otherwise the 2000-per-batch run would not have been so much slower than the 500-per-batch run.
In real business code you should likewise run a quick test to pick a suitable value; any measurement beats none.
4. Some extensions
In fact, the official MyBatis Dynamic SQL documentation describes another way to do multi-row inserts.
However, since my company's projects operate the database through scanned mappers, and such big-data insertion scenarios are genuinely rare, I did not go out of my way to introduce the approach below. Quoting the documentation:
A multiple row insert is a single insert statement that inserts multiple rows into a table. This can be a convenient way to insert a few rows into a table, but it has some limitations:
- Since it is a single SQL statement, you could generate quite a lot of prepared statement parameters. For example, suppose you wanted to insert 1000 records into a table, and each record had 5 fields. With a multiple row insert you would generate a SQL statement with 5000 parameters. There are limits to the number of parameters allowed in a JDBC prepared statement – and this kind of insert could easily exceed those limits. If you want to insert many records, you should probably use a JDBC batch insert instead (see below)
- The performance of a giant insert statement may be less than you expect. If you have many records to insert, it will almost always be more efficient to use a JDBC batch insert (see below). With a batch insert, the JDBC driver can do some optimization that is not possible with a single large statement
- Retrieving generated values with multiple row inserts can be a challenge. MyBatis currently has some limitations related to retrieving generated keys in multiple row inserts that require special considerations (see below)
Nevertheless, there are use cases for a multiple row insert – especially when you just want to insert a few records in a table and don’t need to retrieve generated keys. In those situations, a multiple row insert will be an easy solution.
try (SqlSession session = sqlSessionFactory.openSession()) {
GeneratedAlwaysAnnotatedMapper mapper = session.getMapper(GeneratedAlwaysAnnotatedMapper.class);
List<GeneratedAlwaysRecord> records = getRecordsToInsert(); // not shown
MultiRowInsertStatementProvider<GeneratedAlwaysRecord> multiRowInsert = insertMultiple(records)
.into(generatedAlways)
.map(id).toProperty("id")
.map(firstName).toProperty("firstName")
.map(lastName).toProperty("lastName")
.build()
.render(RenderingStrategies.MYBATIS3);
int rows = mapper.insertMultiple(multiRowInsert);
}
This is the sample code in the documentation.
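Besides multi-row inserts, classic MyBatis also supports the "JDBC batch insert" the documentation keeps recommending: open a session with ExecutorType.BATCH so the driver accumulates statements and sends them in batches. A minimal sketch, assuming an injected sqlSessionFactory and the single-row insertOneItem mapper method from the beginning of this article:

try (SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH)) {
    TestMapper mapper = session.getMapper(TestMapper.class);
    for (int i = 0; i < list.size(); i++) {
        mapper.insertOneItem(list.get(i));
        if ((i + 1) % 500 == 0) {
            session.flushStatements(); // hand the accumulated batch to the JDBC driver
        }
    }
    session.flushStatements(); // flush the final partial batch
    session.commit();
}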
5. Further optimizing the code
Since what I actually had to do at work was import data into multiple tables, using the batching method above means every table's insert carries the same boilerplate:
try {
    if (list.size() > maxInsertItemNumPerTime) {
        List<List<HashMap>> all = new ArrayList<>();
        int i = 0;
        while (i < list.size()) {
            List<HashMap> subList = list.subList(i, Math.min(list.size(), i + maxInsertItemNumPerTime));
            i = i + maxInsertItemNumPerTime;
            all.add(subList);
        }
        all.parallelStream().forEach(o -> mapper.testMassiveInsert(o));
    }
} catch (RuntimeException e) {
    log.info("Batch insert of {} rows failed: {}", list.size(), e.getMessage());
    throw new RuntimeException(e);
}
Obviously this smells of repeated code and doesn't look good. So let's do a simple encapsulation: wrap the logic in one method that every insert can call.
First, what arguments does this method need to pass in?
- maxInsertItemNumPerTime: no need to pass it in, because it is obviously a constant.
- list: the list of contents to insert must be passed in, and its type is not always HashMap; different tables correspond to different entity classes, so we need a generic type T.
- mapper: the testMassiveInsert(HashMap map) method obviously differs between table inserts; it is definitely not the same method each time, so it also needs to be passed in. Passing a method as a parameter means using lambda expressions and functional programming. If you have learned about functional interfaces, it is natural to think that a function with only input and no output should be passed in as a Consumer (and conversely a Supplier when there is only output; a Function when there are both inputs and outputs) — see the quick refresher after this list.
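A quick refresher on the three interfaces, as toy examples of my own (not from the original code):

import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

public class FunctionalRefresher {
    Consumer<String> printIt = s -> System.out.println(s); // input, no output
    Supplier<String> hello = () -> "hello";                // no input, just output
    Function<Integer, String> toLabel = i -> "value" + i;  // input and output
}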
So the final code abstraction should look like this:
public <T> void batchSplitInsert(List<T> list, Consumer<List<T>> insertFunc) {
    List<List<T>> all = new ArrayList<>();
    if (list.size() > maxInsertItemNumPerTime) {
        int i = 0;
        while (i < list.size()) {
            // Math.min guards the last chunk when the size is not an exact multiple
            List<T> subList = list.subList(i, Math.min(list.size(), i + maxInsertItemNumPerTime));
            i = i + maxInsertItemNumPerTime;
            all.add(subList);
        }
        all.parallelStream().forEach(insertFunc);
    } else {
        insertFunc.accept(list);
    }
}
This is how I now insert into the different tables:
List<TableDTO> needToInsert = ...; // fetch or build the data to insert
Consumer<List<TableDTO>> consumer = o -> mapper.insertTable(o);
batchSplitInsert(needToInsert, consumer);
Now everything is elegant! If you are new to Java or still learning it, you might as well start with simple lambda expressions; they really are a wonderfully convenient Java 8 feature.
That's it for this article. If you have any questions, please leave a comment and let's discuss.
If this article helped you, please give it a thumbs up, thanks! And if you have a better way to implement this, please do share it!