Requirements
The project has an Excel import requirement: importing payment records.
Users fill data from other systems into an Excel template provided by our system. The application reads, validates, and converts the file content, then generates arrears records, bills, and bill details and stores them in the database.
Before I took over, the import volume was small, so the original author probably did not pay much attention to efficiency. In version 4.0, however, I expect imports on the order of 100,000 (10W+) Excel rows, with more than 3N rows inserted into the database; that is, 100,000 Excel rows produce at least 300,000 database rows. Optimizing the original import code was therefore imperative. I analyzed and optimized the import code step by step until it completed within 100 seconds. (The final performance bottleneck was the database itself: the test server had only 4 GB of memory and hosted not just the database but also many microservice applications, so its processing power was limited.) The detailed process follows. At each step, the problems affecting performance are listed and solutions are given.
Excel import is still a very common requirement, and my optimization approach may not be optimal. Readers are welcome to share better ideas in the comments.
Some of the details
- Data import: the template is provided by the system, in XLSX format (supporting 65,535+ rows of data); users fill the corresponding columns according to the table header
- Data verification: there are two types of verification (a minimal sketch of the in-memory checks follows this list):
  - Field length and regular-expression checks. These run entirely in memory with no external data access, so their impact on performance is small
  - Duplicate checks against existing data, such as whether a bill number already exists in the system (these require querying the database, which does affect performance)
- Data insertion: the test environment uses MySQL 5.7, without database or table sharding, and Druid as the connection pool
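A minimal sketch of the in-memory checks mentioned above; the field names and rules are hypothetical, not the project's actual ones:

import java.util.regex.Pattern;

public class RowValidator {

    // Hypothetical rules: room number is 1-32 alphanumeric characters or dashes,
    // amount is a number with at most two decimal places
    private static final Pattern ROOM_NUM = Pattern.compile("^[A-Za-z0-9-]{1,32}$");
    private static final Pattern AMOUNT = Pattern.compile("^\\d{1,10}(\\.\\d{1,2})?$");

    /**
     * Pure in-memory validation: no database or network access, so the per-row cost is tiny.
     * Returns an error message, or null if the row passes.
     */
    public static String validate(String roomNum, String amount) {
        if (roomNum == null || !ROOM_NUM.matcher(roomNum).matches()) {
            return "Invalid room number: " + roomNum;
        }
        if (amount == null || !AMOUNT.matcher(amount).matches()) {
            return "Invalid amount: " + amount;
        }
        return null;
    }
}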
Iteration record
First edition: POI + row-by-row query validation + row-by-row insertion
This oldest version uses native POI to read the Excel file and manually map each row to an object, collecting them into a List. The code performs the following steps:
- Manually read Excel into a List
- Loop through the list, performing the following steps for each row:
  - Check field lengths
  - Run verifications that require querying the database, for example checking that the house referenced by the current arrears row exists in the system, which means querying the house table
  - Write the current row to the database
- Return the execution result; if an error occurs or verification fails, display a prompt message and roll back the data
Obviously, this implementation was rushed out, and subsequent usage volumes were probably small enough that no performance issues were noticeable; it only works acceptably for tens of rows. It has the following obvious problems (a rough sketch of the flow follows the list):
- The database is queried once for each row of data; with n rows, the application performs n network round trips to the database, so the time grows linearly with n
- Data is also written row by row, with the same problem as above
- Data reading uses native POI, so the code is very redundant and hard to maintain
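A rough sketch of that row-by-row flow; the mapper, entity and accessor names are hypothetical and only illustrate the two round trips per row:

// One SELECT for validation plus one INSERT per Excel row: two network round trips per row
void importRowByRow(List<ImportRow> rows) {
    for (ImportRow row : rows) {
        if (row.getRoomNum() == null || row.getRoomNum().length() > 32) {   // in-memory check
            throw new IllegalArgumentException("Invalid room number at row " + row.getRowIndex());
        }
        Long houseId = houseMapper.findHouseId(                             // one query per row
                row.getArea(), row.getBuilding(), row.getUnit(), row.getRoomNum());
        if (houseId == null) {
            throw new IllegalArgumentException("House not found at row " + row.getRowIndex());
        }
        arrearsMapper.insert(row.toArrears(houseId));                       // one insert per row
    }
}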
Second edition: EasyPOI + cached database queries + batch inserts
To address the three problems identified in the first edition, the following three optimizations are applied.
Cache data, trading space for time
The cost of querying the database row by row lies mainly in the network I/O round trips, so the optimization is simple: cache all the data needed for validation in a HashMap and look it up there directly.
For example, to verify that the house in a row exists, the house table is queried by area + building + unit + room number to find the house ID; if an ID is found, verification succeeds and the house ID is stored in the generated arrears bill. House data is not modified during the import, and a single community does not have many houses (fewer than 5,000), so one SQL query loads all houses of the community into a HashMap, with area/building/unit/room number as the key and the house ID as the value. Subsequent verification only needs to hit the HashMap.
Custom SessionMapper
MyBatis does not natively support writing query results directly into a HashMap, so a custom SessionMapper is required.
The SessionMapper uses a MapResultHandler to process the result set of the SQL query:
@Repository
public class SessionMapper extends SqlSessionDaoSupport {

    @Resource
    public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
        super.setSqlSessionFactory(sqlSessionFactory);
    }

    // Key: area + building + unit + room number, value: house ID
    public Map<String, Long> getHouseMapByAreaId(Long areaId) {
        MapResultHandler handler = new MapResultHandler();
        this.getSqlSession().select(BaseUnitMapper.class.getName() + ".getHouseMapByAreaId",
                areaId, handler);
        Map<String, Long> map = handler.getMappedResults();
        return map;
    }
}
MapResultHandler, the handler that puts the result set into a HashMap:
public class MapResultHandler implements ResultHandler {

    private final Map mappedResults = new HashMap();

    @Override
    public void handleResult(ResultContext context) {
        @SuppressWarnings("rawtypes")
        Map map = (Map) context.getResultObject();
        mappedResults.put(map.get("key"), map.get("value"));
    }

    public Map getMappedResults() {
        return mappedResults;
    }
}
Sample Mapper
@Mapper
@Repository
public interface BaseUnitMapper {
    // Key: area + building + unit + room number, value: house ID
    Map<String, Long> getHouseMapByAreaId(@Param("areaId") Long areaId);
}
Sample Mapper XML
<select id="getHouseMapByAreaId" resultMap="mapResultLong">
SELECT
CONCAT( h.bulid_area_name, h.build_name, h.unit_name, h.house_num ) k,
h.house_id v
FROM
base_house h
WHERE
h.area_id = #{areaId}
GROUP BY
h.house_id
</select>
<resultMap id="mapResultLong" type="java.util.HashMap">
<result property="key" column="k" javaType="string" jdbcType="VARCHAR"/>
<result property="value" column="v" javaType="long" jdbcType="INTEGER"/>
</resultMap>
Then call the corresponding method on the SessionMapper in your code.
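For example, building the cache once and hitting it inside the row loop (the row accessors and error list are illustrative):

// Build the cache once per import, before looping over the Excel rows
Map<String, Long> houseMap = sessionMapper.getHouseMapByAreaId(areaId);

// Inside the row loop the existence check becomes a pure in-memory lookup;
// the key matches the CONCAT(...) in the SQL above
String key = row.getAreaName() + row.getBuildName() + row.getUnitName() + row.getHouseNum();
Long houseId = houseMap.get(key);
if (houseId == null) {
    errors.add("House not found at row " + row.getRowIndex());
}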
Batch insert using values
The MySQL INSERT statement supports the values (),(),() syntax for inserting multiple rows at a time, and the corresponding SQL can be generated dynamically with MyBatis's foreach element:
<insert id="insertList">
insert into table(colom1, colom2)
values
<foreach collection="list" item="item" index="index" separator=",">
( #{item.colom1}, #{item.colom2})
</foreach>
</insert>
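On the Java side the mapper method simply takes the list; a sketch, with the entity type Fee as a placeholder inferred from the feeList variable used later (a single List parameter without @Param is exposed to foreach as "list"):

@Mapper
public interface ArrearageMapper {
    // Executes the insertList statement above as a single multi-row INSERT
    int insertList(List<Fee> list);
}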
Use EasyPOI to read and write Excel
EasyPOI provides annotation-based import and export: changing how an Excel column is read only requires changing an annotation, which is very convenient and keeps the code easy to maintain.
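A minimal sketch of such an annotated import model, assuming the cn.afterturn EasyPOI dependency (the column names are illustrative, and the annotation package may differ between EasyPOI versions):

import cn.afterturn.easypoi.excel.annotation.Excel;

public class ArrearsImportRow {

    @Excel(name = "Room number")
    private String roomNum;

    @Excel(name = "Amount owed")
    private String amount;

    // Renaming or reordering an Excel column only requires touching the annotations
    public String getRoomNum() { return roomNum; }
    public void setRoomNum(String roomNum) { this.roomNum = roomNum; }
    public String getAmount() { return amount; }
    public void setAmount(String amount) { this.amount = amount; }
}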
Third edition: EasyExcel + cached database queries + batch inserts
After adopting EasyPOI in the second edition, thousands or even tens of thousands of rows of Excel data could be imported easily, although it took a while (about 10 minutes to write 50,000 rows to the database). Since imports were mostly run by the developer while watching the logs, no further optimization was done. But the good times did not last: a new community moved in whose bill Excel had 410,000 rows. Reading it with EasyPOI in the development environment went straight to OOM; after increasing the JVM memory parameters there was no more OOM, but CPU usage sat at 100% for 20 minutes and the data still had not been read completely. So reading large Excel files had to be optimized. Dig deep into POI tuning myself? No need to panic; a look around GitHub for other open-source projects brought Alibaba's EasyExcel into view:
Emmm, I thought this was made for me! Give it a try. EasyExcel uses annotations similar to EasyPOI for reading and writing Excel, so switching from EasyPOI is easy and can be done in minutes. And it lives up to Alibaba's description: reading the 41W-row, 25-column, 45.5 MB file takes about 50 s on average. EasyExcel is therefore recommended for reading large Excel files.
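A minimal sketch of the EasyExcel reading style (the model, column names and batch handling are illustrative; the listener-based API avoids holding the whole file in memory):

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.annotation.ExcelProperty;
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;
import java.util.ArrayList;
import java.util.List;

public class EasyExcelReadSketch {

    // EasyExcel maps columns via @ExcelProperty, much like EasyPOI's annotations
    public static class ArrearsRow {
        @ExcelProperty("Room number")
        private String roomNum;
        @ExcelProperty("Amount owed")
        private String amount;

        public String getRoomNum() { return roomNum; }
        public void setRoomNum(String roomNum) { this.roomNum = roomNum; }
        public String getAmount() { return amount; }
        public void setAmount(String amount) { this.amount = amount; }
    }

    public static class ArrearsListener extends AnalysisEventListener<ArrearsRow> {
        private final List<ArrearsRow> buffer = new ArrayList<>(1000);

        @Override
        public void invoke(ArrearsRow row, AnalysisContext context) {
            buffer.add(row);                // called once per row
            if (buffer.size() >= 1000) {
                // hand the batch over to validation/insertion here, then free the memory
                buffer.clear();
            }
        }

        @Override
        public void doAfterAllAnalysed(AnalysisContext context) {
            // process whatever is left in the buffer
        }
    }

    public static void main(String[] args) {
        EasyExcel.read("arrears.xlsx", ArrearsRow.class, new ArrearsListener())
                .sheet()
                .doRead();
    }
}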
Fourth edition: Optimize data insertion speed
In the second edition I had already switched from row-by-row inserts to values batch inserts, concatenating a long SQL and inserting 30,000 rows at a time, sequentially. This was by far the most time-consuming part of the whole import. Later I reduced the batch size to 10,000, 5,000, 3,000, 1,000 and 500 rows at a time and found 1,000 to be the fastest. My guess is that an overly long SQL exceeded a memory threshold related to innodb_buffer_pool_size during the write, causing disk swapping that limited the speed; the database on the test server was not powerful and could not handle too many inserts at once. So I settled on 1,000 rows per insert.
With 1,000 rows per insert, the database CPU is still not saturated, and the time the application spends waiting on network I/O can be put to use with multi-threading. The simplest form of multi-threading is a parallel stream, so I tested the code with parallel streams:
100,000 Excel rows, producing 420,000 arrears records, 420,000 bill detail records and 20,000 bill records, inserted with 16 threads in parallel at 1,000 rows per statement: insertion took 72 s and the whole import took 95 s.
Parallel insert utility class
I encapsulated the parallel insertion code in a functional-style utility class, shared here as well:
/**
 * Parallel insert utility class
 *
 * @author Keats
 * @date 2020/7/1 9:25
 */
public class InsertConsumer {

    /** Number of rows handled per batch */
    private final static int SIZE = 1000;

    /** To adjust the concurrency, change the second argument of the call below */
    static {
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");
    }

    /**
     * Insert method
     *
     * @param list     the data to insert
     * @param consumer the consumer method, called once per batch of SIZE rows
     * @param <T>      element type
     */
    public static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
        if (list == null || list.size() < 1) {
            return;
        }
        List<List<T>> streamList = new ArrayList<>();
        for (int i = 0; i < list.size(); i += SIZE) {
            int j = Math.min((i + SIZE), list.size());
            List<T> subList = list.subList(i, j);
            streamList.add(subList);
        }
        // A parallel stream uses as many threads as there are CPU cores by default.
        // This cannot be changed locally; the global property above affects all parallel
        // streams, so change it with care.
        streamList.parallelStream().forEach(consumer);
    }
}
It uses a number of Java 8 APIs; if they are unfamiliar, see my earlier blog posts about Java 8. The method is simple to use:
InsertConsumer.insertData(feeList, arrearageMapper::insertList);
Other items that affect performance
Logging
Avoid printing too many INFO logs in the for loop
During tuning I found something else that particularly affected performance: INFO logs. The test still used the 41W-row, 25-column, 45.5 MB file. An INFO log was printed every 1,000 rows between the start and the end of reading, and while caching the verification data 3+ INFO logs were printed for every row; the logs are printed and persisted to disk. Below is the difference in efficiency between printing these logs and not printing them.
Screenshot: timing with logs printed
Screenshot: timing without logs printed
At first I thought I had picked the wrong Excel file, so I picked it again; the result was the same.
Caching and verifying the data completed in only one tenth of the time it took with log printing enabled.
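A small illustration of the fix, assuming SLF4J: log every N rows behind a level guard instead of printing several INFO lines per row.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ImportProgressLog {

    private static final Logger log = LoggerFactory.getLogger(ImportProgressLog.class);

    void logProgress(int rowIndex) {
        // One log line per 1,000 rows, and no message construction when INFO is disabled
        if (rowIndex % 1000 == 0 && log.isInfoEnabled()) {
            log.info("Imported {} rows", rowIndex);
        }
    }
}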
Conclusion
Methods to improve Excel import speed:
- Use faster Excel reading framework (Ali EasyExcel is recommended)
- For validation that requires interaction with the database, use the cache appropriately according to the business logic. Trade space for time
- Concatenate long SQL with values(),(),() to insert multiple rows at a time
- Use multithreading to insert data, take advantage of network IO wait time (parallel stream is recommended, easy to use)
- Avoid printing useless logs in loops