The premise

This article is not clickbait: through a simulated example, it analyzes how to optimize an Excel export of millions of rows of data.

The data query and export service the author maintains is a fairly old single-node application. After the last cloud migration it was expanded to a two-node deployment, but we found that large data exports frequently triggered Full GC, leaving the application stalled and unable to respond to external requests. For various reasons the service can only be allocated a maximum heap of 2GB, and the optimizations below are all based on this heap limit. Looking at the service configuration, logs, and APM data, two problems were identified:

  1. The startup script enabled the CMS collector. CMS is relatively sensitive to memory pressure, so exporting a large batch of data can instantly fill the old generation and cause frequent Full GCs.
  2. When data is exported, the entire target data set is queried first and then written to the stream, so a huge number of queried objects stay resident in heap memory and quickly fill the heap.

Problem 1 is relatively easy to deal with: remove the CMS parameters, switch to the Parallel collector with -XX:+UseParallelGC, and restart the service. The APM tool shows that the frequency of Full GC drops afterwards, but once the amount of exported data is very large (for example, the query result exceeds 1 million objects and outgrows the available heap), the service still gets stuck in endless Full GCs. In other words, adjusting JVM parameters is only a stopgap, so the rest of this article focuses on problem 2 and analyzes how to optimize it through a simulated case.
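For reference, the startup command after this change might look roughly like the following. This is a sketch only: the jar name and the extra flags beyond the 2GB heap limit and the collector switch are assumptions, not the service's actual configuration.

# Assumed example: fix the heap at the 2GB limit and use the Parallel collector
java -Xms2g -Xmx2g \
     -XX:+UseParallelGC \
     -XX:+HeapDumpOnOutOfMemoryError \
     -jar order-export-service.jar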

Some basic principles

If you use Java (or another JVM language) to implement a data export module, the following pseudocode is common:

dataExport(parameters, [OutputStream]) {
    1. Query the result set to be exported based on the parameters
    2. Serialize the result set into a byte sequence
    3. Write the byte sequence to the output stream
    4. Close the output stream
}

Here’s an example:

@Data
public static class Parameter{
    
    private OffsetDateTime paymentDateTimeStart;
    
    private OffsetDateTime paymentDateTimeEnd;
}

public void export(Parameter parameter, OutputStream os) throws IOException {
    List<OrderDTO> result =
            orderDao.query(parameter.getPaymentDateTimeStart(), parameter.getPaymentDateTimeEnd()).stream()
                    .map(order -> {
                        OrderDTO dto = new OrderDTO();
                        // ... populate the DTO fields from the order entity
                        return dto;
                    }).collect(Collectors.toList());
    byte[] bytes = toBytes(result);
    os.write(bytes);
    os.close();
}

Depending on the concrete OutputStream implementation, the data can end up in different kinds of targets. With a FileOutputStream, the data is exported to a file; with a SocketOutputStream, it is written to a network stream (the client reads the stream to implement a file download). File export in back-office (B-side) applications is usually implemented in the latter way: the client sends an export request, the server queries the data and writes it into the response output stream, and the client reads that stream to download the file.
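As a minimal sketch of the two kinds of targets, assuming the export(Parameter, OutputStream) method shown above (the file name and the surrounding parameter/response variables are illustrative):

// Export to a local file
try (OutputStream fileOut = new FileOutputStream("orders-export.xlsx")) {
    export(parameter, fileOut);
}

// Export as an HTTP download inside a controller method: write straight into the response stream
response.setHeader("Content-Disposition", "attachment;filename=orders-export.xlsx");
export(parameter, response.getOutputStream());

In both cases the export logic itself is identical; only the stream handed to it changes.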

To save server memory, data retrieval and transfer can be segmented: instead of querying the full result set in one go, query only part of the data in each round until all of it has been fetched, and write each batch of results into the OutputStream as it is retrieved.

Here we can borrow the idea of paginated queries, but because LIMIT offset, size becomes inefficient as the offset grows, and drawing on some earlier practice, we adopt an improved "scrolling pagination" approach (an idea from a former company's architecture team). Note that this solution requires the table to have a primary key that trends upward (such as an auto-increment id); a single query then looks like this:

SELECT * FROM tableX
WHERE id > #{lastBatchMaxId} [AND other conditions]
ORDER BY id [ASC|DESC]   -- ASC is generally used here
LIMIT ${size}

Putting the SQL above into the previous example, and assuming the order table uses an auto-increment integer primary key id, the code becomes:

public void export(Parameter parameter, OutputStream os) throws IOException {
    long lastBatchMaxId = 0L;
    for (;;) {
        List<Order> orders = orderDao.query([SELECT * FROM t_order WHERE id > #{lastBatchMaxId}
                AND payment_time >= #{parameter.paymentDateTimeStart}
                AND payment_time <= #{parameter.paymentDateTimeEnd}
                ORDER BY id ASC LIMIT ${LIMIT}]);
        if (orders.isEmpty()) {
            break;
        }
        List<OrderDTO> result = orders.stream()
                .map(order -> {
                    OrderDTO dto = new OrderDTO();
                    // ... populate the DTO fields from the order entity
                    return dto;
                }).collect(Collectors.toList());
        byte[] bytes = toBytes(result);
        os.write(bytes);
        os.flush();
        lastBatchMaxId = orders.stream().map(Order::getId).max(Long::compareTo).orElse(Long.MAX_VALUE);
    }
    os.close();
}

"The above example is the core idea of the optimization for exporting millions of rows to Excel." The query and the write to the output stream sit inside an infinite loop. Because the results are ordered by the auto-increment primary key, lastBatchMaxId holds the maximum id of the current batch, which becomes the starting id of the next batch; in effect the code scrolls forward through the table by id under the given query conditions. The loop exits when a query matches no more records and returns an empty list. The LIMIT value controls the size of each batch and can be chosen based on the memory actually allocated to the application and the size of each batch's data, so the number of objects kept in memory by a single request is bounded by LIMIT. This makes memory usage much more predictable and prevents concurrent exports from instantly exhausting the heap.
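As a rough, purely illustrative estimate: if each Order entity plus its OrderDTO occupies on the order of 1 KB of heap, then a LIMIT of 500 keeps roughly 500 KB of export objects alive per request at any moment, and even several concurrent exports stay in the low megabytes. Materializing a 1,000,000-row result set up front with the same per-row footprint would instead require on the order of 1 GB, which is exactly how the 2GB heap was being exhausted.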

The scrolling pagination used here is far more efficient than LIMIT offset, size once the offset becomes large, because each query seeks directly to the start of the batch via the primary key and reads only the final result set, whereas ordinary pagination with LIMIT offset, size must scan past (and discard) the first offset rows before returning the requested page.
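To make the difference concrete, here is a sketch of the two query styles for a batch deep inside the result set (table and column names as in the later example; the literal values are purely illustrative):

-- Ordinary deep pagination: the server must walk and discard the first 1,000,000 matching rows
SELECT * FROM t_order
WHERE payment_time >= '2020-07-01 00:00:00' AND payment_time <= '2020-07-16 00:00:00'
ORDER BY id ASC
LIMIT 1000000, 500;

-- Scrolling pagination: seek directly past the previous batch's maximum id (here assumed to be 1000000)
SELECT * FROM t_order
WHERE id > 1000000
  AND payment_time >= '2020-07-01 00:00:00' AND payment_time <= '2020-07-16 00:00:00'
ORDER BY id ASC
LIMIT 500;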

The simulation case

An application provides the function of querying orders and exporting records. The table design is as follows:

DROP TABLE IF EXISTS `t_order`;

CREATE TABLE `t_order`
(
    `id`           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT 'Primary key',
    `creator`      VARCHAR(16)     NOT NULL DEFAULT 'admin' COMMENT 'Creator',
    `editor`       VARCHAR(16)     NOT NULL DEFAULT 'admin' COMMENT 'Editor',
    `create_time`  DATETIME        NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
    `edit_time`    DATETIME        NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Modification time',
    `version`      BIGINT          NOT NULL DEFAULT 1 COMMENT 'Version number',
    `deleted`      TINYINT         NOT NULL DEFAULT 0 COMMENT 'Soft delete flag',
    `order_id`     VARCHAR(32)     NOT NULL COMMENT 'Order ID',
    `amount`       DECIMAL(10, 2)  NOT NULL DEFAULT 0 COMMENT 'Order amount',
    `payment_time` DATETIME        NOT NULL DEFAULT '1970-01-01 00:00:00' COMMENT 'Payment time',
    `order_status` TINYINT         NOT NULL DEFAULT 0 COMMENT 'Order status, 0: processing, 1: payment succeeded, 2: payment failed',
    UNIQUE uniq_order_id (`order_id`),
    INDEX idx_payment_time (`payment_time`)
) COMMENT 'Order table';

Now we need to export a batch of order data filtered by payment time. Based on this requirement, we first write a simple SpringBoot application. The Excel processing library used here is Alibaba's EasyExcel, and the main dependencies are as follows:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.18</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>2.2.6</version>
</dependency>

We simulate writing 2,000,000 (200W) rows of data; the test class that generates the data is as follows:

public class OrderServiceTest {

    private static final Random OR = new Random();
    private static final Random AR = new Random();
    private static final Random DR = new Random();
    @Test
    public void testGenerateTestOrderSql() throws Exception {
        HikariConfig config = new HikariConfig();
        config.setUsername("root");
        config.setPassword("root");
        config.setJdbcUrl("jdbc:mysql://localhost:3306/local?serverTimezone=Asia/Shanghai&useUnicode=true&characterEncoding=UTF-8&zeroDateTimeBehavior=convertToNull&useSSL=false");
        config.setDriverClassName(Driver.class.getName());
        HikariDataSource hikariDataSource = new HikariDataSource(config);
        JdbcTemplate jdbcTemplate = new JdbcTemplate(hikariDataSource);
        for (int d = 0; d < 100; d++) {
            String item = "('%s','%d','2020-07-%d 00:00:00','%d')";
            StringBuilder sql = new StringBuilder("INSERT INTO t_order(order_id,amount,payment_time,order_status) VALUES ");
            for (int i = 0; i < 20_000; i++) {
                sql.append(String.format(item, UUID.randomUUID().toString().replace("-", ""),
                        AR.nextInt(100000) + 1, DR.nextInt(31) + 1, OR.nextInt(3))).append(",");
            }
            jdbcTemplate.update(sql.substring(0, sql.lastIndexOf(",")));
        }
        hikariDataSource.close();
    }
}

Write DAO class OrderDao based on JdbcTemplate:

@RequiredArgsConstructor
@Repository
public class OrderDao {

    private final JdbcTemplate jdbcTemplate;
    public List<Order> queryByScrollingPagination(long lastBatchMaxId,
                                                  int limit,
                                                  LocalDateTime paymentDateTimeStart,
                                                  LocalDateTime paymentDateTimeEnd) {
        return jdbcTemplate.query("SELECT * FROM t_order WHERE id > ? AND payment_time >= ? AND payment_time <= ? " +
                        "ORDER BY id ASC LIMIT ?",
                p -> {
                    p.setLong(1, lastBatchMaxId);
                    p.setTimestamp(2, Timestamp.valueOf(paymentDateTimeStart));
                    p.setTimestamp(3, Timestamp.valueOf(paymentDateTimeEnd));
                    p.setInt(4, limit);
                },
                rs -> {
                    List<Order> orders = new ArrayList<>();
                    while (rs.next()) {
                        Order order = new Order();
                        order.setId(rs.getLong("id"));
                        order.setCreator(rs.getString("creator"));
                        order.setEditor(rs.getString("editor"));
                        order.setCreateTime(OffsetDateTime.ofInstant(rs.getTimestamp("create_time").toInstant(), ZoneId.systemDefault()));
                        order.setEditTime(OffsetDateTime.ofInstant(rs.getTimestamp("edit_time").toInstant(), ZoneId.systemDefault()));
                        order.setVersion(rs.getLong("version"));
                        order.setDeleted(rs.getInt("deleted"));
                        order.setOrderId(rs.getString("order_id"));
                        order.setAmount(rs.getBigDecimal("amount"));
                        order.setPaymentTime(OffsetDateTime.ofInstant(rs.getTimestamp("payment_time").toInstant(), ZoneId.systemDefault()));
                        order.setOrderStatus(rs.getInt("order_status"));
                        orders.add(order);
                    }
                    return orders;
                });
    }
}

Write the service class OrderService:

@Data
public class OrderDTO {

    @ExcelIgnore
    private Long id;
  @ExcelProperty(value = "Order Number", order = 1)  private String orderId;  @ExcelProperty(value = "The amount", order = 2)  private BigDecimal amount;  @ExcelProperty(value = "Time to Pay", order = 3)  private String paymentTime;  @ExcelProperty(value = "Order status", order = 4)  private String orderStatus; }  @Service @RequiredArgsConstructor public class OrderService {   private final OrderDao orderDao;   private static final DateTimeFormatter F = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");   public List<OrderDTO> queryByScrollingPagination(String paymentDateTimeStart,  String paymentDateTimeEnd,  long lastBatchMaxId,  int limit) {  LocalDateTime start = LocalDateTime.parse(paymentDateTimeStart, F);  LocalDateTime end = LocalDateTime.parse(paymentDateTimeEnd, F);  return orderDao.queryByScrollingPagination(lastBatchMaxId, limit, start, end).stream().map(order -> {  OrderDTO dto = new OrderDTO();  dto.setId(order.getId());  dto.setAmount(order.getAmount());  dto.setOrderId(order.getOrderId());  dto.setPaymentTime(order.getPaymentTime().format(F));  dto.setOrderStatus(OrderStatus.fromStatus(order.getOrderStatus()).getDescription());  return dto;  }).collect(Collectors.toList());  } } Copy the code

Finally, write the controller OrderController:

@RequiredArgsConstructor
@RestController
@RequestMapping(path = "/order")
public class OrderController {

    private final OrderService orderService;

    @GetMapping(path = "/export")
    public void export(@RequestParam(name = "paymentDateTimeStart") String paymentDateTimeStart,
                       @RequestParam(name = "paymentDateTimeEnd") String paymentDateTimeEnd,
                       HttpServletResponse response) throws Exception {
        String fileName = URLEncoder.encode(String.format("%s-(%s).xlsx", "Order Payment Data", UUID.randomUUID().toString()),
                StandardCharsets.UTF_8.toString());
        response.setContentType("application/force-download");
        response.setHeader("Content-Disposition", "attachment; filename=" + fileName);
        ExcelWriter writer = new ExcelWriterBuilder()
                .autoCloseStream(true)
                .excelType(ExcelTypeEnum.XLSX)
                .file(response.getOutputStream())
                .head(OrderDTO.class)
                .build();
        // A single XLSX sheet holds at most about 1.04 million rows (1,048,576); beyond that, the data must be split across sheets
        WriteSheet writeSheet = new WriteSheet();
        writeSheet.setSheetName("target");
        long lastBatchMaxId = 0L;
        int limit = 500;
        for (;;) {
            List<OrderDTO> list = orderService.queryByScrollingPagination(paymentDateTimeStart, paymentDateTimeEnd, lastBatchMaxId, limit);
            if (list.isEmpty()) {
                writer.finish();
                break;
            } else {
                lastBatchMaxId = list.stream().map(OrderDTO::getId).max(Long::compareTo).orElse(Long.MAX_VALUE);
                writer.write(list, writeSheet);
            }
        }
    }
}

For convenience, some of the business logic is written in the controller layer here; this is not a good coding habit and should not be copied. After adding the configuration and the main class, start the application and request:

http://localhost:10086/order/export?paymentDateTimeStart=2020-07-01 00:00:00&paymentDateTimeEnd=2020-07-16 00:00:00
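Since the date parameters contain spaces, it is easiest to let the HTTP client URL-encode them; for example (a sketch, assuming the application is running locally on port 10086 as configured above):

curl -G "http://localhost:10086/order/export" \
     --data-urlencode "paymentDateTimeStart=2020-07-01 00:00:00" \
     --data-urlencode "paymentDateTimeEnd=2020-07-16 00:00:00" \
     -o order-payment-data.xlsx

The service then logs the time taken by the export: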

Data export time: 29733 ms, start: 2020-07-01 00:00:00, end: 2020-07-16 00:00:00

After the export completes successfully, we get a file with 1,031,540 rows (including the header row):

Summary

This article has taken a detailed look at performance optimization for large data exports, focusing on memory. Under the constraint of using as little memory as possible, the scheme exports large volumes of data with acceptable efficiency. The solution is reusable, and the same design idea can be applied to other fields and scenarios beyond data export.

The repository for the demo project:

  • Github: https://github.com/zjcscut/spring-boot-guide/tree/master/ch10086-excel-export.

(C-2-D E-A-20200711 20:27 PM)