Because ORM frameworks are now so mature and widely used, many developers have only a hazy grasp of JDBC. Yet under the hood, an ORM framework still operates the database through JDBC

JDBC (Java DataBase Connectivity) is the Java database connectivity API: it lets you operate a database using the Java language

Sun defined a standard set of APIs for database access, along with the corresponding protocol standards for connecting to databases; each vendor then implements these interfaces for its own database according to the specification

01 MySQL JDBC Large-Data Operations

This whole article takes large-data-volume operations as its theme and introduces the related knowledge points through requirements that come up during development

  • Migrating data
  • Exporting data
  • Batch-processing data

Generally speaking, in a Java web application, anywhere from hundreds of thousands to tens of millions of rows can be called a large data volume, and a Java web application is not especially well suited to processing it

For example, suppose a business system needs to read 5 million rows from a MySQL database for processing. What are the options?

  • Regular query: read all 5 million rows into JVM memory at once, or read them page by page
  • Streaming query: establish a long-lived connection, use a server-side cursor, and load one row at a time into JVM memory
  • Cursor query: similar to streaming, but use the fetchSize parameter to control how many rows are read per round trip

1.1 General Query

By default, the complete result set is retrieved and stored in memory. In most cases this is the most efficient way to operate, and because of the design of the MySQL network protocol it is also the easiest to implement

With 5 million rows in a single table, nobody would load them into memory in one go; normally you paginate

    @SneakyThrows
    @Override
    public void pageQuery() {
        @Cleanup Connection conn = dataSource.getConnection();
        @Cleanup Statement stmt = conn.createStatement();
        long start = System.currentTimeMillis();
        long offset = 0;
        int size = 100;
        while (true) {
            String sql = String.format("SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE LIMIT %s, %s", offset, size);
            @Cleanup ResultSet rs = stmt.executeQuery(sql);
            long count = loopResultSet(rs);
            if (count == 0) break;
            offset += size;
        }
        log.info("paginated query time :: {}", System.currentTimeMillis() - start);
    }

The approach above is simple, but without optimizing the deep LIMIT offset, later pages force MySQL to scan and discard all preceding rows; the online database server will grind to a halt, or you can wait a few days for your data
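One common fix for deep paging is keyset ("seek") pagination: instead of a growing OFFSET, filter on the last primary key seen, so each page is a bounded index range scan. A minimal sketch, assuming an auto-increment primary key named ID (table and column names here are placeholders, as elsewhere in this article):

```java
// Sketch of keyset ("seek") pagination, assuming an auto-increment PK named ID.
// With "LIMIT offset, size" MySQL scans and discards `offset` rows per page;
// filtering on the last consumed ID keeps the per-page cost constant.
public class KeysetPaging {

    // Builds the SQL for the page that follows `lastId`.
    static String nextPageSql(long lastId, int size) {
        return "SELECT ID, COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE"
                + " WHERE ID > " + lastId
                + " ORDER BY ID LIMIT " + size;
    }
}
```

The calling loop then tracks the largest ID of each page and feeds it into the next call, stopping when a page comes back with fewer than `size` rows.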

1.2 Streaming Query

If a ResultSet has a very large number of rows and the JVM cannot allocate the heap space it would require, you can tell the driver to stream the result set back one row at a time

Streaming queries have one caveat: all rows in the result set must be read (or the result set closed) before any further query can be issued on the same connection, or an exception will be thrown

With streaming queries, you must also watch concurrent access to the tables referenced by the statement, because the query monopolizes the connection until the result set is fully read; process it as quickly as possible

    @SneakyThrows
    public void streamQuery() {
        @Cleanup Connection conn = dataSource.getConnection();
        @Cleanup Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);
        long start = System.currentTimeMillis();
        @Cleanup ResultSet rs = stmt.executeQuery("SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE");
        loopResultSet(rs);
        log.info("stream query time :: {}", (System.currentTimeMillis() - start) / 1000);
    }

Streaming query, table with 5 million rows, single-call time: ≈ 6s

1.3 Cursor Query

In Spring Boot 2.x the default connection pool is Hikari, and the connection object is a HikariProxyConnection, so the following cursor setting does not work:

((JDBC4Connection) conn).setUseCursorFetch(true);

Instead, append &useCursorFetch=true to the database connection URL. Then set the number of rows the Statement reads per fetch, for example 1000
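For example, with Spring Boot the flag can go on the datasource URL (a sketch; host, port, and schema below are placeholders):

```properties
# Hypothetical Spring Boot datasource settings; host, port and schema are placeholders
spring.datasource.url=jdbc:mysql://localhost:3306/you_db?useCursorFetch=true
```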

    @SneakyThrows
    public void cursorQuery() {
        @Cleanup Connection conn = dataSource.getConnection();
        @Cleanup Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(1000);
        long start = System.currentTimeMillis();
        @Cleanup ResultSet rs = stmt.executeQuery("SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE");
        loopResultSet(rs);
        log.info("cursor query time :: {}", (System.currentTimeMillis() - start) / 1000);
    }

Cursor query, table with 5 million rows, single-call time: ≈ 18s

1.4 JDBC RowData

The method loopResultSet was used above; only a while loop runs inside it. The core of regular, streaming, and cursor queries alike is the next() method

    @SneakyThrows
    private Long loopResultSet(ResultSet rs) {
        while (rs.next()) {
            // business logic
        }
        return xx;
    }

The logic of ResultSet.next() lives in the implementation class ResultSetImpl, which fetches the next row from RowData on each call. RowData is an interface; its implementation hierarchy is shown in the diagram below

By default, a ResultSet uses a RowDataStatic instance. When the RowDataStatic object is created, all records in the result set are read into memory, and next() then reads them from memory one by one

RowDataCursor fetches in batches and buffers them internally; a call proceeds as follows:

  • First it checks whether its internal buffer still holds unreturned rows; if so, it returns the next row
  • If the buffer is exhausted, it issues a new request to MySQL Server for the next fetchSize rows
  • It caches those results in the internal buffer, then returns the first row
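The three steps above can be modeled with a toy buffer, purely to illustrate the batching behavior (this is a simplification for illustration, not the driver's actual code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Toy model of RowDataCursor-style batching (an illustrative sketch, not driver code).
// Rows are pulled from a "server" iterator fetchSize at a time into a buffer;
// next() serves rows from the buffer and triggers a new round trip when it empties.
public class BatchingCursor<T> {
    private final Iterator<T> server;   // stands in for MySQL Server
    private final int fetchSize;
    private final Deque<T> buffer = new ArrayDeque<>();
    private int roundTrips = 0;         // how many "requests" were sent

    public BatchingCursor(Iterator<T> server, int fetchSize) {
        this.server = server;
        this.fetchSize = fetchSize;
    }

    public T next() {
        if (buffer.isEmpty()) {         // buffer exhausted: fetch the next batch
            roundTrips++;
            for (int i = 0; i < fetchSize && server.hasNext(); i++) {
                buffer.addLast(server.next());  // cache the batch
            }
        }
        return buffer.pollFirst();      // serve from the buffer (null when drained)
    }

    public int roundTrips() { return roundTrips; }
}
```

Reading 10 rows with fetchSize = 4 costs three round trips, which is exactly why cursor queries trade memory for extra requests.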

With streaming, a ResultSet uses a RowDataDynamic object; each call to next() issues an IO read for a single row of data

To summarize: the default RowDataStatic reads all data into client memory, i.e. our JVM; RowDataCursor reads fetchSize rows at a time and issues the next request once they are consumed; RowDataDynamic reads one row per IO call

1.5 JDBC Communication Mechanism

(1) General query

JDBC interacts with MySQL Server through a socket; in network-programming terms, MySQL can be treated as a SocketServer, so a complete request path is:

JDBC client -> client socket -> MySQL -> retrieve data -> MySQL kernel socket buffer -> network -> client kernel socket buffer -> JDBC client

When querying a large amount of data, the JVM may choke for the following reasons:

  • MySQL Server writes the retrieved SQL result set to its kernel socket buffer via the output stream
  • The kernel buffer sends the data back over the TCP link that JDBC initiated; the data first lands in the kernel buffer of the machine where the JDBC client runs
  • After JDBC issues the SQL, the program blocks on the input stream's read operation and wakes up to copy buffer data into JVM memory once data arrives
  • MySQL Server keeps sending data, and JDBC keeps reading buffer data into Java memory. Although the data now sits in the JDBC program, the execute call does not return until all of the corresponding data has been read
  • The downside is obvious: if the query volume is too large, you trigger GC after GC and then run out of memory

(2) Cursor query

As described above, cursors solve the out-of-memory problem that regular queries hit on large data volumes. But MySQL has no way of knowing when the client will finish consuming the result, and in the meantime another connection may issue DML writes against the same table. How is that handled?

In fact, when we run a cursor query, MySQL creates a temporary space to hold the data to be read, so there is no conflict with DML write operations

But cursor queries cause the following problems:

  • IOPS spikes, because the data to be returned must first be written into the temporary space, producing a large volume of I/O reads and writes. This can cause write jitter for other services
  • Disk usage spikes, because the data written to the temporary space lies outside the original table; if the table is large enough, extreme cases can fill the database disk while network output stays flat. MySQL reclaims the temporary space once the read completes or the client calls ResultSet#close
  • When the JDBC client issues the SQL, it may wait a long time for a response, during which the server is preparing data. A regular query also waits, but for a different reason in principle: there the client is continuously draining data from the network buffer without responding to the business layer, while here MySQL is preparing the temporary data space without responding to JDBC
  • After the data is prepared, network traffic surges during transfer, and the IOPS pattern changes from "read and write" to "read only"

Cursor queries also communicate inefficiently: after the client consumes fetchSize rows, it must send another request to the server. And during the preparation phase the database sustains very high IOPS and burns a large amount of disk space and performance

(3) Streaming query

When the client and MySQL Server establish a connection and exchange a query, MySQL Server returns the SQL result set through its output stream, that is, it writes the data into the socket buffer of its local kernel; the kernel then sends that data back over the TCP link into the kernel buffer on the JDBC client's machine

  • JDBC reads the kernel buffer with the input stream's read method; since streaming is enabled, the business program receives one row at a time
  • MySQL Server keeps streaming data toward the client kernel until the client's socket buffer is full, at which point MySQL Server blocks
  • For the JDBC client, each read comes from its local kernel buffer, so it is fast; in general there is no need to worry about the local kernel running out of data to consume (unless the client does no business logic at all and discards each row the moment it gets it, consuming faster than the server can send)

Streaming may look better than the cursor approach, but things are rarely as simple as they appear:

  • Streaming affects the database for longer than a cursor query does
  • Streaming also depends on the network, and can therefore cause network congestion

02 Streaming vs. Cursor Memory Analysis

Table size: 5 million rows

Memory inspection tool: the JDK's bundled JVisualVM

JVM parameters: -Xmx512m -Xms512m

2.1 Memory Usage for a single call

The streaming memory performance report is as follows

The cursor query memory performance report is as follows

In terms of memory usage, both cursor queries and streaming queries protect well against OOM

2.2 Memory Usage of Concurrent Calls

Concurrent calls: JMeter fires 10 threads concurrently within one second

The streaming memory performance report is as follows

Memory usage is also fine under concurrent calls, with no superimposed growth

Streaming query, average time per concurrent call: ≈ 55s

The cursor query memory performance report is as follows

Cursor query, average time per concurrent call: ≈ 83s

Due to hardware limits, and because some cases occur only under extreme conditions, these results were not verified across both production and test environments; interested readers can test for themselves

03 How to use streaming query in MyBatis

As an ORM framework, MyBatis also encapsulates streaming queries

The ResultHandler interface contains only the handleResult method, which receives each converted Java entity object

    @Slf4j
    @Service
    public class MyBatisStreamService {
        @Resource
        private MyBatisStreamMapper myBatisStreamMapper;

        public void mybatisStreamQuery() {
            long start = System.currentTimeMillis();
            myBatisStreamMapper.mybatisStreamQuery(new ResultHandler<YOU_TABLE_DO>() {
                @Override
                public void handleResult(ResultContext<? extends YOU_TABLE_DO> resultContext) {
                    // business logic
                }
            });
            log.info("MyBatis query time :: {}", System.currentTimeMillis() - start);
        }
    }

Besides the annotation style below, you can also configure this in an .xml mapper file

@Mapper
public interface MyBatisStreamMapper {
    @Options(resultSetType = ResultSetType.FORWARD_ONLY, fetchSize = Integer.MIN_VALUE)
    @ResultType(YOU_TABLE_DO.class)
    @Select("SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE")
    void mybatisStreamQuery(ResultHandler<YOU_TABLE_DO> handler);
}
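A roughly equivalent .xml mapping might look like the following (a sketch; the namespace and result type are placeholders, and -2147483648 is Integer.MIN_VALUE):

```xml
<!-- Hypothetical mapper XML; namespace and resultType are placeholders -->
<mapper namespace="com.example.MyBatisStreamMapper">
    <select id="mybatisStreamQuery"
            resultSetType="FORWARD_ONLY"
            fetchSize="-2147483648"
            resultType="com.example.YOU_TABLE_DO">
        SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE
    </select>
</mapper>
```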

MyBatis streaming query call time: ≈ 18s

Comparing JDBC's native streaming with MyBatis's encapsulated streaming reads:

  • MyBatis is noticeably slower than native streaming, but given the layers it wraps, the performance is acceptable
  • In terms of memory footprint, the two fluctuate almost identically
  • MyBatis is more convenient than raw JDBC because it encapsulates features such as the callback function and object mapping

Which to use can be decided by the realities of the project; there is no best option, only the most suitable one

Final Thoughts

Both streaming and cursor queries avoid OOM and are worth considering for large data volumes. However, both occupy the database connection for the duration of use and do not release it, so for online business with large data volumes, concurrency must be controlled when using cursor or streaming operations

In addition, MyBatis wraps JDBC's native streaming query; although it is somewhat slower, the functionality and the cleanliness of the code are much better