Because ORM frameworks are now so mature, many developers have only a weak grasp of JDBC. Yet under the hood, an ORM framework still operates the database through JDBC.
JDBC (Java Database Connectivity) is the Java API for connecting to and operating databases, i.e. using the Java language to work with a database.
SUN defined a standard set of APIs for database access, along with the corresponding protocol standards for connecting to a database; each vendor then provides its own implementation of those APIs for accessing its database according to the specification.
01 MySQL JDBC Large-Volume Data Operations
This article takes large-volume data operations as its topic and introduces the related knowledge points through requirements that come up during development:
- Migrating data
- Exporting data
- Batch-processing data
Generally speaking, in a Java web application, anything from hundreds of thousands to tens of millions of rows can be called a large data volume; beyond that, processing it in a Java web application is no longer really appropriate.
For example, suppose a business system needs to read 5 million (500W) rows from a MySQL database for processing. What are the options?
- Regular query: read all 5 million rows into JVM memory at once, or read them page by page
- Stream query: establish a long-lived connection, use a server-side cursor, and load one row at a time into JVM memory
- Cursor query: similar to streaming, but use the fetchSize parameter to control how many rows are read per fetch
1.1 General Query
By default, the complete result set is retrieved and stored in memory. In most cases this is the most efficient way to operate and, because of the design of the MySQL network protocol, the easiest to implement.
With 5 million rows in a single table, nobody would load them all into memory at once; the usual approach is pagination:
```java
@SneakyThrows
@Override
public void pageQuery() {
    @Cleanup Connection conn = dataSource.getConnection();
    @Cleanup Statement stmt = conn.createStatement();
    long start = System.currentTimeMillis();
    long offset = 0;
    int size = 100;
    while (true) {
        String sql = String.format("SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE LIMIT %s, %s", offset, size);
        @Cleanup ResultSet rs = stmt.executeQuery(sql);
        long count = loopResultSet(rs);
        if (count == 0) break;
        offset += size;
    }
    log.info("paginated query time :: {}", System.currentTimeMillis() - start);
}
```
The above approach is simple, but without optimizing for deep LIMIT pagination the online database server will be brought to its knees, and you might wait days for the data to come back.
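One common optimization for deep paging, sketched below under the assumption that YOU_TABLE has an auto-increment primary key `id` (the original article does not state this), is keyset ("seek") pagination: instead of `LIMIT offset, size`, remember the last seen id and seek past it, so MySQL never has to scan and discard the skipped rows.

```java
// Minimal sketch of keyset ("seek") pagination SQL generation.
// Assumption: YOU_TABLE has an auto-increment primary key `id` (hypothetical).
public class KeysetPager {
    /** Builds the next-page query: seek past the last seen id rather than using OFFSET. */
    public static String buildKeysetSql(long lastId, int size) {
        return "SELECT id, COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE"
                + " WHERE id > " + lastId
                + " ORDER BY id LIMIT " + size;
    }

    public static void main(String[] args) {
        // First page starts from id 0; each subsequent page passes the max id just read.
        System.out.println(buildKeysetSql(0, 100));
    }
}
```

The loop in `pageQuery` would then track the largest `id` of each batch instead of incrementing `offset`, keeping every page a cheap index seek regardless of depth.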
1.2 Streaming Query
If you are working with a ResultSet that has so many rows that it cannot fit in the JVM heap, you can tell the driver to stream the results back one row at a time.
Streaming queries have one caveat: all rows in the result set must be read (or the result set closed) before any other query can be issued on the same connection, otherwise an exception is thrown.
With streaming queries, if you want to maintain concurrent access to the tables referenced by the statement producing the result set, the result set must be processed as quickly as possible, because the query monopolizes the connection.
```java
@SneakyThrows
public void streamQuery() {
    @Cleanup Connection conn = dataSource.getConnection();
    @Cleanup Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);
    long start = System.currentTimeMillis();
    @Cleanup ResultSet rs = stmt.executeQuery("SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE");
    loopResultSet(rs);
    log.info("stream query time :: {}", (System.currentTimeMillis() - start) / 1000);
}
```
Streaming query against a table with 5 million rows, single-call time consumption: ≈ 6s
1.3 Cursor Query
In Spring Boot 2.x the default connection pool is Hikari and the connection object is a HikariProxyConnection, so the following cast-based cursor setting is not feasible:
```java
((JDBC4Connection) conn).setUseCursorFetch(true);
```
Instead, append &useCursorFetch=true to the database connection URL. Then set how many rows the Statement reads per fetch, for example 1000:
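For example, in a Spring Boot application the parameter can be appended to the datasource URL (the host and schema names below are placeholders):

```properties
# application.properties -- host/schema are placeholders
spring.datasource.url=jdbc:mysql://localhost:3306/you_db?useCursorFetch=true
```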
```java
@SneakyThrows
public void cursorQuery() {
    @Cleanup Connection conn = dataSource.getConnection();
    @Cleanup Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(1000);
    long start = System.currentTimeMillis();
    @Cleanup ResultSet rs = stmt.executeQuery("SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE");
    loopResultSet(rs);
    log.info("cursor query time :: {}", (System.currentTimeMillis() - start) / 1000);
}
```
Cursor query against a table with 5 million rows, single-call time consumption: ≈ 18s
1.4 JDBC RowData
The loopResultSet method used above simply runs a while loop over the result set; the core of regular, streaming, and cursor queries alike is the next method:
```java
@SneakyThrows
private Long loopResultSet(ResultSet rs) {
    long count = 0;
    while (rs.next()) {
        // business operation
        count++;
    }
    return count;
}
```
The logic of ResultSet.next() lives in the implementation class ResultSetImpl, which fetches the next row from a RowData object on each call. RowData is an interface; its implementation class diagram is shown below.
By default, a ResultSet uses a RowDataStatic instance. When a RowDataStatic object is created, all records in the result set are read into memory, and next() then reads them from memory one by one.
RowDataCursor fetches in batches and caches them internally; the flow is as follows:
- It first checks whether its internal buffer still holds unreturned rows, and if so returns the next one
- If the buffer is exhausted, it issues a new request to MySQL Server to read the next fetchSize rows
- It caches the results in the internal buffer and returns the first row
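The steps above can be sketched as follows. This is NOT the real Connector/J code, just a simplified model of the buffer-then-fetch loop; the fake "server side" queue stands in for the round trip that asks MySQL Server for the next fetchSize rows.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simplified sketch of the RowDataCursor fetch loop (not the actual driver code).
public class CursorBufferSketch {
    private final Deque<String> buffer = new ArrayDeque<>(); // client-side cache
    private final Deque<String> server;                      // stand-in for MySQL Server rows
    private final int fetchSize;

    CursorBufferSketch(List<String> serverRows, int fetchSize) {
        this.server = new ArrayDeque<>(serverRows);
        this.fetchSize = fetchSize;
    }

    /** Returns the next row, or null when both buffer and server are drained. */
    String next() {
        if (buffer.isEmpty()) {                 // 1. any unreturned rows left in the buffer?
            for (int i = 0; i < fetchSize && !server.isEmpty(); i++) {
                buffer.add(server.poll());      // 2. "request" the next fetchSize rows
            }
        }
        return buffer.poll();                   // 3. return the first buffered row
    }
}
```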
When streaming is enabled, a ResultSet uses a RowDataDynamic object, whose next() initiates an I/O read of a single row each time it is called.
To summarize: the default RowDataStatic reads all data into client memory, i.e. our JVM; RowDataCursor reads fetchSize rows at a time and issues the next request only after those are consumed; RowDataDynamic reads one row per I/O call.
1.5 JDBC Communication Mechanism
(1) Regular query
JDBC interacts with MySQL Server over a socket. In network-programming terms, MySQL can be treated as a SocketServer, so a complete request link is:
JDBC client -> client Socket -> MySQL -> data retrieval -> MySQL kernel Socket Buffer -> network -> client kernel Socket Buffer -> JDBC client
When querying a large amount of data, the JVM may be overwhelmed, for the following reasons:
- MySQL Server writes the retrieved SQL result set to its kernel Socket Buffer via the output stream
- The kernel buffer sends the data back over the TCP link initiated by JDBC, where it first lands in the kernel buffer of the machine the JDBC client runs on
- When JDBC initiates an SQL operation, the program blocks on the input stream's read; when data arrives in the buffer, it wakes up and reads the buffer data into JVM memory
- MySQL Server keeps sending data, and JDBC keeps reading buffer data into Java memory. Although the data already sits inside the JDBC program, the execute call does not return until all the corresponding data has been read
- The downside is obvious: if the query volume is too large, GC runs constantly and memory eventually runs out
(2) Cursor query
As described above, cursors solve the memory-overflow problem that regular queries hit with large data volumes, but they raise a question:
MySQL does not know when the client has finished consuming, so what happens if another connection issues a DML write against the same table?
In fact, when we run a cursor query, MySQL creates a temporary space to hold the data to be read, so there is no conflict with DML write operations.
But cursor queries can cause the following:
- IOPS spikes, because the data to be returned must be written into the temporary space, causing heavy I/O reads and writes. This can cause write jitter for other services
- Disk space spikes, because the data written to the temporary space lives outside the original table. If the table is too large, in extreme cases the database disk may fill up, while network output remains unchanged. The data in the temporary space is reclaimed by MySQL when the read completes or the client calls ResultSet#close
- When the JDBC client initiates the SQL query, it may wait a long time for a response; during this period the server is preparing the data. Note that this wait differs in principle from a regular query's wait: the former is continuously reading data from the network buffer while not responding to the business layer, whereas the latter is MySQL preparing the temporary data space while not responding to JDBC
- After the data is prepared and transmission begins, the network response starts to surge, and IOPS shifts from "read/write" to "read"
Cursor queries are also communication-inefficient: after the client consumes fetchSize rows, it must send a new request to the server. And during the database's preparation phase, IOPS is very high and a large amount of disk space and performance is consumed.
(3) Streaming query
When the client has established a connection with MySQL Server and completed the query interaction, MySQL Server returns the SQL result set through the output stream, i.e. writes the data to the Socket Buffer of its local kernel. The kernel data is then sent back over the TCP link to the kernel buffer of the machine where the JDBC client resides
- JDBC reads the kernel buffer data with the input stream's read method. Because streaming is enabled, the business program receives one row at a time
- MySQL Server keeps streaming data toward the JDBC client's kernel until the client's Socket Buffer is full, at which point MySQL Server blocks
- For the JDBC client, each read fetches data from its own machine's kernel buffer, so it is fast, and there is generally no need to worry about the local kernel running out of data to consume (unless the client consumes faster than MySQL Server can send, for instance when the client does no business logic and discards the data as soon as it is fetched; in that case the client waits on the server)
Streaming may seem better than the cursor approach, but things are often not as simple as they appear:
- Streaming affects the database for a longer time than a cursor query
- Streaming queries also depend on the network, which can lead to network congestion
02 Streaming and Cursor Query Memory Analysis
Table data volume: 500W
Memory viewing tool: jvisualvm, bundled with the JDK
JVM parameters: -Xmx512m -Xms512m
2.1 Memory Usage for a single call
The streaming memory performance report is as follows
The cursor query memory performance report is as follows
Judging by memory usage, both cursor queries and streaming queries protect well against OOM
2.2 Memory usage of concurrent Calls
Concurrent calls: JMeter fires 10 threads concurrently within one second
The streaming memory performance report is as follows
Memory usage is also fine under concurrent calls, with no stacking up
Average time consumption of concurrent streaming-query calls: ≈ 55s
The cursor query memory performance report is as follows
Average time consumption of concurrent cursor-query calls: ≈ 83s
Due to equipment limitations, and because some scenarios only occur under extreme conditions, no verification was done across production and test environments; interested readers can test it themselves.
03 How to Use Streaming Query in MyBatis
As an ORM framework, MyBatis also encapsulates streaming queries.
The ResultHandler interface contains only a handleResult method, which receives the converted Java entity class one object at a time:
```java
@Slf4j
@Service
public class MyBatisStreamService {
    @Resource
    private MyBatisStreamMapper myBatisStreamMapper;

    public void mybatisStreamQuery() {
        long start = System.currentTimeMillis();
        myBatisStreamMapper.mybatisStreamQuery(new ResultHandler<YOU_TABLE_DO>() {
            @Override
            public void handleResult(ResultContext<? extends YOU_TABLE_DO> resultContext) {
                // business operation
            }
        });
        log.info("MyBatis query time :: {}", System.currentTimeMillis() - start);
    }
}
```
In addition to the annotation-based usage below, you can also use .xml files:
```java
@Mapper
public interface MyBatisStreamMapper {
    @Options(resultSetType = ResultSetType.FORWARD_ONLY, fetchSize = Integer.MIN_VALUE)
    @ResultType(YOU_TABLE_DO.class)
    @Select("SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE")
    void mybatisStreamQuery(ResultHandler<YOU_TABLE_DO> handler);
}
```
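A sketch of the equivalent .xml mapping follows; the namespace is a hypothetical package name, and the resultType uses the placeholder entity from this article. The fetchSize value -2147483648 is Integer.MIN_VALUE written out as a literal.

```xml
<!-- Sketch only: namespace and resultType follow this article's placeholders -->
<mapper namespace="com.example.MyBatisStreamMapper">
    <select id="mybatisStreamQuery"
            resultType="YOU_TABLE_DO"
            resultSetType="FORWARD_ONLY"
            fetchSize="-2147483648">
        SELECT COLUMN_A, COLUMN_B, COLUMN_C FROM YOU_TABLE
    </select>
</mapper>
```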
MyBatis streaming query call time consumption: ≈ 18s
Comparing JDBC native streaming with MyBatis-encapsulated streaming reads:
- MyBatis is noticeably slower than native streaming, but considering the layers it wraps underneath, the performance is acceptable
- In terms of memory, the two fluctuate almost identically
- MyBatis is more convenient than native JDBC, since it encapsulates features such as the callback function and object mapping
Which one to use can be decided by the project's actual situation; there is no best, only the most suitable.
04 Summary
Streaming queries and cursor queries both avoid OOM and are worth considering for large data volumes. However, both occupy a database connection that is not released while in use, so for online businesses with large data volumes, the concurrency of cursor and streaming operations must be controlled.
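One simple way to enforce that concurrency control, sketched below with a hypothetical gate class, is a java.util.concurrent.Semaphore that caps how many streaming/cursor queries run at once, since each one holds its connection until the result set is fully consumed:

```java
import java.util.concurrent.Semaphore;

// Hedged sketch: cap concurrent streaming/cursor queries, since each one
// holds a database connection until its result set is fully consumed.
public class StreamQueryGate {
    private final Semaphore permits;

    public StreamQueryGate(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public void run(Runnable streamQuery) {
        permits.acquireUninterruptibly(); // block while too many queries are in flight
        try {
            streamQuery.run();            // e.g. the streamQuery()/cursorQuery() calls above
        } finally {
            permits.release();
        }
    }
}
```

Callers wrap each large query in gate.run(...), so at most maxConcurrent connections are tied up by long-running reads at any moment.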
In addition, MyBatis wraps JDBC's native streaming query; although it is a little slower, the functionality and the cleanliness of the code are much better.