• High-speed inserts with MySQL
  • Originally by Benjamin Morel
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: Prince Stuart
  • Proofreader: GJXAIOU, QinRoc

When you need to bulk-insert millions of records into a MySQL database, you quickly realize that sending INSERT statements one at a time is not a viable option.

There are some INSERT optimization tips worth reading in the MySQL documentation.

In this article, I will outline two techniques for efficiently loading data into a MySQL database.

LOAD DATA INFILE

If you’re looking for raw performance, this is indisputably your best option. LOAD DATA INFILE is a highly optimized, MySQL-specific statement that inserts data directly into a table from a CSV or TSV file.

There are two ways to use LOAD DATA INFILE. You can copy the data file to the server’s data directory (usually /var/lib/mysql-files/) and run:

LOAD DATA INFILE '/path/to/products.csv' INTO TABLE products;

This approach is quite cumbersome because you need to access the server’s file system, set the proper permissions for the data files, and so on.

The good news is that you can also store data files on the client and use the LOCAL keyword:

LOAD DATA LOCAL INFILE '/path/to/products.csv' INTO TABLE products;

In this case, the file is read from the client’s file system, transparently copied to the server’s temporary directory, and imported from there. All in all, this is almost as fast as loading the file directly from the server’s file system, though you need to make sure the server has this option enabled.
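Depending on your MySQL version, LOCAL loading is gated by the local_infile system variable on the server (the client must allow it as well, e.g. via the mysql command-line client’s --local-infile=1 flag). A minimal sketch, assuming you have the required privileges:

-- Check whether the server currently allows LOAD DATA LOCAL INFILE
SHOW GLOBAL VARIABLES LIKE 'local_infile';

-- Enable it at runtime (add local_infile=1 to my.cnf to persist it)
SET GLOBAL local_infile = 1;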

LOAD DATA INFILE has many options, mostly related to the structure of the data file (field delimiters, enclosures, and so on). See the documentation for the full list.
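As an illustration of those options (this file layout is an assumption, not from the original benchmark), a comma-separated file with a header row and double-quoted fields could be loaded like this:

LOAD DATA LOCAL INFILE '/path/to/products.csv'
INTO TABLE products
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row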

LOAD DATA INFILE is the best option from a performance standpoint, but it requires exporting the data to a delimited text file first. If you don’t already have such a file, you’ll spend extra resources creating it, and it may add some complexity to your application. Fortunately, there is an alternative.

Extended Inserts

A typical INSERT SQL statement looks like this:

INSERT INTO user (id, name) VALUES (1, 'Ben');

An extended INSERT aggregates multiple rows into a single query:

INSERT INTO user (id, name) VALUES (1, 'Ben'), (2, 'Bob');

The key is to find the optimal number of records to insert in each statement. There is no one-size-fits-all number, so you need to benchmark your data sample to find the maximum performance benefit or the best tradeoff between memory usage and performance.

To take full advantage of extended INSERTs, we also recommend:

  • Using prepared statements
  • Running the statements inside a transaction (see the sketch below)
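As a rough sketch of both recommendations using MySQL’s server-side prepared statements (client libraries such as PDO expose the same idea more conveniently; the user table and two-row batch are just for illustration):

START TRANSACTION;

-- Prepare one extended INSERT with a fixed number of placeholders
PREPARE ins FROM 'INSERT INTO user (id, name) VALUES (?, ?), (?, ?)';

-- Bind one batch of values and execute; repeat for each batch
SET @id1 = 1, @name1 = 'Ben', @id2 = 2, @name2 = 'Bob';
EXECUTE ins USING @id1, @name1, @id2, @name2;

DEALLOCATE PREPARE ins;
COMMIT;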

The benchmark

I inserted 1.2 million records, each made up of six columns of mixed types, about 26 bytes per record on average. I tested two common configurations:

  • The client and server communicate over a UNIX socket on the same machine
  • The client and server are on different machines and communicate over a gigabit network with very low latency (less than 0.1 ms)

As a baseline for comparison, I copied the table using INSERT … SELECT, which achieves a throughput of 313,000 inserts per second.
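For reference, that baseline is simply a single-statement table copy along these lines (the table names are hypothetical):

-- Copy every row of an existing table in one statement
INSERT INTO user_copy SELECT * FROM user;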

LOAD DATA INFILE

To my surprise, LOAD DATA INFILE proved to be faster than copying tables:

  • LOAD DATA INFILE: 377,000 inserts per second
  • LOAD DATA LOCAL INFILE over the network: 322,000 inserts per second

The difference between these two numbers appears to be directly related to the time it takes to transfer the data from client to server: the data file is 53 MB, and the 543 ms gap between the two benchmarks corresponds to a transfer speed of 780 Mbps, close to gigabit speed.

This means that, most likely, the MySQL server does not start processing the file until it has been fully transferred: the insert speed is therefore directly tied to the bandwidth between client and server, which is important to consider if they are not on the same machine.

Extended Inserts

To test the insert speed, I used BulkInserter, a PHP class that is part of an open-source library I wrote, inserting up to 10,000 records per query.

As the results show, insert speed rises rapidly as the number of inserts per query increases. Compared with inserting rows one at a time, we achieved a 6× performance gain on localhost and a 17× gain over the network:

  • On localhost, inserts per second rose from 40,000 to 247,000
  • Over the network, inserts per second rose from 12,000 to 201,000

Both setups need around 1,000 inserts per query to reach maximum throughput, but 40 inserts per query is already enough to get 90% of that throughput on localhost, which is probably a good tradeoff. Note also that after the peak, performance actually degrades as the number of inserts per query keeps growing.

The benefits of extended INSERTs are even more pronounced over a network connection, because sequential insert speed is bounded by network latency:

max sequential inserts per second ~= 1000 / ping in milliseconds

The higher the latency between client and server, the more you gain from extended INSERTs. For example, at 0.1 ms of round-trip latency, this formula caps sequential single-row inserts at roughly 10,000 per second (1000 / 0.1), in line with the 12,000 per second measured above on a sub-0.1 ms network.

Conclusion

Not surprisingly, LOAD DATA INFILE is the preferred solution for maximum performance on a single connection. It requires the file to be in the right format; if you have to generate the file and/or transfer it to the database server first, be sure to account for that time when measuring your insert speed.

Extended INSERTs, on the other hand, need no temporary text file and can reach about 65% of LOAD DATA INFILE’s throughput, which is a very respectable insert speed. Interestingly, aggregating multiple inserts into a single query consistently performs better, whether over the network or on localhost.

If you decide to use extended INSERTs, be sure to test with a sample of your production data and a few different batch sizes to find the best values.

Be careful when increasing the number of inserts per query, as doing so may require you to:

  • Allocate more memory on the client side
  • Increase the max_allowed_packet setting on the MySQL server (see the example below)
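For example, a one-off way to raise the packet size (the 64 MB value here is an arbitrary illustration; pick one that fits your batch size):

-- Raise the maximum packet size for new connections to 64 MB;
-- set max_allowed_packet=64M in my.cnf to make it persistent
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;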

Finally, it’s worth mentioning that, according to Percona, you can get even better performance using concurrent connections, partitioning, and multiple buffer pools. See their blog post for more information.

The benchmarks were run on a bare-metal server with CentOS 7 and MySQL 5.7, a Xeon E3 @ 3.8 GHz, 32 GB of RAM, and an NVMe SSD. The MySQL benchmark tables use the InnoDB storage engine.

The source code for the benchmark is available as a gist, and the results chart is on plot.ly.
