I'm Kite, the author behind the public account "ancient kite", a motivational programmer with both depth and breadth, a rural code farmer who meant to write poetry but ended up writing code! My articles are collected in JavaNewBee, which also includes a Java back-end knowledge map covering the path from beginner to expert.

If you want to study MySQL seriously, there is no getting around performance tuning. But when you roll up your sleeves and get ready to start, you suddenly realize there isn't much data in your local database. The production database has plenty, but who dares to experiment directly in production? The whole point is to optimize in advance.


But programmers never give up. If there's no data, we make data; we're good at creating objects, and a few million rows are no exception.

Use official data

The maintainers obviously knew we'd need some test data for practice, so they prepared a copy for us. It can be downloaded from https://github.com/datacharmer/test_db. The database contains about 300,000 employee records and 2.8 million salary entries, and the file is about 167 MB.


After downloading, run the SQL file directly using the MySQL client.

Or run the command and enter the password to import.

mysql -u root -p < employees.sql


This is the simplest method, as long as you can download the SQL file. However, the amount of data is not that large: the employee table has only 300,000 rows, far short of millions, and the fields are fixed, so they can't be customized.

The goal

Creating data at this scale has to meet two goals:

  1. Flexible customization — a table with only one or two fields isn't very meaningful.
  2. Speed — waiting hours or more for a few million rows is unacceptable.

The plan is to create two tables, a user table and an order table. They don't have as many fields as real-world tables, but they're enough for learning and testing.

The table structure of the two tables is as follows:

CREATE TABLE `user` (
  `id` varchar(36) NOT NULL,
  `user_name` varchar(12) DEFAULT NULL,
  `age` tinyint(3) DEFAULT NULL,
  `phone` varchar(11) DEFAULT NULL,
  `province` varchar(10) DEFAULT NULL,
  `city` varchar(10) DEFAULT NULL,
  `create_time` datetime DEFAULT NULL,
  `update_time` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The order table


CREATE TABLE `order` (
  `id` varchar(36) NOT NULL,
  `user_id` varchar(36) DEFAULT NULL,
  `product_count` int(11) DEFAULT NULL,
  `price` decimal(10,0) DEFAULT NULL,
  `create_time` datetime DEFAULT NULL,
  `update_time` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Create 5 million rows for the user table (user): a random age from 1 to 120, a random 11-digit phone number, random province and city codes, and random creation and update times within a certain range.

The order table is generated from the user table: each user gets 0 to 3 random orders, the order ID is a UUID, the product count is 1 to 5, the price is random, and the creation and update times are random within a certain range. Since each user generates 0 to 3 orders, the number of orders ends up above 5 million; when I ran it locally it was generally around 7.5 million.

The total creation time is directly related to the number of fields in the table and the algorithms used to generate them: the more fields and the more complex the generation, the longer it takes. For example, a UUID takes longer to generate than an auto-increment ID, and a random time takes longer than the current time.

If you only insert 5 million rows of an auto-increment ID field, it can be done in a dozen seconds, but that neither simulates an online environment nor helps you practice performance optimization.
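For reference, here is a rough sketch of the kind of random-field helpers the scripts below rely on. The names createPhone and randomTimestamp mirror the code later in the article, but the phone prefixes and the time window here are my own assumptions.

import random
import time

# Rough sketch of the random-field helpers referenced later (createPhone, randomTimestamp).
# The prefix list and the date range are assumptions, not the original script's values.
PHONE_PREFIXES = ["130", "135", "136", "138", "139", "150", "158", "186", "188"]

def createPhone():
    # 11-digit phone number: a common 3-digit prefix plus 8 random digits
    return random.choice(PHONE_PREFIXES) + "".join(str(random.randint(0, 9)) for _ in range(8))

def randomTimestamp(start="2019-01-01 00:00:00", end="2020-08-01 00:00:00"):
    # Random Unix timestamp between two dates, formatted later with time.strftime
    fmt = "%Y-%m-%d %H:%M:%S"
    start_ts = int(time.mktime(time.strptime(start, fmt)))
    end_ts = int(time.mktime(time.strptime(end, fmt)))
    return random.randint(start_ts, end_ts)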

Here are three ways to quickly create 5 million user records and more than 5 million order records.

Write a program to batch insert

As developers, when we need to create millions of rows, the first thing most of us reach for is a program, because CRUD is what we do best.

There are two ways to insert from a program. The first is to insert records one at a time, which is the most common approach in everyday development. Intuitively we might assume it's fast enough. It isn't: although it's far faster than inserting rows by hand, at the millions-of-rows scale it is still painfully slow no matter which database connection pool you use, and you'll most likely lose patience and kill the program after a while.

The second is to use MySQL's multi-row insert. As we all know, MySQL supports inserting multiple records in a single statement, in the following form.

insert into `table_name` (id,column1) values (1,'value1'),(2,'value2'),(3,'value3');

This is much faster than executing one statement per record: with 1,000 records per statement, 5 million rows need only 5,000 insert statements instead of 5 million.

Since the latter two methods use Python to generate files, this method is implemented in Python as well; example code is below. The full code is available on GitHub via the link at the end of this article.

def insert_data(self):
    # relies on: import time, uuid, random (see the full script on GitHub)
    cursor = self.conn.cursor()
    for x in range(5000):
        insert_user_sql = """
            insert into `user` (`id`, `user_name`, `phone`, `age`, `province`, `city`, `create_time`, `update_time`)
            values (%s, %s, %s, %s, %s, %s, %s, %s)
        """
        insert_order_sql = """
            insert into `order` (`id`, `product_count`, `user_id`, `price`, `create_time`, `update_time`)
            values (%s, %s, %s, %s, %s, %s)
        """
        user_values, order_values = [], []
        for i in range(1000):
            timestamp = self.randomTimestamp()
            time_local = time.localtime(timestamp)
            createTime = time.strftime("%Y-%m-%d %H:%M:%S", time_local)
            user_id = str(uuid.uuid4())
            user_values.append(
                (user_id, "Name" + str(x) + str(i), self.createPhone(), random.randint(1, 120),
                 str(random.randint(1, 26)),
                 str(random.randint(1, 1000)), createTime, createTime))

            random_order_count = random.randint(0, 3)
            if random_order_count > 0:
                for c in range(random_order_count):
                    timestamp = self.randomTimestamp()
                    time_local = time.localtime(timestamp)
                    order_create_time = time.strftime("%Y-%m-%d %H:%M:%S", time_local)
                    order_values.append((str(uuid.uuid4()), random.randint(1, 5), user_id,
                                         random.randint(10, 2000), order_create_time, order_create_time))

        # one batch insert and commit per 1,000 users (and their orders)
        cursor.executemany(insert_user_sql, user_values)
        cursor.executemany(insert_order_sql, order_values)
        self.conn.commit()

    cursor.close()


After a long wait, the run completed, taking 1,823 seconds, about 30 minutes.


Finally, 5 million user records and more than 7.49 million order records were successfully generated.

The speed is acceptable, I think, just so-so.


I also ran it with 5 threads, each thread inserting 1 million users. The slowest thread took 1,294 seconds, about 21 minutes, so not much faster. A different number of threads might affect the time somewhat, but I didn't try it.
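For anyone who wants to try the threaded variant, here is a minimal sketch of one worker. It is not the original script: the use of pymysql, the connection parameters, and the fixed timestamp are assumptions of mine, and the order rows are omitted for brevity.

from concurrent.futures import ThreadPoolExecutor
import random
import time
import uuid

import pymysql

# Sketch: 5 threads, each with its own connection, each inserting 1,000,000 users
# in batches of 1,000 (a shared connection is not thread-safe).
def worker(thread_no):
    conn = pymysql.connect(host="127.0.0.1", user="root", password="***",
                           database="test", charset="utf8mb4")
    cursor = conn.cursor()
    sql = ("insert into `user` (`id`,`user_name`,`phone`,`age`,`province`,`city`,`create_time`,`update_time`) "
           "values (%s,%s,%s,%s,%s,%s,%s,%s)")
    for batch in range(1000):                      # 1,000 batches x 1,000 rows per thread
        now = time.strftime("%Y-%m-%d %H:%M:%S")   # fixed "now" instead of a random time
        rows = [(str(uuid.uuid4()), "Name%d_%d_%d" % (thread_no, batch, i),
                 "138%08d" % random.randint(0, 99999999), random.randint(1, 120),
                 str(random.randint(1, 26)), str(random.randint(1, 1000)), now, now)
                for i in range(1000)]
        cursor.executemany(sql, rows)
        conn.commit()
    cursor.close()
    conn.close()

with ThreadPoolExecutor(max_workers=5) as pool:
    for f in [pool.submit(worker, n) for n in range(5)]:
        f.result()  # wait for completion and surface any exceptions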


Generating SQL scripts

This method is similar to the previous one, except that instead of executing the concatenated SQL statements directly from the program, it writes them to a file, again using the single-statement, multi-row insert form.

insert into `table_name` (id,column1) values (1,'value1'),(2,'value2'),(3,'value3');

Generating the SQL files for 5 million user rows plus the random order rows took 696 seconds, about 11 minutes.
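As a rough illustration of what that generation step looks like, here is a minimal sketch of my own; it uses a fixed timestamp and a simplistic phone number instead of the script's random helpers.

import random
import uuid

# Sketch: write multi-row INSERT statements to a file, 1,000 rows per statement.
with open("insert_user_500w.sql", "w") as f:
    for batch in range(5000):
        rows = []
        for i in range(1000):
            create_time = "2020-08-01 00:00:00"  # the real script uses a random time here
            rows.append("('%s','Name%d_%d','138%08d',%d,'%d','%d','%s','%s')" % (
                uuid.uuid4(), batch, i, random.randint(0, 99999999),
                random.randint(1, 120), random.randint(1, 26),
                random.randint(1, 1000), create_time, create_time))
        f.write("insert into `user` (`id`,`user_name`,`phone`,`age`,`province`,`city`,"
                "`create_time`,`update_time`) values %s;\n" % ",".join(rows))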


Of course, the script files spliced together from this much data are also very large: the user table script is about 680 MB and the order table script about 1 GB.


Finally, execute the two files separately in MySQL.

Execute the user table script, which takes about 3 minutes.

mysql -uroot -p mast_slave < sql/insert_user_500w.sql


It took about 7 minutes to execute the order table script, with more than 7.5 million orders.

mysql -uroot -p mast_slave < sql/insert_order_500w+.sql


The total is around 20 minutes, and with the manual steps in between, it doesn't feel any easier than the first method run with multiple threads.

The load data infile method

The final method uses load data infile, MySQL's fast way of importing data from a file, splitting each line into fields on a given delimiter.

In this example, I used comma-separated fields to generate 5 million user lines and random order lines.

The file is still generated using a Python script, which takes 779 seconds, or about 12 minutes.
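Here is a minimal sketch of generating the comma-separated user file, again my own illustration: the columns follow the user table's column order, and the timestamp is fixed rather than random.

import random
import uuid

# Sketch: one comma-separated line per user, columns in the same order as the
# user table definition, so no column list is needed in the load statement.
with open("load_user_txt_500w.txt", "w") as f:
    for i in range(5000000):
        create_time = "2020-08-01 00:00:00"  # the real script uses a random time here
        f.write("%s,Name%d,%d,138%08d,%d,%d,%s,%s\n" % (
            uuid.uuid4(), i, random.randint(1, 120), random.randint(0, 99999999),
            random.randint(1, 26), random.randint(1, 1000), create_time, create_time))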


The two files are about 560 MB and 900 MB respectively.


Finally, run the load data infile command to import the files into the corresponding tables. When you execute it, you may see the following error.

ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option so it cannot execute this statement

This is due to MySQL's own security configuration. You need to add the following setting to my.cnf and restart the service.

secure_file_priv=


Leaving the value after the equals sign empty allows files to be loaded from any directory. To restrict loading to a specific directory, put that directory after the equals sign.

Then execute the following statement to import user records into the User table.

load data infile '/Users/fengzheng/ knowledge management/Technical writing /mysql/ create test data/SQL /load_user_txt_500w.txt' replace into table `user` FIELDS TERMINATED BY ',';

Five million rows take 3 minutes and 32 seconds.


Then import the order records into the order table.

load data infile '/Users/fengzheng/ knowledge management/technical writing /mysql/ create test data/SQL /load_order_txt_500w+.txt' replace into table `order` FIELDS TERMINATED BY ',';

7.49 million records, 8 minutes and 31 seconds.


The whole process adds up to about 24 minutes.

Finally

Ok, now you can have fun doing all kinds of testing and tuning.

Some readers may say that 20-odd minutes still doesn't seem fast. That's because the data volume really is large, and the complexity of the data matters a great deal to import time. If you only import a single auto-increment ID column, never mind 5 million, even 10 million rows can be done in under a minute.

There is still a little room for optimization. For example, switching the storage engine to MyISAM speeds things up, especially batch inserts, but you have to switch back afterwards, which also takes time, and going back and forth is a hassle.

All of the methods above were implemented with Python scripts, but you can of course switch to a language you're familiar with, such as Java, or write shell scripts directly.

The scripts have been posted on GitHub; help yourself if you need them. Address:

Click to get the source code


Hold on a moment, friend: give it a like first. Freeloading all the time is bad for your health!

Public account “ancient kite”, Java developer, full stack engineer, bug killer, good at solving problems. A programmer with both depth and breadth of encouragement teacher, originally intended to write poetry but wrote up the code of rural code farmers! Stick to original dry goods output, you can choose to pay attention to me now, or read a historical article and then pay attention to it. Long press the QR code to follow, become excellent with me!