The original address www.cyberlight.xyz/passage/dat…

This note tests the performance of big data queries for Mysql, Pandas, and Python lists.

In the hands of a database table of more than 720,000 rows, take this opportunity to test the data query performance of the three, finally solve the inner question.

 

Test environment :Ubuntu 20.04LTS, Python 3.8.2, Intel® Core™ I7-8750H CPU @ 2.20GHz x 12

The original data has 8 columns, 724100 rows, and the SQL file size is 65.5 MB

The eighth column in the data table is time. The following three methods are used to record the time of each method in descending order of time (each method is tested for 3 times and the average value is taken).

 

A, Mysql,

1. Test performance by calling Mysql from Python

The following is the test source code

import pandas as pd import mysql.connector import datetime def connect(): Mydb = mysql.connector. Connect (host="127.0.0.1", user="root", passwd=" SDDDDDDDD ", database="abc" ) return mydb def sql(): mydb = connect() mycursor = mydb.cursor() start = datetime.datetime.now() mycursor.execute("SELECT * FROM My_table ORDER Data_sql = mycursor.fetchall() print(end-start) data_sql = mycursor.fetchall() mycursor.close() mydb.close() return data_sql def main(): data_sql = sql() if __name__ == "__main__": main()Copy the code

Test results are shown in figure

The average value of the three times is 0.652s

This is just the performance of a Python call to Mysql. Is this different from the real performance of Mysql? To avoid Python errors, I then tested the performance of a set of queries directly with Mysql.

2, test performance, directly through Mysql query

Enter the following command to sort the query

SELECT * FROM My_table ORDER BY time DESCCopy the code

The time obtained from the first test is shown below

The test times were 0.668s, 0.664s and 0.702s respectively, with an average of 0.678s

As a result, Python calls to Mysql and queries directly from Mysql have negligible performance.

 

Second, the Pandas

The following is the test code for Pandas

import pandas as pd import mysql.connector import datetime def connect(): Mydb = mysql.connector. Connect (host="127.0.0.1", user="root", passwd=" SDDDDDDDD ", database="abc" ) return mydb def pa(): mydb = connect() mycursor = mydb.cursor() mycursor.execute("SELECT * FROM eb7_today") data_sql = mycursor.fetchall() data = pd.DataFrame(data_sql, columns=['1', '2', '3', '4', '5', '6', '7', 'time']) start = datetime.datetime.now() data.sort_values('time', ascending=False, End = datetime.datetime.now() print(end-start) # mycursor.close() mydb.close() return data def main(): data = pa() if __name__ == "__main__": main()Copy the code

The following are the test results

The average value of the three times is 0.433s

 

Python lists

Here is the test source for the Python list

import pandas as pd import mysql.connector import datetime def connect(): Mydb = mysql.connector. Connect (host="127.0.0.1", user="root", passwd=" SDDDDDDDD ", database="abc" ) return mydb def py(): mydb = connect() mycursor = mydb.cursor() mycursor.execute("SELECT * FROM My_table") data_sql = mycursor.fetchall() start = datetime.datetime.now() data_sql.sort(key=lambda x:x[7], Reverse =True) # end = datetime.datetime.now() print(end-start) # mycursor.close() mydb.close() return data_sql def main(): data_sql = py() if __name__ == "__main__": main()Copy the code

The test results are as follows

The average value of the three times is 0.064s

 

Four,

1. The performance of calling Mysql in Python is almost the same as that of using Mysql directly, and is negligible.

2. Big data query performance Python list > Pandas > Mysql

The query time is 0.652s for Mysql, 0.433s for Pandas, and 0.064s for Python list

 

Python big data analysis performance is very strong

Pandas is written for USE in NumPy, and its internal implementation is written for USE in C. (about 5 times stronger) hope to know the friends message, thank you! Progress together ~