The original address www.cyberlight.xyz/passage/dat…
This note tests big-data query performance for MySQL, Pandas, and Python lists.
With a database table of more than 720,000 rows at hand, I took the opportunity to test the query performance of all three and finally settle a question that had been on my mind.
Test environment: Ubuntu 20.04 LTS, Python 3.8.2, Intel® Core™ i7-8750H CPU @ 2.20GHz × 12
The original data has 8 columns and 724,100 rows; the SQL dump file is 65.5 MB.
The eighth column of the table is a time column. Each of the three methods below sorts the data in descending order by this column and the elapsed time is recorded (each method is run 3 times and the average is taken).
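For reference, here is a minimal sketch of this run-three-times-and-average approach; the time_it helper and the toy workload are illustrative stand-ins, not code from the original tests:

import datetime


def time_it(func, runs=3):
    """Run func `runs` times and return the average elapsed time."""
    total = datetime.timedelta()
    for _ in range(runs):
        start = datetime.datetime.now()
        func()
        end = datetime.datetime.now()
        total += end - start
    return total / runs


if __name__ == "__main__":
    sample = list(range(1_000_000, 0, -1))    # toy workload, not the real table
    avg = time_it(lambda: sorted(sample))     # sorted() returns a new list each run
    print("average:", avg.total_seconds(), "s")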
I. MySQL
1. Testing performance by calling MySQL from Python
The following is the test source code
import mysql.connector
import datetime


def connect():
    # connection details as in the original article (password masked)
    mydb = mysql.connector.connect(
        host="127.0.0.1",
        user="root",
        passwd="SDDDDDDDD",
        database="abc"
    )
    return mydb


def sql():
    mydb = connect()
    mycursor = mydb.cursor()
    start = datetime.datetime.now()
    # let MySQL sort by the time column in descending order
    mycursor.execute("SELECT * FROM My_table ORDER BY time DESC")
    data_sql = mycursor.fetchall()
    # end is taken after fetchall(), so transferring the rows to Python is
    # included in the measurement (the original listing was garbled here)
    end = datetime.datetime.now()
    print(end - start)
    mycursor.close()
    mydb.close()
    return data_sql


def main():
    data_sql = sql()


if __name__ == "__main__":
    main()
Across the three runs, the average was 0.652 s.
This only measures MySQL as called from Python. Is that different from MySQL's native performance? To rule out any overhead or error introduced on the Python side, I then ran the same sorted query directly in MySQL.
2. Testing performance by querying MySQL directly
Enter the following statement to run the sorted query:
SELECT * FROM My_table ORDER BY time DESC;
The three runs took 0.668 s, 0.664 s, and 0.702 s respectively, for an average of 0.678 s.
So the performance difference between calling MySQL from Python and querying MySQL directly is negligible.
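For anyone who wants to see where the Python-side time goes, here is a hedged sketch (not part of the original runs) that splits the measurement into cursor.execute() and fetchall(); depending on cursor buffering the split is only approximate, and the connection parameters simply mirror the masked ones in the listings above:

import datetime
import mysql.connector


def split_timing():
    mydb = mysql.connector.connect(host="127.0.0.1", user="root",
                                   passwd="SDDDDDDDD", database="abc")
    cur = mydb.cursor()

    t0 = datetime.datetime.now()
    cur.execute("SELECT * FROM My_table ORDER BY time DESC")
    t1 = datetime.datetime.now()
    rows = cur.fetchall()
    t2 = datetime.datetime.now()

    print("execute :", (t1 - t0).total_seconds(), "s")
    print("fetchall:", (t2 - t1).total_seconds(), "s")

    cur.close()
    mydb.close()
    return rows


if __name__ == "__main__":
    split_timing()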
II. Pandas
The following is the test code for Pandas
import pandas as pd
import mysql.connector
import datetime


def connect():
    # connection details as in the original article (password masked)
    mydb = mysql.connector.connect(
        host="127.0.0.1",
        user="root",
        passwd="SDDDDDDDD",
        database="abc"
    )
    return mydb


def pa():
    mydb = connect()
    mycursor = mydb.cursor()
    mycursor.execute("SELECT * FROM eb7_today")
    data_sql = mycursor.fetchall()
    data = pd.DataFrame(data_sql,
                        columns=['1', '2', '3', '4', '5', '6', '7', 'time'])
    start = datetime.datetime.now()
    # sort the DataFrame by the time column in descending order
    # (the original listing was truncated here; inplace=True is assumed)
    data.sort_values('time', ascending=False, inplace=True)
    end = datetime.datetime.now()
    print(end - start)
    mycursor.close()
    mydb.close()
    return data


def main():
    data = pa()


if __name__ == "__main__":
    main()
Across the three runs, the average was 0.433 s.
III. Python lists
Here is the test source for the Python list
import mysql.connector
import datetime


def connect():
    # connection details as in the original article (password masked)
    mydb = mysql.connector.connect(
        host="127.0.0.1",
        user="root",
        passwd="SDDDDDDDD",
        database="abc"
    )
    return mydb


def py():
    mydb = connect()
    mycursor = mydb.cursor()
    mycursor.execute("SELECT * FROM My_table")
    data_sql = mycursor.fetchall()
    start = datetime.datetime.now()
    # sort the list of tuples by the eighth column (time) in descending order
    data_sql.sort(key=lambda x: x[7], reverse=True)
    end = datetime.datetime.now()
    print(end - start)
    mycursor.close()
    mydb.close()
    return data_sql


def main():
    data_sql = py()


if __name__ == "__main__":
    main()
Across the three runs, the average was 0.064 s.
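A small side note on style rather than on the result above: operator.itemgetter is the idiomatic (and often marginally faster) way to express the same sort key as the lambda. A toy, self-contained illustration, not from the original tests:

import operator

# Toy stand-in for data_sql: sort a list of tuples by column index 2, descending.
rows = [(1, "a", 30), (2, "b", 10), (3, "c", 20)]
rows.sort(key=operator.itemgetter(2), reverse=True)
print(rows)  # [(1, 'a', 30), (3, 'c', 20), (2, 'b', 10)]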
IV. Conclusions
1. Calling MySQL from Python performs about the same as using MySQL directly; the difference is negligible.
2. For this large-data sort query, performance ranks Python list > Pandas > MySQL.
The measured averages are 0.652 s for MySQL, 0.433 s for Pandas, and 0.064 s for the Python list.
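Put as ratios of those averages (a small calculation added here for convenience):

# Relative speed computed from the averaged timings reported above.
mysql_t, pandas_t, list_t = 0.652, 0.433, 0.064
print(f"Python list vs MySQL:  {mysql_t / list_t:.1f}x")    # about 10.2x
print(f"Python list vs Pandas: {pandas_t / list_t:.1f}x")   # about 6.8x
print(f"Pandas vs MySQL:       {mysql_t / pandas_t:.1f}x")  # about 1.5x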
Python's performance for this kind of in-memory data processing is impressively strong.
Pandas is built on top of NumPy, and NumPy's internals are implemented in C, yet here it comes out several times slower than the plain Python list sort (roughly 5-7x, going by the averages above). If anyone knows why, please leave a comment, thank you! Let's improve together ~ (A small synthetic-data sketch for starting to investigate this follows below.)
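I don't have a confirmed answer, but one hedged way to investigate is to time the pieces separately on synthetic data of a similar size: the argsort of the time column alone versus the full sort_values call, which also has to reorder every column. The column layout below is a made-up stand-in, not the real table, and nothing here was part of the original runs:

import datetime
import numpy as np
import pandas as pd

# Synthetic stand-in: ~724,000 rows, 7 filler columns plus a datetime64 time column.
n = 724_100
rng = np.random.default_rng(0)
df = pd.DataFrame({str(i): rng.integers(0, 1000, n) for i in range(1, 8)})
df['time'] = pd.to_datetime('2020-01-01') + pd.to_timedelta(
    rng.integers(0, 10_000_000, n), unit='s')

start = datetime.datetime.now()
order = df['time'].values.argsort()[::-1]   # sort order of the time column only
print("argsort of time column:", datetime.datetime.now() - start)

start = datetime.datetime.now()
df.sort_values('time', ascending=False)     # full sort: reorders all 8 columns
print("full sort_values:      ", datetime.datetime.now() - start)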