This is the third day of my participation in the First Challenge 2022. For more details, see: First Challenge 2022.
Preface
Pandas can be inefficient when handling large data, so Polars is recommended instead.
Let's get started.
Development tools
Python version: 3.6.4
Related modules:
pandas module;
polars module;
And some modules that come with Python (e.g. timeit).
Environment setup
Install Python and add it to the environment variables, then use pip to install the required modules.
Pandas can create data, read files in a variety of formats (text, CSV, JSON), and slice and dice data to combine multiple data sources.
Pandas does have some disadvantages, though, such as its lack of multiprocessing support and its slow processing of large data sets.
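These Pandas capabilities can be sketched in a few lines (the data below is made up purely for illustration):

```python
import pandas as pd

# Create two small data sources from scratch
df_a = pd.DataFrame({"name": ["alice", "bob"], "n": [3, 1]})
df_b = pd.DataFrame({"name": ["carol"], "n": [2]})

# Combine the sources vertically into one DataFrame
combined = pd.concat([df_a, df_b], ignore_index=True)

# Slice and dice: keep the two rows with the largest 'n'
top = combined.sort_values("n", ascending=False).head(2)
print(top["name"].tolist())
```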
Here I introduce a new Python library called Polars.
Polars uses syntax similar to Pandas, but handles data much faster than Pandas.
Polars is a library written in Rust, and its memory model is based on Apache Arrow.
Polars has two APIs: the Eager API and the Lazy API.
The Eager API works much like Pandas: each operation is executed immediately and the result is returned right away.
The Lazy API, like Spark, first converts queries into logical plans and then reorganizes and optimizes the plans to reduce execution time and memory usage.
Install Polars from the Baidu pip mirror:
# installation polars
pip install polars -i https://mirror.baidu.com/pypi/simple/
After the installation succeeds, we test how Pandas and Polars handle the same data.
The test data set is the registered user names of a website; the CSV file contains about 26 million user names.
import pandas as pd
df = pd.read_csv('users.csv')
print(df)
The data is as follows
A second, self-generated CSV file is also used for the data-integration test.
import pandas as pd
df = pd.read_csv('fake_user.csv')
print(df)
The result:
First, compare how long sorting takes in each library:
import timeit
import pandas as pd
start = timeit.default_timer()
df = pd.read_csv('users.csv')
df.sort_values('n', ascending=False)  # sort by the 'n' column, descending
stop = timeit.default_timer()
print('Time: ', stop - start)
-------------------------
Time: 27.555776743218303
As you can see, reading and sorting the data with Pandas took about 28 seconds.
import timeit
import polars as pl
start = timeit.default_timer()
df = pl.read_csv('users.csv')
df.sort(by_column='n', reverse=True)  # in newer Polars versions: df.sort('n', descending=True)
stop = timeit.default_timer()
print('Time: ', stop - start)
-----------------------
Time: 9.924110282212496
Polars took only about 10 seconds, which makes it roughly 2.8 times faster than Pandas here (27.6s / 9.9s).
Next, let's try data integration: vertical concatenation.
import timeit
import pandas as pd
start = timeit.default_timer()
df_users = pd.read_csv('users.csv')
df_fake = pd.read_csv('fake_user.csv')
df_users.append(df_fake, ignore_index=True)  # append was removed in pandas 2.0; use pd.concat([df_users, df_fake], ignore_index=True)
stop = timeit.default_timer()
print('Time: ', stop - start)
------------------------
Time: 15.556222308427095
Pandas takes about 15.6 seconds.
import timeit
import polars as pl
start = timeit.default_timer()
df_users = pl.read_csv('users.csv')
df_fake = pl.read_csv('fake_user.csv')
df_users.vstack(df_fake)
stop = timeit.default_timer()
print('Time: ', stop - start)
-----------------------
Time: 3.475433263927698
Polars took only about 3.5 seconds, roughly 4.5 times faster than Pandas.
The data used this time can be obtained from the home page
Conclusion
Pandas is 12 years old and has developed a mature ecosystem that supports many other data-analysis libraries.
Polars is a relatively new library and still leaves much to be desired.
If your dataset is too large for Pandas and too small for Spark, then Polars is an option worth considering.