The article directories
-
- These reviews
- Dask framework
- Dask was used for data analysis
- The difference
-
- 1, compute obtain the result
- 2. Some methods do not support all parameters
- 3, advice,
- Build Dask parallel computing method
These reviews
Learn some Data Analysis with me — Day 6: Data Visualization (Seaborn)
Dask framework
Dask is a flexible parallel computing library for analytical computing.
If you want to install a dataframe, you need to import the dataframe module directly, for example: import dataframe
Dask was used for data analysis
For data analysis, dask uses pandas as its core, but wraps a parallel computing framework around it.
import dask.dataframe as dd
df = dd.read_csv('movie.csv',assume_missing=True)
print(df.head(5).T)
Copy the code
You can get an error, like it’s not fully installed, but it will tell you how to fully install. Another example is the need for the parameter assume_missing. I haven’t seen it before, but I can add it anyway.
You will find that the code is not that different from pandas.
The difference
1, compute obtain the result
import dask.dataframe as dd
df = dd.read_csv('movie.csv',assume_missing=True)
print(df.describe().compute())
Copy the code
If you don’t believe me, take compute out and see what comes out of it. It tells you how many computing tasks are waiting.
2. Some methods do not support all parameters
For example, Value_Counts does not support bins, sortby, normalize, and Ascending.
3, advice,
When calculating large data sets, it is recommended to benchmark a small number of data sets first. Make a rough estimate of the time.
Build Dask parallel computing method
PIP install dask[complete] : PIP install dask[complete] : PIP install dask
2. On the CLI, enter dask-scheduler.
3, see TCP address and port number? (Check the task manager status: http://address)
TCP :// TCP :// TCP :// TCP :// TCP :// TCP :// TCP
5. Submit the task after connecting to the manager
import dask.dataframe as dd
from dask.distributed import Client
df = dd.read_csv('movie.csv',assume_missing=True)
client = Client(address='192.168.0.102:8786')
df.groupby('duration').duration.count().compute()
Copy the code
Without further ado, my mom asked me to dinner.