The article directories

    • These reviews
    • Dask framework
    • Dask was used for data analysis
    • The difference
      • 1, compute obtain the result
      • 2. Some methods do not support all parameters
      • 3, advice,
    • Build Dask parallel computing method

These reviews

Learn some Data Analysis with me — Day 6: Data Visualization (Seaborn)


Dask framework

Dask is a flexible parallel computing library for analytical computing.

If you want to install a dataframe, you need to import the dataframe module directly, for example: import dataframe

Dask was used for data analysis

For data analysis, dask uses pandas as its core, but wraps a parallel computing framework around it.

import dask.dataframe as dd

df = dd.read_csv('movie.csv',assume_missing=True)

print(df.head(5).T)
Copy the code

You can get an error, like it’s not fully installed, but it will tell you how to fully install. Another example is the need for the parameter assume_missing. I haven’t seen it before, but I can add it anyway.

You will find that the code is not that different from pandas.


The difference

1, compute obtain the result

import dask.dataframe as dd

df = dd.read_csv('movie.csv',assume_missing=True)

print(df.describe().compute())
Copy the code

If you don’t believe me, take compute out and see what comes out of it. It tells you how many computing tasks are waiting.

2. Some methods do not support all parameters

For example, Value_Counts does not support bins, sortby, normalize, and Ascending.

3, advice,

When calculating large data sets, it is recommended to benchmark a small number of data sets first. Make a rough estimate of the time.


Build Dask parallel computing method

PIP install dask[complete] : PIP install dask[complete] : PIP install dask

2. On the CLI, enter dask-scheduler.

3, see TCP address and port number? (Check the task manager status: http://address)

TCP :// TCP :// TCP :// TCP :// TCP :// TCP :// TCP

5. Submit the task after connecting to the manager

import dask.dataframe as dd

from dask.distributed import Client

df = dd.read_csv('movie.csv',assume_missing=True)

client = Client(address='192.168.0.102:8786')

df.groupby('duration').duration.count().compute()
Copy the code

Without further ado, my mom asked me to dinner.