Dask takes off
There are many ways to speed up Pandas, and Dask is one that gets mentioned often.
1. What is Dask?
Pandas and NumPy are familiar to everyone. When your code runs, the data is loaded into RAM, and with a very large dataset you will see memory usage spike. Sometimes the data you need to process simply doesn't fit in RAM, and that's where Dask comes in.
Dask is open source and free. It is developed in coordination with other community projects such as NumPy, Pandas, and scikit-learn.
Official site: dask.org/
Dask supports the Pandas DataFrame and NumPy Array data structures, and it can run on your local computer or scale out to a cluster.
Basically, you write the code once, using normal Pythonic syntax, and you can run it locally or deploy it to a multi-node cluster. This is a cool feature in itself, but it’s not the coolest.
What I think is really cool about Dask is that it’s compatible with most of the tools we already use, and you can run code in parallel using the processing power you already have on your laptop with only a few code changes. Processing data in parallel means less execution time, less waiting time, and more analysis time.
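For a taste of what that looks like, here is a minimal sketch (the scheduler address is a placeholder, as in the Client example later in this article): the same script runs on your laptop or on a multi-node cluster depending only on how the Client is created.

from dask.distributed import Client

client = Client()                            # no arguments: starts a local cluster on this machine
# client = Client('scheduler-address:8786')  # same code, pointed at a multi-node cluster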
[Figure: the general flow of data processing with Dask]
2. What existing tools does Dask support?
This is also one of my favorite points: Dask is compatible with Python's data processing and modeling libraries and reuses their APIs, so the learning cost for Python users is extremely low. By contrast, big data frameworks such as Hadoop and Spark come with a high learning threshold and time cost.
At present, Dask supports Pandas, NumPy, Sklearn, XGBoost, XArray, RAPIDS, and more. That is enough for me: at least for common data processing and modeling analysis, it is completely covered.
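To make the "same API" point concrete, here is a small sketch (the file and column names are hypothetical): the Pandas version and the Dask version differ only in the import and a final .compute() call, because Dask evaluates lazily.

import pandas as pd
df = pd.read_csv('sales.csv')                                # hypothetical file and columns
mean_amount = df.groupby('region').amount.mean()             # Pandas: computed immediately

import dask.dataframe as dd
ddf = dd.read_csv('sales.csv')
mean_amount = ddf.groupby('region').amount.mean().compute()  # Dask: .compute() triggers the work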
3. Dask installation
You can use conda or pip, or install Dask from source.
conda install dask
Because Dask has many dependencies, you can also use the following command for a quick installation of only the minimum set of dependencies required to run Dask:
conda install dask-core
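If you prefer pip (mentioned above but not shown), the equivalent commands are:

python -m pip install dask              # minimal install, like dask-core
python -m pip install "dask[complete]"  # Dask plus the common optional dependencies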
You can also install from source:
git clone https://github.com/dask/dask.git
cd dask
python -m pip install .
4. How to use Dask?
NumPy and Pandas
Dask introduces three parallel collections that can hold data larger than RAM: DataFrame, Bag, and Array. Each of these collection types can work with data partitioned between RAM and the hard disk, as well as data distributed across multiple nodes in a cluster.
The usage of Dask is pretty clear: if you use NumPy arrays, start with a Dask Array; if you use Pandas DataFrames, start with a Dask DataFrame; and so on.
# Arrays
import dask.array as da
x = da.random.uniform(low=0, high=10, size=(10000, 10000),  # normal NumPy code
                      chunks=(1000, 1000))                   # break into chunks of size 1000x1000
y = x + x.T - x.mean(axis=0)  # use normal syntax for high-level algorithms

# DataFrames
import dask.dataframe as dd
df = dd.read_csv('2018-*-*.csv', parse_dates=['timestamp'],  # normal Pandas code
                 blocksize=64000000)                         # break text into 64MB chunks
s = df.groupby('name').balance.mean()  # use normal syntax for high-level algorithms

# Bags / lists
import json
import dask.bag as db
b = db.read_text('*.json').map(json.loads)
total = (b.filter(lambda d: d['name'] == 'Alice')
          .map(lambda d: d['balance'])
          .sum())
These high-level interfaces replicate the standard interfaces with minor variations. For most of the APIs in the original projects, these interfaces automatically parallelize processing of large datasets for us. Picking them up is not complicated and can be done step by step with Dask's documentation.
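One detail worth stressing: like the Delayed interface below, these collections are lazy, so the lines above only build a task graph. A small sketch, continuing the examples above, of how results are actually materialized:

y.mean().compute()  # dask.array: runs the parallel computation, returns a NumPy scalar
df.head()           # dask.dataframe: head() eagerly computes a small sample
s.compute()         # materializes the groupby result as a Pandas Series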
Delayed
Let's talk about Dask's delayed function, which is very powerful.
Delayed is a simple and powerful way to parallelize existing code. It is called delayed because it does not compute the result immediately; instead, it records each computation as a task in a graph that will be run later on parallel hardware.
Sometimes a problem doesn't fit the existing dask.array or dask.dataframe collections, in which case we can parallelize a custom algorithm with the simpler dask.delayed interface. Take the following example.
def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]

output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)

total = sum(output)  # 45
The above code runs sequentially in a single thread, yet we can see that much of it could execute in parallel. Dask's delayed function decorates functions such as inc and double so that they run lazily rather than immediately, placing each function call and its arguments into the computation task graph.
Let’s just change the code and wrap it with the delayed function.
import dask

output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b)
    output.append(c)

total = dask.delayed(sum)(output)
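Equivalently, delayed can also be applied as a decorator, so the loop body stays untouched; a small sketch of the same computation:

import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def double(x):
    return x * 2

@dask.delayed
def add(x, y):
    return x + y

output = [add(inc(x), double(x)) for x in data]
total = dask.delayed(sum)(output)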
After this code runs, inc, double, add, and sum have not actually executed. Instead, a task graph of the computation has been generated and handed to total. Let's visualize the task graph (visualize() requires the graphviz library to be installed):
total.visualize()
The figure clearly shows the opportunities for parallelism, so do not hesitate: call compute() to run the computation in parallel, and we're done.
>>> total.compute()
45
Because the dataset here is too small for a meaningful timing comparison, this only introduces the usage; for concrete numbers, get hands-on and try it yourself.
Sklearn machine learning
Parallel execution of machine learning will be covered in another article, so here is just a quick word about dask-learn.
The dask-learn project was done in collaboration with the scikit-learn developers. It provides scikit-learn's Pipeline, GridSearchCV, and RandomizedSearchCV, plus variations of these that can better handle nested parallel operations.
Therefore, if you replace sklearn with dklearn, the speed can increase considerably.
# from sklearn.grid_search import GridSearchCV
from dklearn.grid_search import GridSearchCV
# from sklearn.pipeline import Pipeline
from dklearn.pipeline import Pipeline

Here is an example of using a Pipeline with PCA and logistic regression applied.

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000,
n_features=500,
n_classes=2,
n_redundant=250,
random_state=42)
from sklearn import linear_model, decomposition
# from sklearn.pipeline import Pipeline
from dklearn.pipeline import Pipeline
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca),
('logistic', logistic)])
grid = dict(pca__n_components=[50, 100, 150, 250],
            logistic__C=[1e-4, 1.0, 10, 1e4],
            logistic__penalty=['l1', 'l2'])
# from sklearn.grid_search import GridSearchCV
from dklearn.grid_search import GridSearchCV
estimator = GridSearchCV(pipe, grid)
estimator.fit(X, y)
The result: scikit-learn performs this calculation in about 40 seconds, while the dask-learn alternative takes about 10 seconds. In addition, if you add the following code to connect to a cluster, the Client can display a dashboard of the entire computation process, implemented with Bokeh.
from dask.distributed import Client
c = Client('scheduler-address:8786')
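If you don't have a cluster at hand, here is a minimal sketch for trying the same dashboard locally (Client with no arguments starts a local cluster):

from dask.distributed import Client

client = Client()                # local scheduler and workers on this machine
print(client.dashboard_link)     # typically http://127.0.0.1:8787/status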
5. Summary
The above is a simple introduction to Dask. Dask is very powerful, and the documentation is complete, with both examples and explanations. If you are interested, go to GitHub or the official website to learn more. Next time, Donge will share some examples of machine learning with Dask.
Creating original content is not easy; if you found this useful, please give it a like.
Welcome everyone to follow my original WeChat official account "Python Data Science", which focuses on Python-based data algorithms, machine learning, and deep learning. Personal website: DataDeepin