Background

The PyODPS DataFrame provides a pandas-like interface for manipulating MaxCompute (ODPS) data. It also supports manipulating local data with pandas and using a database as the backend.

In addition to map and apply methods similar to those in pandas, the PyODPS DataFrame also provides a MapReduce API that extends pandas' syntax to fit the big data environment.

PyODPS custom functions are serialized and run inside MaxCompute, but the Python environment on MaxCompute ships with only one third-party package: numpy. So how can we use libraries that contain C code, such as pandas, scipy, or scikit-learn, in custom functions?

With the isolation feature available in MaxCompute Sprint 27 and later, it is now possible to use these packages in custom functions. PyODPS 0.7.4 or later is also required. Below, I explain in detail how to do this.

Steps

Upload third-party packages (just do it once)

This step only needs to be done once; once these packages already exist as MaxCompute resources, it can be skipped.

These popular Python packages all provide wheel (whl) packages containing prebuilt binaries for each platform, so the first step is to find packages that can run on MaxCompute.

Second, to run on MaxCompute, all of a package's dependencies must be included as well, which can be tedious. The dependencies of each package are listed below (numpy is already built into MaxCompute, so it does not need to be uploaded):

Package         Dependencies
pandas          numpy, python-dateutil, pytz, six
scipy           numpy
scikit-learn    numpy, scipy

To make pandas, scipy, and scikit-learn all available, we need to upload python-dateutil, pytz, six, pandas, scipy, and scikit-learn.

We fetch the packages directly from the mirror at mirrors.aliyun.com/pypi/simple. The first is python-dateutil: mirrors.aliyun.com/pypi/simple… . We find the latest version as a zip package, python-dateutil-2.6.0.zip, which is pure Python.



Rename it python-dateutil.zip and upload the resource through the MaxCompute Console.

add archive python-dateutil.zip;


The same goes for pytz: find pytz-2017.2.zip. The upload steps are not repeated here.

For six, we find six-1.11.0.tar.gz.

Pandas contains C code, so we need a wheel whose name contains cp27-cp27m-manylinux1_x86_64; wheels tagged this way can run on MaxCompute. Thus, we find the latest version of the package: pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl.

Here we change the file suffix to zip and upload it.

add archive pandas.zip;


The same goes for the other packages, so let’s list them all:

Package           File name                                              Uploaded resource name
python-dateutil   python-dateutil-2.6.0.zip                              python-dateutil.zip
pytz              pytz-2017.2.zip                                        pytz.zip
six               six-1.11.0.tar.gz                                      six.tar.gz
pandas            pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.zip         pandas.zip
scipy             scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.zip          scipy.zip
scikit-learn      scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.zip   sklearn.zip

At this point, all package uploads are complete.

Of course, we can also complete all of the uploads with PyODPS's resource upload interface, which likewise only needs to be done once. Which approach to use is a matter of personal preference.
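A minimal sketch of the PyODPS route, assuming an existing ODPS entry object o (as used later in this article) and the downloaded files from the table above sitting in the current directory:

# Hypothetical example: create archive resources via PyODPS instead of the console.
# The resource names match those used in the console commands above.
with open('pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.zip', 'rb') as f:
    o.create_resource('pandas.zip', 'archive', file_obj=f)

with open('python-dateutil-2.6.0.zip', 'rb') as f:
    o.create_resource('python-dateutil.zip', 'archive', file_obj=f)

The remaining packages can be uploaded the same way, one create_resource call per file.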

Write code to verify

Let's write a simple function that uses all of these libraries. It is best to import the third-party libraries inside the function.

def test(x):
    from sklearn import datasets, svm
    from scipy import misc
    import numpy as np

    iris = datasets.load_iris()
    assert iris.data.shape == (150, 4)
    assert np.array_equal(np.unique(iris.target), [0, 1, 2])

    clf = svm.LinearSVC()
    clf.fit(iris.data, iris.target)
    pred = clf.predict([[5.0, 3.6, 1.3, 0.25]])
    assert pred[0] == 0

    assert misc.face().shape is not None

    return x


This code is just an example, and the goal is to use all of the packages described above.

After writing the function, we write a simple map call. Remember to make sure isolation is enabled; if it is not enabled at the project level, it can be turned on at execution time.

from odps import options

options.sql.settings = {'odps.isolation.session.enable': True}


You can also pass these settings on the execute method to enable isolation for a single execution.

Similarly, we can specify the packages to use, either globally with options.df.libraries or per call on execute. Here we specify all the packages, including their dependencies.
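A minimal sketch of the global form, assuming the resource names uploaded above:

from odps import options

# Every subsequent DataFrame execution will ship these resources as third-party libraries.
options.df.libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz',
                        'pandas.zip', 'scipy.zip', 'sklearn.zip']

Alternatively, pass everything at execute time. Here is an example of calling the function we just defined: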

hints = {
    'odps.isolation.session.enable': True
}
libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz', 'pandas.zip', 'scipy.zip', 'sklearn.zip']

iris = o.get_table('pyodps_iris').to_df()

print iris[:1].sepal_length.map(test).execute(hints=hints, libraries=libraries)


As you can see, our function works fine.

Conclusion

For third-party libraries and their dependencies, if they have already been uploaded, you can write your code directly and simply specify the libraries to use. Otherwise, you need to upload them first, following the steps above.

As you can see, after the one-time step of uploading the packages, every subsequent use is clean: just specify the libraries.
