Background

The PyODPS DataFrame provides a pandas-like interface for manipulating MaxCompute (ODPS) data. It also supports manipulating local data with pandas and using a database as the backend.

In addition to map and apply methods similar to those in pandas, the PyODPS DataFrame also provides a MapReduce API that extends pandas' syntax to fit the big data environment.

PyODPS custom functions are serialized and run inside MaxCompute, but the Python environment on MaxCompute ships with only one third-party package: numpy. So how can we use libraries that contain C code, such as pandas, scipy, or scikit-learn, in custom functions?

With the isolation feature available in MaxCompute Sprint 27 and later, it is now possible to use these packages in custom functions. PyODPS 0.7.4 or later is also required. Below, I explain in detail how to do this.

Steps

Upload third-party packages (just do it once)

This step only needs to be done once; once these packages already exist as MaxCompute resources, it can be skipped.

These popular Python packages all provide wheel (whl) packages containing prebuilt binaries for each platform, so the first step is to find packages that can run on MaxCompute.

Second, to run on MaxCompute, all of a package's dependencies must be included as well, which can be tedious. The dependencies of each package are listed below (numpy is already built into MaxCompute, so it does not need to be uploaded):

Package         Dependencies
pandas          numpy, python-dateutil, pytz, six
scipy           numpy
scikit-learn    numpy, scipy

To make pandas, scipy, and scikit-learn all available, we need to upload python-dateutil, pytz, six, pandas, scipy, and scikit-learn.

We fetch the packages directly from the mirror at mirrors.aliyun.com/pypi/simple. The first is python-dateutil: mirrors.aliyun.com/pypi/simple… . We find the latest version as a zip package, python-dateutil-2.6.0.zip, which is pure Python.



Rename it python-dateutil.zip and upload the resource through the MaxCompute Console.

add archive python-dateutil.zip;


The same goes for pytz: find pytz-2017.2.zip. The upload steps are not repeated here.

For six, we find six-1.11.0.tar.gz.

Pandas contains C code, so we need a wheel whose name contains cp27-cp27m-manylinux1_x86_64; wheels tagged this way can run on MaxCompute. Thus, we find the latest version of the package: pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.whl.

Here we change the file suffix to zip and upload it.

add archive pandas.zip;


The same goes for the other packages, so let’s list them all:

Package           File name                                              Uploaded resource name
python-dateutil   python-dateutil-2.6.0.zip                              python-dateutil.zip
pytz              pytz-2017.2.zip                                        pytz.zip
six               six-1.11.0.tar.gz                                      six.tar.gz
pandas            pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.zip         pandas.zip
scipy             scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.zip          scipy.zip
scikit-learn      scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.zip   sklearn.zip

At this point, all package uploads are complete.

Of course, we can also complete all of the uploads with PyODPS's resource upload interface, which likewise only needs to be done once. Which approach to use is a matter of personal preference.
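A minimal sketch of the PyODPS route, assuming an existing ODPS entry object o (as used later in this article) and the downloaded files from the table above sitting in the current directory:

# Hypothetical example: create archive resources via PyODPS instead of the console.
# The resource names match those used in the console commands above.
with open('pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.zip', 'rb') as f:
    o.create_resource('pandas.zip', 'archive', file_obj=f)

with open('python-dateutil-2.6.0.zip', 'rb') as f:
    o.create_resource('python-dateutil.zip', 'archive', file_obj=f)

The remaining packages can be uploaded the same way, one create_resource call per file.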

Write code to verify

Let's write a simple function that uses all of these libraries. It is best to import the third-party libraries inside the function.

def test(x):
    from sklearn import datasets, svm
    from scipy import misc
    import numpy as np

    iris = datasets.load_iris()
    assert iris.data.shape == (150, 4)
    assert np.array_equal(np.unique(iris.target), [0, 1, 2])

    clf = svm.LinearSVC()
    clf.fit(iris.data, iris.target)
    pred = clf.predict([[5.0, 3.6, 1.3, 0.25]])
    assert pred[0] == 0

    assert misc.face().shape is not None

    return x


This code is just an example, and the goal is to use all of the packages described above.

After writing the function, we write a simple map call. Remember to make sure isolation is enabled; if it is not enabled at the project level, it can be turned on at execution time.

from odps import options

options.sql.settings = {'odps.isolation.session.enable': True}


You can also pass these settings on the execute method to enable isolation for a single execution.

Similarly, we can specify the packages to use, either globally with options.df.libraries or per call on execute. Here we specify all the packages, including their dependencies.
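A minimal sketch of the global form, assuming the resource names uploaded above:

from odps import options

# Every subsequent DataFrame execution will ship these resources as third-party libraries.
options.df.libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz',
                        'pandas.zip', 'scipy.zip', 'sklearn.zip']

Alternatively, pass everything at execute time. Here is an example of calling the function we just defined: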

hints = {
    'odps.isolation.session.enable': True
}
libraries = ['python-dateutil.zip', 'pytz.zip', 'six.tar.gz', 'pandas.zip', 'scipy.zip', 'sklearn.zip']

iris = o.get_table('pyodps_iris').to_df()

print iris[:1].sepal_length.map(test).execute(hints=hints, libraries=libraries)


As you can see, our function works fine.

Conclusion

For third-party libraries and their dependencies, if they have already been uploaded, you can write your code directly and simply specify the libraries to use. Otherwise, you need to upload them first, following the steps above.

As you can see, after the one-time step of uploading the packages, every subsequent use is clean: just specify the libraries.
