Written by Thomas Wolf, Heart of the Machine.
Cython is a toolkit that lets you compile Python-like code down to C, which is part of why NumPy and pandas are so fast. Cython is a superset of Python. In this article, the author introduces his GitHub project NeuralCoref v3.0 and explains how spaCy and Cython can be used to make NLP projects run about 100 times faster than pure Python.
Jupyter Notebook address: github.com/huggingface…
After we released our Python coreference resolution package last year, we received wonderful feedback from the community, and people started using it in many applications, some of which were very different from our original conversational use case.
We found that while it was very fast on conversational messages, it could be very slow on long news articles.
I decided to investigate the problem in detail, and the result is NeuralCoref v3.0, which is about 100 times faster than the old version at the same accuracy (several thousand words per second), while maintaining the ease of use and compatibility of a Python library.
NeuralCoref v3.0: github.com/huggingface…
I would like to share some of my experiences about the project in this article, in particular:
- How to design a high-speed module in Python;
- How to use spaCy’s internal data structures to efficiently design ultra-high speed NLP functions.
So I’m cheating a little bit here, because we’re going to talk about Python, but we’re also going to talk a little bit about the magic of Cython. But, you know what? Cython is a superset of Python, so don’t let it scare you off!
Your Python program is already a Cython program.
There are several situations where you may need to speed things up, such as:
- You are developing a production module for NLP in Python;
- You are using Python to compute and analyze large NLP data sets;
- You are preprocessing large training sets for deep learning frameworks such as PyTorch/TensorFlow, or the processing logic in your deep learning batch loader is too onerous, which can slow down the training.
Again: I've published a companion Jupyter Notebook containing the examples discussed in this article. Give it a try!
Jupyter Notebook: github.com/huggingface…
Step one: Profiling
The first thing to know is that most of your code will probably run fine in a pure Python environment, but a few bottleneck functions, if you pay attention to them, can make your code orders of magnitude faster.
Therefore, you should first profile your Python code and find out where the bottlenecks are. Using cProfile, as follows, is one option:
import cProfile
import pstats
import myslowmodule
cProfile.run('myslowmodule.run()', 'restats')
p = pstats.Stats('restats')
p.sort_stats('cumulative').print_stats(30)
If you are using neural networks, you may find that the bottleneck is a few loops involving NumPy array operations.
So, how do we speed up this loop code?
Use some Cython in Python to speed up loops
Let’s analyze this problem with a simple example. Suppose we have a bunch of rectangles and store them in a list of Python objects, such as an instance of the Rectangle class. The main job of our module is to iterate over the list in order to calculate how many rectangles are larger than a particular threshold.
Our Python module is very simple and looks like this:
from random import random

class Rectangle:
    def __init__(self, w, h):
        self.w = w
        self.h = h
    def area(self):
        return self.w * self.h

def check_rectangles(rectangles, threshold):
    n_out = 0
    for rectangle in rectangles:
        if rectangle.area() > threshold:
            n_out += 1
    return n_out

def main():
    n_rectangles = 10000000
    rectangles = list(Rectangle(random(), random()) for i in range(n_rectangles))
    n_out = check_rectangles(rectangles, threshold=0.25)
    print(n_out)
The check_rectangles function is the bottleneck! It loops over a large number of Python objects, which is slow because the Python interpreter does a lot of work on each iteration (looking up the area method on the class, packing and unpacking arguments, calling the Python API…).
Cython will help us speed up this loop.
The Cython language is a superset of Python and contains two types of objects:
- Python objects are objects that we manipulate in regular Python, such as numbers, strings, lists, class instances…
- Cython C objects are C or C++ objects such as double, int, float, struct, vectors. These can be compiled by Cython into super-fast low-level code.
A fast loop is simply a loop in a Cython program within which we only access Cython C objects.
A straightforward way to design such a loop is to define a C structure, which will contain all the elements we need in our calculation: in our case, the length and width of the rectangle.
We can then store our list of rectangles in a C array of this structure and pass the array to our check_rectangles function. This function now takes a C array as input, so it is defined as a Cython function with the cdef keyword instead of def (note that cdef is also used to define Cython C objects).
Here is a quick Cython version of our Python module:
from cymem.cymem cimport Pool
from random import random

cdef struct Rectangle:
    float w
    float h

cdef int check_rectangles(Rectangle* rectangles, int n_rectangles, float threshold):
    cdef int n_out = 0
    # C arrays contain no size information => we need to give it explicitly
    for rectangle in rectangles[:n_rectangles]:
        if rectangle.w * rectangle.h > threshold:
            n_out += 1
    return n_out

def main():
    cdef:
        int n_rectangles = 10000000
        float threshold = 0.25
        Pool mem = Pool()
        Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))
    for i in range(n_rectangles):
        rectangles[i].w = random()
        rectangles[i].h = random()
    n_out = check_rectangles(rectangles, n_rectangles, threshold)
    print(n_out)
We use native C pointer arrays here, but you could also choose other options, notably C++ structures such as vectors, pairs, queues, and so on. In this snippet, I also use cymem's handy Pool() memory management object to avoid having to free the allocated C array manually. When Pool is garbage-collected by Python, it automatically frees the memory we allocated with it.
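As a point of comparison, here is a minimal sketch (not from the original article) of the same counting loop using a C++ vector, which manages its own memory, instead of a raw C array plus Pool. It assumes the code is compiled in C++ mode (the -+ flag discussed later):

from libcpp.vector cimport vector
from random import random

cdef struct Rectangle:
    float w
    float h

def main():
    # The vector grows as needed and frees its memory automatically
    cdef vector[Rectangle] rectangles
    cdef Rectangle r
    cdef int n_out = 0
    for i in range(10000000):
        r.w = random()
        r.h = random()
        rectangles.push_back(r)
    for r in rectangles:
        if r.w * r.h > 0.25:
            n_out += 1
    print(n_out)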
The spaCy API’s Cython Conventions is a good reference to the practical use of Cython in NLP.
SpaCy: spacy.io
Cython Conventions: spacy.io/api/cython#…
Let’s try this code out!
There are many ways to test, compile, and distribute Cython code! Cython can even be used directly in a Jupyter Notebook, just like Python.
Jupyter Notebook: cython.readthedocs.io/en/latest/s…
First, install Cython with pip install cython.
The first test in Jupyter
Load the Cython extension into the Jupyter Notebook using %load_ext Cython.
You can now write Cython code just like Python code using the %%cython cell magic.
If you encounter a compilation error while executing a Cython unit, be sure to check the Jupyter terminal output for complete information.
In most cases, the problem is a missing -+ flag after %%cython (needed to compile to C++, for example if you use the spaCy Cython API) or a missing import numpy (if the compiler complains about NumPy).
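If it's your first time, a minimal cell pair like the following (the function names are just for illustration) is a quick sanity check that the toolchain works:

%load_ext Cython

%%cython
cdef int c_add(int a, int b):
    return a + b

def cython_add(a, b):   # callable from Python space
    return c_add(a, b)

print(cython_add(3, 4))  # prints 7 if compilation succeeded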
As I mentioned at the beginning, check out the companion Jupyter Notebook for this article, which contains all the examples discussed here.
Write, use, and distribute Cython code
Cython code is written in .pyx files. These files are compiled into C or C++ files by the Cython compiler, which are in turn compiled by the system's C compiler into native extension modules that the Python interpreter can import.
You can load .pyx files directly in Python using pyximport:
>>> import pyximport; pyximport.install()
>>> import my_cython_module
You can also build your Cython code as a Python package and import/distribute it like a regular Python package, as shown below. This can take some work, especially to make it run on all platforms. If you need a working example, spaCy's install script is a fairly comprehensive one.
Import tutorial: cython.readthedocs.io/en/latest/s…
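For reference, a minimal setup.py along these lines (the module name my_cython_module is just an example) is usually all that's needed to get started; see the import tutorial above for full details:

# setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    name='my_cython_module',
    ext_modules=cythonize('my_cython_module.pyx'),
)

You can then build the extension in place with python setup.py build_ext --inplace and import the module as usual.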
Before we move on to NLP, let’s quickly discuss the def, cdef, and cpdef keywords, because they are the main things you need to know to get started with Cython.
There are three types of functions you can use in Cython programs:
- Python functions, defined with the usual keyword def. They take Python objects as both input and output. Internally they can use both Python and C/C++ objects, and can call both Cython and Python functions.
- Cython functions, defined with the cdef keyword. They can take both Python and C/C++ objects as input and output. These functions are not accessible from Python space (that is, from the Python interpreter and other pure Python modules that import your Cython module), but they can be imported by other Cython modules.
- Cython functions defined with the cpdef keyword are like the cdef-defined Cython functions, but they also come with a Python wrapper, so they can be called both from Python space (with Python objects as input and output) and from other Cython modules (with C/C++ or Python objects as input).
The cdef keyword has another use: defining Cython C/C++ objects in your code. Unless objects are declared with this keyword, they are treated as Python objects (and are thus slow to access).
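To make this concrete, here is a small illustrative sketch (the function names are made up) showing all three kinds of functions side by side:

cdef int c_mul(int a, int b):       # cdef: C-level, invisible from Python space
    return a * b

cpdef int both_mul(int a, int b):   # cpdef: fast from Cython, also wrapped for Python
    return a * b

def py_mul(a, b):                   # def: a regular Python function
    cdef int result = c_mul(a, b)   # cdef also declares C variables inside functions
    return result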
Use Cython and spaCy to speed up NLP
All this is nice and fast, but… we haven't integrated NLP yet! No string manipulation, no Unicode encoding, and none of the subtleties we are lucky to have in natural language processing.
The official Cython documentation even advises against using C strings:
As a general rule: unless you know what you are doing, avoid using C strings whenever possible and use Python string objects instead.
So how do we design fast loops in Cython when using strings?
SpaCy will help us.
SpaCy’s approach to this problem is very clever.
It converts all strings to 64-bit hash codes.
All Unicode strings in spaCy (the text of tokens, their lowercase forms, lemmas, part-of-speech tags, parse-tree dependency labels, named-entity tags…) are stored in a single data structure called StringStore, where they are indexed by 64-bit hashes, i.e. C uint64_t values.
The StringStore object implements a lookup table between Python Unicode strings and 64-bit hash codes.
It can be accessed from anywhere in spaCy and from any object, for example as nlp.vocab.strings or doc.vocab.strings.
When a module needs fast processing on some tokens, it uses only the C-level 64-bit hash codes instead of strings. Calling the StringStore lookup table returns the Python Unicode string associated with a hash code.
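Here is what that round trip looks like in plain Python (the hash values are 64-bit integers whose exact value depends on the string):

import spacy

nlp = spacy.load('en')
run_hash = nlp.vocab.strings.add('run')  # store the string, get back its 64-bit hash
print(run_hash)                          # a plain 64-bit integer
print(nlp.vocab.strings[run_hash])       # look the hash back up -> 'run'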
SpaCy does more than that, though: it also gives us access to the fully populated C-level structures of documents and vocabularies, which we can use in Cython loops without having to build custom structures.
SpaCy’s internal data structure
The main data structure to work with is the Doc object, which owns the sequence of tokens of the processed string (the "words") and all their annotations in a C-level object called doc.c, an array of TokenC structures.
The TokenC structure contains all the information we need about each token. This information is stored as 64-bit hash codes that can be re-associated with Unicode strings, as we just saw.
To see what's in these C structures, just look at spaCy's recently published Cython API documentation.
Let’s look at a simple example of NLP processing.
Use spaCy and Cython for fast NLP processing
Suppose we have a text data set that we need to analyze:
import urllib.request
import spacy
with urllib.request.urlopen('https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/valid.txt') as response:
text = response.read()
nlp = spacy.load('en')
doc_list = list(nlp(text[:800000].decode('utf8')) for i in range(10))
This script generates a list of 10 documents, each about 170K words, parsed by spaCy. We could also have generated 170K documents of 10 words each (such as a conversation data set), but that creation step is slower, so we stick with 10 documents.
We want to perform some NLP task on this data set. For example, we want to count the number of times the word "run" is used as a noun in the data set (that is, tagged by spaCy with the "NN" part-of-speech tag).
A straightforward Python loop does this:
def slow_loop(doc_list, word, tag):
    n_out = 0
    for doc in doc_list:
        for tok in doc:
            if tok.lower_ == word and tok.tag_ == tag:
                n_out += 1
    return n_out

def main_nlp_slow(doc_list):
    n_out = slow_loop(doc_list, 'run', 'NN')
    print(n_out)
But it’s also slow! On my laptop, this code takes about 1.4 seconds to get the result. If we had a million documents, it would take more than a day to get the results.
We could use multithreading, but that's usually not a good solution in Python because you have to deal with the GIL. Also, note that Cython can use multithreading too! And this is actually probably the best part of Cython: the GIL is released, so we can run at full speed. Cython basically has direct support for OpenMP.
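To sketch what that can look like (this is not part of the article's example; count_above is a hypothetical function, and compiling it requires OpenMP flags such as -fopenmp):

from cython.parallel cimport prange

cdef int count_above(float* areas, int n, float threshold) nogil:
    cdef int i, n_out = 0
    # prange distributes iterations across OpenMP threads;
    # the in-place += on n_out is handled as a reduction by Cython
    for i in prange(n):
        if areas[i] > threshold:
            n_out += 1
    return n_out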
Now let’s try to speed up our Python code using spaCy and part of Cython.
First, we must consider data structures. We need a C array for the data set, with pointers to each document's TokenC array. We also need to convert the strings we test against ("run" and "NN") to 64-bit hash codes.
When all the data we need is in C objects, we can iterate over the data set at the speed of C.
Here is an example of how to use spaCy written in Cython:
%%cython -+
import numpy # Sometimes we get a fail-to-import-numpy compilation error if we don't import numpy
from cymem.cymem cimport Pool
from spacy.tokens.doc cimport Doc
from spacy.typedefs cimport hash_t
from spacy.structs cimport TokenC

cdef struct DocElement:
    TokenC* c
    int length

cdef int fast_loop(DocElement* docs, int n_docs, hash_t word, hash_t tag):
    cdef int n_out = 0
    for doc in docs[:n_docs]:
        for c in doc.c[:doc.length]:
            if c.lex.lower == word and c.tag == tag:
                n_out += 1
    return n_out

def main_nlp_fast(doc_list):
    cdef int i, n_out, n_docs = len(doc_list)
    cdef Pool mem = Pool()
    cdef DocElement* docs = <DocElement*>mem.alloc(n_docs, sizeof(DocElement))
    cdef Doc doc
    for i, doc in enumerate(doc_list): # Populate our data structure
        docs[i].c = doc.c
        docs[i].length = (<Doc>doc).length
    word_hash = doc.vocab.strings.add('run')
    tag_hash = doc.vocab.strings.add('NN')
    n_out = fast_loop(docs, n_docs, word_hash, tag_hash)
    print(n_out)
The code is a bit long because we have to declare and populate the C structures in main_nlp_fast before calling the Cython function. (If you use low-level structures several times in your code, a more elegant option than populating the C structures each time is to design your Python code with Cython extension types that wrap the C structures. This is how most of spaCy is structured; it is a very elegant approach that combines speed, low memory use, and ease of interfacing with external Python libraries and functions. A sketch of this pattern follows below.)
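Here is a hedged sketch of that extension-type pattern, reusing the DocElement struct and fast_loop function from the cell above (the class name DocBatch and its method are hypothetical):

cdef class DocBatch:
    cdef Pool mem
    cdef DocElement* docs
    cdef int n_docs

    def __init__(self, doc_list):
        cdef Doc doc
        cdef int i
        self.mem = Pool()
        self.n_docs = len(doc_list)
        self.docs = <DocElement*>self.mem.alloc(self.n_docs, sizeof(DocElement))
        for i, doc in enumerate(doc_list):  # populate the C structures once
            self.docs[i].c = doc.c
            self.docs[i].length = doc.length

    def count(self, vocab, word, tag):
        # the C array is already built; only the hashes are looked up per query
        return fast_loop(self.docs, self.n_docs,
                         vocab.strings.add(word), vocab.strings.add(tag))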
But it’s also a lot faster! In my Jupyter Notebook, this Cython code runs about 20 milliseconds, which is about 80 times faster than our pure Python loop.
It's also impressive that a module written in a Jupyter Notebook cell can run this fast while interfacing natively with other Python modules and functions: scanning 10 documents of about 170K words each in roughly 20 milliseconds means we process up to 80 million words per second.
This concludes our quick introduction to NLP using Cython. I hope you like it.
There is much more to say about Cython, but it would take us too far from the subject. The best places to start are probably the Cython tutorials overview and spaCy's Cython page for NLP.
Original link: medium.com/huggingface…