• Natural Language Processing Made Easy — using SpaCy (in Python)
  • Shivam Bansal
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: lsvih
  • Proofread by: Yzgyyang, SQrthree

Easy natural language processing with Python + spaCy

Introduction

Natural language processing (NLP) is one of the most important fields in artificial intelligence. It plays a key role in many intelligent applications such as chatbots, text extraction, multilingual translation, and opinion mining. Companies working with NLP in industry are realizing that when dealing with unstructured text data, it is not just about accuracy; getting the results you want quickly matters too.

NLP is a very broad field that includes text classification, entity recognition, machine translation, question answering, concept recognition, and other subfields. In my most recent article, I explored many of the tools and components used to implement NLP, focusing mainly on the great library NLTK (Natural Language Toolkit).

In this article, I will introduce spaCy, one of the most powerful and advanced NLP libraries available for Python today.


Contents

  1. Introduction to spaCy and installation
  2. SpaCy’s pipeline and properties

    • Tokenization
    • Part-of-speech tagging
    • Entity recognition
    • Dependency parsing
    • Noun phrases
  3. Integrated word vector computation
  4. Machine learning with spaCy
  5. Comparison with NLTK and CoreNLP

1. Introduction to spaCy and installation

1.1 Introduction

SpaCy is written in Cython, a C extension of Python that aims to make Python programs perform as well as C programs, so it runs efficiently. SpaCy provides a set of concise APIs for users and, under the hood, is built on already trained machine learning and deep learning models.


1.2 Installation

SpaCy, together with its data and models, can be installed easily using pip. Use the following command to install spaCy on your computer:

sudo pip install spacy

If you are using Python 3, use pip3 instead of pip.

Or you can download the source code here, unzip it and run the following command to install it:

python setup.py install

After spaCy is installed, run the following command to download all of its datasets and models:

python -m spacy.en.download all

With everything in place, you are now free to explore and use spaCy.

2. SpaCy’s Pipeline and Properties

SpaCy’s functionality and its various properties are exposed through pipelines; spaCy creates a pipeline when a model is loaded. The spaCy package provides a variety of modules containing the vocabulary, trained vectors, syntax, and entity information needed for language processing.

Next, we’ll load the default model, en (the English core web model).

import spacy
nlp = spacy.load("en")

The nlp object is used to create documents, obtain linguistic annotations, and access other NLP properties. First we create a document by loading text data into the pipeline. I used hotel review data from TripAdvisor; the data file can be downloaded here.

document = unicode(open(filename).read().decode('utf8'))
document = nlp(document)

document is now an object of spaCy’s English model class and carries a number of attributes. You can list all of the document’s (or a token’s) attributes using the following command:

dir(document)
>> ['doc', 'ents', ..., 'mem']

This outputs a wide range of document attributes, such as tokens, token indices, part-of-speech tags, entities, vectors, sentiment, vocabulary, and more. Let’s take a look at some of these attributes.

2.1 Tokenization

During tokenization, a spaCy document is split into individual sentences, and those sentences are further split into tokens (words). You can read these tokens by iterating over the document:

# The first word of the document
document[0]
>> Nice

# The last word of the document
document[len(document)-5]
>> boston

# List the sentences of the document
list(document.sents)
>> [ Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
...
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).]

2.2 POS Tagging

Part-of-speech tagging assigns each word its part of speech within a grammatically correct sentence. These tags can be used for information filtering, statistical modeling, or rule-based text parsing.

Let’s look at all the POS tags used in our document:

# Get all POS tags
all_tags = {w.pos: w.pos_ for w in document}
>> {97: u'SYM', 98: u'VERB', 99: u'X', 101: u'SPACE', 82: u'ADJ', 83: u'ADP', 84: u'ADV', 87: u'CCONJ', 88: u'DET', 89: u'INTJ', 90: u'NOUN', 91: u'NUM', 92: u'PART', 93: u'PRON', 94: u'PROPN', 95: u'PUNCT'}

# The POS tags of the first sentence of the document
for word in list(document.sents)[0]:
    print word, word.tag_
>> (Nice, u'JJ') (place, u'NN') (Better, u'NNP') (than, u'IN') (some, u'DT') (reviews, u'NNS') (give, u'VBP') (it, u'PRP') (credit, u'NN') (for, u'IN') (., u'.')
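
Note that the loop above prints the fine-grained .tag_ (Penn Treebank style), while the dictionary before it used the coarse-grained .pos_; both are available on every token. A minimal sketch showing the two side by side:

# Coarse-grained universal POS (.pos_) vs. fine-grained treebank tag (.tag_)
for word in list(document.sents)[0]:
    print word, word.pos_, word.tag_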

Let’s look at the most commonly used words in the document. I have written pre-processing and text-cleaning functions in advance.

# Some parameter definitions
noisy_pos_tags = ["PROP"]
min_token_length = 2

# Function to check whether a token is noise
def isNoise(token):
    is_noise = False
    if token.pos_ in noisy_pos_tags:
        is_noise = True
    elif token.is_stop == True:
        is_noise = True
    elif len(token.string) <= min_token_length:
        is_noise = True
    return is_noise

def cleanup(token, lower = True):
    if lower:
        token = token.lower()
    return token.strip()

# The most commonly used words in the reviews
from collections import Counter
cleaned_list = [cleanup(word.string) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)
>> [(u'hotel', 683), (u'room', 652), (u'great', 300), (u'sheraton', 285), (u'location', 271)]

2.3 Entity Recognition

SpaCy ships with a fast entity recognition model that can extract entity phrases from a document. It recognizes entities of various types, such as persons, locations, organizations, dates, and numbers. You can read these entities via the .ents property.

Let’s get all types of named entities in our document:

labels = set([w.label_ for w in document.ents])
for label in labels:
    entities = [cleanup(e.string, lower=False) for e in document.ents if label == e.label_]
    entities = list(set(entities))
    print label, entities
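
To see entities in context, we can also run the pipeline on a single sentence and print each entity with its type. A minimal sketch (the example sentence is my own; the exact entities found depend on the model):

# Print each detected entity together with its predicted type
doc_ex = nlp(u'I stayed at the Sheraton in Boston for three nights in June 2016.')
for ent in doc_ex.ents:
    print ent.text, ent.label_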

2.4 Dependency Parsing

One of spaCy’s most powerful features is fast and accurate dependency parsing through a lightweight API. The parser can also be used for sentence boundary detection and phrase chunking. Dependency relations can be read via the .children, .root, and .ancestors attributes (a short sketch of .root and .ancestors follows the example below).

# Retrieve all sentences in the reviews that contain the word "hotel"
hotel = [sent for sent in document.sents if 'hotel' in sent.string.lower()]

# Create a dependency tree
sentence = hotel[2]
for word in sentence:
    print word, ':', str(list(word.children))
>> A : []
cab : [A, from]
from : [airport, to]
the : []
airport : [the]
to : [hotel]
the : []
hotel : [the]
can : []
be : [cab, can, cheaper, .]
cheaper : [than]
than : [shuttles]
the : []
shuttles : [the, depending]
depending : [time]
what : []
time : [what, of]
of : [day]
the : []
day : [the, go]
you : []
go : [you]
. : []
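
The example above walks down the tree via .children; .root and .ancestors walk it the other way. A minimal sketch over the same sentence (output omitted, since it depends on the review data):

# The syntactic head of the sentence, via the span's .root attribute
print sentence.root

# Walk from a token up to the root of the tree via .ancestors
for word in sentence:
    if word.string.strip() == 'hotel':
        print word, ':', list(word.ancestors)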

Let’s parse the dependencies of all sentences containing the word “hotel” and see which adjectives people use to describe it. I created a custom function that parses dependencies and extracts the relevant part-of-speech tags.

# Check all adjectives used to describe a given word
def pos_words(sentence, token, ptag):
    sentences = [sent for sent in sentence.sents if token in sent.string]
    pwrds = []
    for sent in sentences:
        for word in sent:
            if token in word.string:
                pwrds.extend([child.string.strip() for child in word.children
                              if child.pos_ == ptag])
    return Counter(pwrds).most_common(10)

pos_words(document, 'hotel', "ADJ")
>> [(u'other', 20), (u'great', 10), (u'good', 7), (u'better', 6), (u'nice', 6), (u'different', 5), (u'many', 5), (u'best', 4), (u'my', 4), (u'wonderful', 3)]

2.5 Noun Phrases (NP)

Dependency trees can also be used to generate noun phrases:

# Generate noun phrases
doc = nlp(u'I love data science on analytics vidhya')
for np in doc.noun_chunks:
    print np.text, np.root.dep_, np.root.head.text
>> I nsubj love
   data science dobj love
   analytics pobj on

3. Integrated word vectors

SpaCy provides built-in integration of dense, real-valued vectors that capture the distributional meaning of words. It uses GloVe, an unsupervised learning algorithm for obtaining vector representations of words, to generate these vectors.

Let’s create some word vectors and do some interesting things with them:

from numpy import dot
from numpy.linalg import norm
from spacy.en import English
parser = English()

# Generate the word vector of "apple"
apple = parser.vocab[u'apple']

# Cosine similarity function
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
others = list({w for w in parser.vocab
               if w.has_vector and w.orth_.islower() and w.lower_ != unicode("apple")})

# Sort by similarity
others.sort(key=lambda w: cosine(w.vector, apple.vector))
others.reverse()

print "top most similar words to apple:"
for word in others[:10]:
    print word.orth_
>> apples iphone fruit juice cherry lemon banana pie mac orange
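
Incidentally, spaCy also exposes a built-in .similarity() method that computes the same cosine similarity over the GloVe vectors, so the manual calculation above could be replaced with it. A minimal sketch:

# Built-in cosine similarity between two vocabulary entries
apple = parser.vocab[u'apple']
orange = parser.vocab[u'orange']
print apple.similarity(orange)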

4. Machine learning with spaCy

Integrating spaCy into a machine learning model is simple and straightforward. Let’s build a custom text classifier using sklearn, creating an sklearn pipeline with cleaner, tokenizer, vectorizer, and classifier components. The tokenizer and vectorizer will be built from modules we customize with spaCy.

from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS as stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

import string
punctuations = string.punctuation

from spacy.en import English
parser = English()

# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic function for text cleaning
def clean_text(text):
    return text.strip().lower()

Now let’s use spaCy’s parser and some basic data cleaning to create a custom tokenizer function. It’s worth noting that you can use word vectors in place of the text features, which works especially well with deep learning models; a sketch of such a featurizer follows the block below.

# Create a spaCy tokenizer that parses a sentence and generates tokens
# It could also be replaced with a word-vector featurizer
def spacy_tokenizer(sentence):
    tokens = parser(sentence)
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tokens]
    tokens = [tok for tok in tokens if (tok not in stopwords and tok not in punctuations)]
    return tokens

# Create a vectorizer object that uses the custom spaCy tokenizer to generate feature vectors
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
classifier = LinearSVC()
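
As mentioned above, word vectors can stand in for the bag-of-words features. A minimal sketch of such a featurizer, assuming 300-dimensional GloVe vectors and unicode input; the class name MeanVectorizer is my own, not part of spaCy or sklearn:

import numpy as np

# Hypothetical alternative to CountVectorizer: represent each text by the
# mean of its tokens' word vectors (assumes 300-dimensional vectors)
class MeanVectorizer(TransformerMixin):
    def transform(self, X, **transform_params):
        return np.array([np.mean([tok.vector for tok in parser(text) if tok.has_vector]
                                 or [np.zeros(300)], axis=0)
                         for text in X])
    def fit(self, X, y=None, **fit_params):
        return self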

Now you can create the pipeline, load the data, and run the classification model.

# Create a pipeline for text cleaning, tokenization, vectorization, and classification
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])

# Load sample data
train = [('I love this sandwich.', 'pos'),
         ('this is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('this is my best work.', 'pos'),
         ("what an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('he is my sworn enemy!', 'neg'),
         ('my boss is horrible.', 'neg')]
test =  [('the beer was good.', 'pos'),
         ('I do not enjoy my job', 'neg'),
         ("I ain't feelin dandy today.", 'neg'),
         ("I feel amazing!", 'pos'),
         ('Gary is a good friend of mine.', 'pos'),
         ("I can't believe I'm doing this.", 'neg')]

# Fit the model and compute the accuracy
pipe.fit([x[0] for x in train], [x[1] for x in train])
pred_data = pipe.predict([x[0] for x in test])
for (sample, pred) in zip(test, pred_data):
    print sample, pred
print "Accuracy:", accuracy_score([x[1] for x in test], pred_data)

>> ('the beer was good.', 'pos') pos
   ('I do not enjoy my job', 'neg') neg
   ("I ain't feelin dandy today.", 'neg') neg
   ('I feel amazing!', 'pos') pos
   ('Gary is a good friend of mine.', 'pos') pos
   ("I can't believe I'm doing this.", 'neg') neg
   Accuracy: 1.0
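
The trained pipeline can then be applied to any new, unseen review (the example sentence below is my own):

# Classify a new, unseen review
print pipe.predict(['the room was spacious and the staff were friendly'])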

5. Comparison with other libraries

SpaCy is a very powerful, industrial-strength NLP package that covers most NLP tasks. You might be wondering: why is that?

Let’s compare spaCy with two other well-known NLP libraries, CoreNLP and NLTK.

Feature support

Feature                   spaCy   NLTK   CoreNLP
Easy installation         Y       Y      Y
Python API                Y       Y      N
Multilingual support      N       Y      Y
Tokenization              Y       Y      Y
Part-of-speech tagging    Y       Y      Y
Sentence segmentation     Y       Y      Y
Dependency parsing        Y       N      Y
Integrated word vectors   Y       N      N
Entity recognition        Y       Y      Y
Sentiment analysis        Y       Y      Y
Coreference resolution    N       N      Y

Speed: key functionality (tokenization, tagging, parsing)

Library    Tokenization   Tagging   Parsing
spaCy      0.2ms          1ms       19ms
CoreNLP    2ms            10ms      49ms
NLTK       4ms            443ms     n/a

Accuracy: entity extraction results

Library    Precision   Recall   F-Score
spaCy      0.72        0.65     0.69
CoreNLP    0.79        0.73     0.76
NLTK       0.51        0.65     0.58

Conclusion

This article discussed spaCy, a Python library that implements NLP end to end. We demonstrated spaCy’s usability, speed, and accuracy through a number of use cases, and finally compared it with two other well-known NLP libraries, CoreNLP and NLTK.

If you have followed along, you should be well equipped to tackle a wide range of challenging problems with text data and NLP.

I hope you enjoyed this article. If you have questions, feedback, or other thoughts, please leave them in the comments.

About the author:

Shivam Bansal

Shivam Bansal is a data scientist with extensive experience in NLP and machine learning. He is eager to learn and wants to solve some challenging analytical problems.

  • twitter.com/shivamshaz
  • www.linkedin.com/in/shivamba…
  • github.com/shivam5992

The Nuggets Translation Project is a community that translates high-quality technical articles from across the Internet, covering Android, iOS, React, front end, back end, product, design, and more. Follow the Nuggets Translation Project for more high-quality translations.