NLP in Python
Translation: Chen Zhiyan
This article is about 2,700 words and takes roughly six minutes to read.
Natural language processing is a difficult problem in data science. In this article, we will introduce an industrial-grade Python library.
Natural language processing (NLP) is one of the most interesting subfields of data science, and a growing number of data scientists want to develop solutions that involve unstructured text data. Despite this, many applied data scientists (mostly with STEM and social science backgrounds) still lack NLP experience.
In this article, I'll explore some basic NLP concepts and show how to implement them using the increasingly popular Python spaCy package. This article is suitable for NLP beginners, but it assumes the reader has some knowledge of Python.
Are you talking about spaCy?
spaCy is a relatively new package for "industrial-strength natural language processing in Python", created by Matt Honnibal at Explosion AI. It was designed primarily for applied data scientists, which means it doesn't burden users with decisions about which algorithm to use for common tasks, and it's incredibly fast (it's implemented in Cython). If you're familiar with the Python data science stack, spaCy is the numpy of NLP: it's reasonably low-level, but intuitive and performant.
So, what can it do?
spaCy provides a one-stop shop for the tasks commonly used in any NLP project, including:
- Tokenization
- Lemmatization
- Part-of-speech tagging
- Entity recognition
- Dependency parsing
- Sentence recognition
- Word-to-vector transformations
- Many convenient methods for cleaning and normalizing text
I’ll give you a high-level overview of these capabilities and show you how to access them using spaCy.
So let’s get started.
First, we load spaCy's pipeline, which by convention is stored in a variable named nlp. Declaring this variable takes a few seconds because spaCy loads its models and data up front to save time later. In effect, some of the heavy lifting is done ahead of time, so that parsing data with nlp is cheaper. Note that the language model we use here is English; there is also a fully featured German model, and tokenization is implemented across several languages (discussed below).
We call nlp on the sample text to create a Doc object. The Doc object is the container for NLP tasks on the text itself, on slices of the text (Span objects), and on its elements (Token objects). It is worth noting that Token and Span objects actually hold no data. Instead, they contain pointers to data in the Doc object and are evaluated lazily (that is, upon request). Most of spaCy's core functionality is accessed through the methods of the Doc (n=33), Span (n=29), and Token (n=78) objects.
In [1]: import spacy
   ...: nlp = spacy.load("en")
   ...: doc = nlp("The big grey dog ate all of the chocolate, but fortunately he wasn't sick!")
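As a quick illustration of this container structure (my own snippet, not from the original article), indexing a Doc returns Token objects and slicing it returns Span objects, both of which are views into the parent Doc rather than copies of the text:

token = doc[0]          # a Token: "The"
span = doc[0:3]         # a Span: "The big grey"
print(type(token), token.text)
print(type(span), span.text)
print(span.doc is doc)  # True: the Span points back into the same Doc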
Word segmentation (tokenization)
Tokenization is a basic step in many natural language processing tasks. It is the process of breaking a piece of text into words, punctuation marks, spaces, and other elements, thereby creating tokens. A naive way to do this is simply to split the string on whitespace:
In [2]: doc.text.split()
Out[2]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate,', 'but', 'fortunately', 'he', "wasn't", 'sick!']
On the face of it, splitting on whitespace works fine. Notice, however, that it disregards punctuation and does not separate the verb from the adverb ("was", "n't"). In other words, it is too naive to recognize the elements of the text that help us (and a machine) understand its structure and meaning. Let's look at how spaCy handles this:
In [3]: [token.orth_ for token in doc]
Out[3]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', ',', 'but', 'fortunately', 'he', 'was', "n't", ' ', 'sick', '!']
Here, we access each token's .orth_ method, which returns a string representation of the token rather than a spaCy Token object. This may not always be desirable, but it is worth noting. spaCy recognizes punctuation marks and is able to split them from word tokens. Many of spaCy's token methods offer both string and integer representations of the processed text: methods with an underscore suffix return strings, and methods without the suffix return integers. For example:
In [4]: [(token, token.orth_, token.orth) for token in doc]
Out[4]: [(The, 'The', 517), (big, 'big', 742), (grey, 'grey', 4623), (dog, 'dog', 1175), (ate, 'ate', 3469), (all, 'all', 516), (of, 'of', 471), (the, 'the', 466), (chocolate, 'chocolate', 3593), (,, ',', 416), (but, 'but', 494), (fortunately, 'fortunately', 15520), (he, 'he', 514), (was, 'was', 491), (n't, "n't", 479), ( , ' ', 483), (sick, 'sick', 1698), (!, '!', 495)]
In [5]: [token.orth_ for token in doc if not token.is_punct | token.is_space]
Out[5]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', 'but', 'fortunately', 'he', 'was', "n't", 'sick']
Cool, right?
Lemmatization
A task related to tokenization is lemmatization. Lemmatization is the process of reducing a word to its base form, its mother word. Differently used forms of a word often share the same root meaning. For example, practice, practiced, and practicing all essentially refer to the same thing. It is often desirable to standardize words with similar meanings to their base form. With spaCy, we can access each word's base form via a token's .lemma_ method:
In [6]: practice = "practice practiced practicing"
   ...: nlp_practice = nlp(practice)
   ...: [word.lemma_ for word in nlp_practice]
Out[6]: ['practice', 'practice', 'practice']
Why is this useful? An immediate use case is machine learning, especially text classification. For example, lemmatizing the text before creating a bag of words avoids word duplication and, consequently, allows the model to build a clearer picture of word-usage patterns across multiple documents.
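As a rough sketch of the idea (my own example, not from the original article), counting lemmas instead of surface forms collapses the inflected variants into a single feature:

from collections import Counter

# Count lemmas rather than raw tokens, skipping punctuation and whitespace,
# so "practiced", "practicing", and "practice" all map to one feature.
bow = Counter(
    token.lemma_
    for token in nlp("She practiced daily, and practicing made the practice easy.")
    if not (token.is_punct or token.is_space)
)
print(bow)  # e.g. Counter({'practice': 3, 'daily': 1, ...})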
POS Tagging
Part-of-speech tagging is the process of assigning grammatical attributes (such as nouns, verbs, adverbs, adjectives, etc.) to words. Words that share the same part-of-speech markers tend to follow similar syntactic structures, which can be useful in rule-based processing.
For example, in a given description of an event, we might want to determine who owns what. We can do this by exploiting possessives (provided the text is grammatically sound). spaCy uses the popular Penn Treebank POS tags. With spaCy, the coarse-grained and fine-grained POS tags can be accessed with the .pos_ and .tag_ methods, respectively. Here, I access the fine-grained POS tags:
In [7]: doc2 = nlp("Conor's dog's toy was hidden under the man's sofa in the woman's house")
   ...: pos_tags = [(i, i.tag_) for i in doc2]
   ...: pos_tags
Out[7]:
[(Conor, 'NNP'),
 ('s, 'POS'),
 (dog, 'NN'),
 ('s, 'POS'),
 (toy, 'NN'),
 (was, 'VBD'),
 (hidden, 'VBN'),
 (under, 'IN'),
 (the, 'DT'),
 (man, 'NN'),
 ('s, 'POS'),
 (sofa, 'NN'),
 (in, 'IN'),
 (the, 'DT'),
 (woman, 'NN'),
 ('s, 'POS'),
 (house, 'NN')]
We can see that the 's tokens are tagged POS. We can exploit this tag to extract each owner and the thing they own:
In [8]: owners_possessions = []
   ...: for i in pos_tags:
   ...:     if i[1] == "POS":
   ...:         owner = i[0].nbor(-1)
   ...:         possession = i[0].nbor(1)
   ...:         owners_possessions.append((owner, possession))
   ...:
   ...: owners_possessions
Out[8]: [(Conor, dog), (dog, toy), (man, sofa), (woman, house)]
This returns a list of owner-possession tuples. If you want to be super Pythonic about it, you can do the same thing in a list comprehension (which, I think, is preferable!):
In [9]: [(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"]
Out[9]: [(Conor, dog), (dog, toy), (man, sofa), (woman, house)]
In this case, we use each token's .nbor method, which returns the token's neighboring token.
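As an aside, the dependency parse listed among spaCy's capabilities above offers a more robust alternative to raw neighbor lookups. Here is a minimal taste (my own snippet, reusing doc2 from above; the exact labels vary with the spaCy model and version):

# Each token's text, its dependency label, and its syntactic head.
print([(token.orth_, token.dep_, token.head.orth_) for token in doc2])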
Entity recognition
Entity recognition is the process of classifying named entities found in a text into predefined categories, such as persons, places, organizations, dates, and so on. spaCy uses a statistical model to classify a broad range of entities, including persons, events, works of art, and nationalities/religions (see the documentation for the full list).
For example, let's take the first two sentences from Barack Obama's Wikipedia entry. We will parse this text and then access the identified entities using the Doc object's .ents method. On each entity, we can then access additional label methods, specifically .label_ and .label:
In [10]: wiki_obama = """Barack Obama is an American politician who served as
    ...: the 44th President of the United States from 2009 to 2017. He is the first
    ...: African American to have served as president,
    ...: as well as the first born outside the contiguous United States."""
    ...:
    ...: nlp_obama = nlp(wiki_obama)
    ...: [(i, i.label_, i.label) for i in nlp_obama.ents]
Out[10]: [(Barack Obama, 'PERSON', 346), (American, 'NORP', 347), (the United States, 'GPE', 350), (2009 to 2017, 'DATE', 356), (first, 'ORDINAL', 361), (African, 'NORP', 347), (American, 'NORP', 347), (first, 'ORDINAL', 361), (United States, 'GPE', 350)]
You can see which entities the model identifies and how accurate they are in this case. PERSON is self-explanatory; NORP is a nationality or religious group; GPE identifies locations (cities, countries, etc.); DATE identifies a specific date or range of dates; and ORDINAL identifies a word or number indicating some kind of order.
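Because each entity carries its label, keeping only the entity types you care about is straightforward. Here is a minimal sketch (my own, not from the original article):

# Keep only the people and geopolitical entities (GPEs) from the text above.
people_and_places = [
    (ent.text, ent.label_)
    for ent in nlp_obama.ents
    if ent.label_ in ("PERSON", "GPE")
]
print(people_and_places)
# e.g. [('Barack Obama', 'PERSON'), ('the United States', 'GPE'), ('United States', 'GPE')]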
While we're on the subject of Doc methods, it's worth mentioning spaCy's sentence identifier. It is not uncommon in NLP tasks to want to split a document into sentences. With spaCy, this is easy to do by accessing the Doc object's .sents property:
In [11]: for ix, sent in enumerate(nlp_obama.sents, 1):
    ...:     print("Sentence number {}: {}".format(ix, sent))
Sentence number 1: Barack Obama is an American politician who served as the 44th President of the United States from 2009 to 2017.
Sentence number 2: He is the first African American to have served as president, as well as the first born outside the contiguous United States.
That's all for now. In future articles, I'll show how to use spaCy in complex data-mining and machine-learning tasks.
Original link: https://dzone.com/articles/nlp-in-python
About the translator
Chen Zhiyan graduated from Beijing Jiaotong University with a master's degree in communication and control engineering. He has served as an engineer at Great Wall Computer Software and System Company and at Datang Microelectronics Company, and currently works in technical support at Beijing Wuyichaoqun Technology Co., Ltd. He is engaged in the operation and maintenance of an intelligent translation teaching system and has accumulated experience in deep learning and natural language processing (NLP). In his spare time he enjoys translation and writing; his translated works include IEC-ISO 7816, the Iraqi Oil Engineering Project, and the Declaration of New Fiscal Doctrine, the last of which, translated from Chinese to English, was published in GLOBAL TIMES. He joined the translation volunteer group of the THU Data platform in his spare time, hoping to communicate and share with everyone and make progress together.
Translation group recruitment information
Job description: We need careful minds to translate good foreign articles into fluent Chinese. If you are a data science/statistics/computer science student studying abroad, work overseas in a related field, or are confident in your foreign-language skills, you are welcome to join the translation team.
What you get: regular translation training to improve volunteers' translation skills and deepen their understanding of the frontiers of data science; overseas friends can stay in touch with the development of technology applications at home; and the industry-university-research background of THU Data offers good development opportunities for volunteers.
Other benefits: your teammates in the translation group will be data scientists from well-known companies and students from Peking University, Tsinghua University, and overseas universities.