This is the first day of my participation in the November Gengwen Challenge. Event details: The Last Gengwen Challenge of 2021.
1. Why tokenization?
Natural Language Processing (NLP) is the field of technology that studies how to make computers understand human language.
Strictly speaking, a computer cannot understand human language at all; it can only calculate. But through calculation it can make you feel as if it understands.
- For example: single = 1, double = 2. Faced with "single" and "double", the computer understands the doubling relationship.
- Another example: praise = 1, criticism = 0. When the computer encounters 0.5, it knows the text is "part praise, part criticism".
- One more: queen = {1, 1}, woman = {1, 0}, king = {0, 1}. The computer can work out that "woman" plus "king" equals "queen", as sketched below.
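A minimal sketch of that last bullet in plain Python; the two-dimensional codes are invented purely for illustration, not taken from any trained model:

# Toy 2-dimensional codes; the numbers are made up for illustration only.
woman = [1, 0]
king = [0, 1]
queen = [1, 1]

# Adding the codes element by element reproduces the code for "queen".
combined = [w + k for w, k in zip(woman, king)]
print(combined == queen)  # True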
You see, when it comes to words, it’s all about numbers.
So, how to convert text into numbers is the most basic step in NLP.
Fortunately, the TensorFlow framework provides a very useful class, Tokenizer, for this purpose.
2. Tokenizer
So let's say we have a batch of text, and we want to feed it into the tokenizer and turn it into numbers.
The text format is as follows:
corpus = ["I love cat", "I love dog", "I love you too"]
2.1 Constructing a tokenizer
Just as you need to find a partner before you can go on a date, to use a tokenizer you first have to build one.
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
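As a side note, the constructor also accepts a few optional arguments. A minimal sketch (num_words and oov_token are real Tokenizer parameters; the values here are only examples):

from tensorflow.keras.preprocessing.text import Tokenizer

# num_words: keep only the most frequent num_words - 1 words when serializing.
# oov_token: a placeholder for words not seen during fitting (discussed later under OOV).
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')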
2.2 Fitting the text: fit_on_texts
The tokenizer's fit_on_texts method can be called to fit it on the text.
tokenizer.fit_on_texts(corpus)
Once the tokenizer has eaten and fitted the text data, it goes from knowing nothing about the text to knowing it like the back of its hand:
["I love cat", "I love dog", "I love you too"]
- tokenizer.document_count records how many texts have been processed; here its value is 3, meaning three texts were processed.
- tokenizer.word_index registers every word and assigns each one an index: {'i': 1, 'love': 2, 'cat': 3, 'dog': 4, 'you': 5, 'too': 6}. cat is 3, and 3 is cat.
- tokenizer.index_word is the inverse of word_index: {1: 'i', 2: 'love', 3: 'cat', 4: 'dog', 5: 'you', 6: 'too'}. 1 is i, and i is denoted by 1.
- tokenizer.word_docs records how many of the texts each word appears in: {'i': 3, 'love': 3, 'cat': 1, 'dog': 1, 'you': 1, 'too': 1}. For example, 'i' appears in all three texts.
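A quick way to check these attributes yourself after fitting, reusing the same corpus (a minimal sketch):

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["I love cat", "I love dog", "I love you too"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

print(tokenizer.document_count)   # 3
print(tokenizer.word_index)       # {'i': 1, 'love': 2, 'cat': 3, 'dog': 4, 'you': 5, 'too': 6}
print(tokenizer.index_word)       # {1: 'i', 2: 'love', 3: 'cat', 4: 'dog', 5: 'you', 6: 'too'}
print(dict(tokenizer.word_docs))  # how many of the texts each word appears in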
Extending knowledge: case and punctuation
"I love cat", "I love Cat", and "I love cat!" all give the same result after fit_on_texts.
This shows that the processing ignores the case of English letters as well as English punctuation marks.
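A small check of that behaviour (a minimal sketch; lowercasing and punctuation filtering are the Tokenizer's defaults, controlled by its lower and filters arguments):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()  # lower=True and a punctuation filter are the defaults
tokenizer.fit_on_texts(["I love cat", "I love Cat", "I love cat!"])
print(tokenizer.word_index)  # {'i': 1, 'love': 2, 'cat': 3}  (case and '!' are ignored)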
2.3 Text serialization: texts_to_sequences
Although the text has been fitted, so far the words have only been indexed and counted; the text itself has not yet been turned into numbers.
At this point we can call the tokenizer's texts_to_sequences method to serialize the text into numbers.
input_sequences = tokenizer.texts_to_sequences(corpus)
["I love cat", "I love dog", "I love you too"]
After serialization, the text becomes the list of numbers [[1, 2, 3], [1, 2, 4], [1, 2, 5, 6]].
Extended knowledge: words outside the corpus (OOV)
Text can be serialized based on the fact that each word has a number.
| 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| i | love | cat | dog | you | too |
I love you -> 1 2 5
But what happens when we come across words the tokenizer has never seen before, as in "I do not love cat"?
And the odds of that happening are pretty high.
For example, suppose you build a movie sentiment analysis model and train a neural network on 20,000 positive and negative movie reviews. When you want to test it, you type in an arbitrary new review; that new text is very likely to contain words the tokenizer has never seen, and yet it still has to be serialized before the model can make a prediction.
Let's first see what happens:
corpus = ["I love cat"."I love dog"."I love you too"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
# tokenizer.index_word: {1: 'i', 2: 'love', 3: 'cat', 4: 'dog', 5: 'you', 6: 'too'}
input_sequences = tokenizer.texts_to_sequences(["I do not love cat"])
# input_sequences: [[1, 2, 3]]
Judging from the result, the unknown words were simply ignored: "I do not love cat" ends up identical to "I love cat", because "do" and "not" were never recorded.
But what if we don't want them to be ignored?
It's as simple as passing oov_token='<OOV>' when constructing the Tokenizer.
What does OOV stand for? In natural language processing we usually work with a vocabulary, which is derived from the training data set. That vocabulary is of course finite. When a new data set arrives later and contains words that are not in the existing vocabulary, we call those words out-of-vocabulary, or OOV.
So, as long as the tokenizer is built with Tokenizer(oov_token='<OOV>'), it will reserve an index for marking out-of-vocabulary words.
Now let's see what happens:
corpus = ["I love cat"."I love dog"."I love you too"]
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(corpus)
# tokenizer.index_word: {1:'<OOV>',2:'i',3:'love',4:'cat',5:'dog',6:'you',7:'too'}
input_sequences = tokenizer.texts_to_sequences(["I do not love cat"])
# input_sequences: [[2, 1, 1, 3, 4]]
From the result, "I do not love cat" is now serialized to [2, 1, 1, 3, 4]: the unknown words "do" and "not" are both mapped to index 1, the <OOV> marker.
2.4 Sequence padding: pad_sequences
["I love cat", "I love dog", "I love you too"] has been serialized to [[1, 2, 3], [1, 2, 4], [1, 2, 5, 6]].
Converting text into numbers looks finished at this point, but in fact there is one more step.
Imagine two bundles of pencils: in one the pencils are all different lengths, in the other they are all the same length. Which is easier to handle, whether for storage or for transport?
From everyday experience, the uniform bundle is clearly easier to deal with. Because everything has the same length, you never have to worry about the differences; 50 items or 50,000, it is just a simple multiple.
The computer is just like you: the sequences in [[1, 2, 3], [1, 2, 4], [1, 2, 5, 6]] have different lengths, and it expects you to unify them to a single length.
TensorFlow has this covered: it provides the pad_sequences method for exactly this purpose.
from tensorflow.keras.preprocessing.sequence import pad_sequences
sequences = [[1, 2, 3], [1, 2, 4], [1, 2, 5, 6]]
sequences = pad_sequences(sequences)
# sequences:[[0, 1, 2, 3],[0, 1, 2, 4],[1, 2, 5, 6]]
When you pass in sequence data, it takes the longest sequence as the standard length and prepends zeros to the shorter ones, so that every sequence ends up with the same length.
A uniform length sequence is what NLP wants.
Highlight: in all sorts of places you will see vocab_size = len(tokenizer.word_index) + 1, i.e. total vocabulary size = all the words in the text + 1. That extra 1 is for the padding value 0: there is no such word in the data set, and its only purpose is to fill sequences out to a uniform length. Note that <OOV> is different: it is a real token, standing for out-of-vocabulary words.
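Using the tokenizer built with oov_token='<OOV>' above, the calculation looks like this (a minimal sketch; a typical use is passing vocab_size as the input dimension of an embedding layer):

# tokenizer.word_index: {'<OOV>': 1, 'i': 2, 'love': 3, 'cat': 4, 'dog': 5, 'you': 6, 'too': 7}
vocab_size = len(tokenizer.word_index) + 1  # + 1 reserves index 0 for padding
print(vocab_size)  # 8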
Extended knowledge: customizing the padding
Padding is handled differently in different scenarios.
Above, the zeros were added at the front, but sometimes you want to add them at the back.
sequences = [[1, 2, 3], [1, 2, 4], [1, 2, 5, 6]]
sequences=pad_sequences(sequences, padding="post")
# sequences:[[1, 2, 3, 0],[1, 2, 4, 0],[1, 2, 5, 6]]
The padding argument sets where the zeros go: padding="post" appends them at the end, while padding="pre" (the default) puts them at the front.
Another very common scenario is cropping to a fixed length.
As an example, look at the following sequences:
[[2, 3], [1, 2], [3, 2, 1], [1, 2, 5, 6, 7, 8, 9, 9, 9, 1]]
There are four sequences, with lengths 2, 2, 3, and 10. If you simply pad them, every sequence is filled out to length 10:
- [2, 3, 0, 0, 0, 0, 0, 0, 0, 0]
- [1, 2, 0, 0, 0, 0, 0, 0, 0, 0]
- [3, 2, 1, 0, 0, 0, 0, 0, 0, 0]
- [1, 2, 5, 6, 7, 8, 9, 9, 9, 1]
In fact, this isn't necessary: one unusually long sequence shouldn't force that much redundancy onto the whole data set.
Therefore, padding needs to be combined with truncation.
We would rather pad or truncate every sequence to a length of 5, which balances the interests of all the sequences.
The code is as follows:
sequences = [[2, 3], [1, 2], [3, 2, 1], [1, 2, 5, 6, 7, 8, 9, 9, 9, 1]]
sequences = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')
# sequences: [[2, 3, 0, 0, 0], [1, 2, 0, 0, 0], [3, 2, 1, 0, 0], [1, 2, 5, 6, 7]]
Two new parameters appear here. One is maxlen=5, the maximum length allowed for a sequence. The other is truncating='post', which means cutting from the back ('pre' cuts from the front).
This code says: whatever sequence comes in, give me data of length 5; pad with zeros if it is too short, and throw away the excess if it is too long.
In practice this is the most commonly used approach. We can control the format of the training sequences and train a model on them, but at prediction time the input format can be anything. Say we trained the model on sequences of length 100; if a user then enters a 10,000-word text, the model cannot handle it. So whether the user enters 10,000 words or a single word, the input should be converted to the training length.
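A minimal sketch of that idea at prediction time; the helper name prepare_input and the length of 100 are illustrative assumptions, not from the original text:

from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 100  # must match the sequence length used during training (illustrative value)

def prepare_input(text, tokenizer, max_len=MAX_LEN):
    # Serialize with the tokenizer that was fitted on the training corpus,
    # then pad or truncate to the fixed training length.
    seq = tokenizer.texts_to_sequences([text])
    return pad_sequences(seq, maxlen=max_len, padding='post', truncating='post')

# Whether the user types one word or 10,000 words, the model always sees length 100:
# model.predict(prepare_input("some brand new review text", tokenizer))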
3. Chinese word segmentation
English has spaces to separate words.
I love cat.
There are three words in it: I, love and cat.
However, Chinese has no separator between words.
我喜欢猫 ("I like cats.")
How many words does it contain?
That's the awkward part.
But NLP has to be built on words.
Chinese word segmentation is a big subject in its own right, and third-party libraries are generally used for it.
For example, the jieba ("stutter") library is a popular way to do the splitting.
3.1 Installing and using jieba
The code is compatible with Python 2/3.
- Fully automatic installation: easy_install jieba, pip install jieba, or pip3 install jieba
- Semi-automatic installation: first download it from pypi.python.org/pypi/jieba/, decompress it, and run python setup.py install
- Manual installation: download the code and place the jieba directory in the current directory or in site-packages
- Then reference it with import jieba
Pay attention to how "北京大学" (Peking University) is handled below:
import jieba
sentence = "/".join(jieba.cut("欢迎来到北京大学食堂"))  # "Welcome to the Peking University cafeteria."
print(sentence)   # 欢迎/来到/北京大学/食堂
sentence2 = "/".join(jieba.cut("欢迎来到北京大学生志愿者中心"))  # "Welcome to the Beijing university student volunteer center."
print(sentence2)  # 欢迎/来到/北京/大学生/志愿者/中心
Notice that jieba keeps 北京大学 (Peking University) together in the first sentence, but splits the same characters into 北京 (Beijing) and 大学生 (university students) in the second. Chinese natural language processing just begins with this extra step of splitting text into words; after that, everything works the same as before.
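To tie the two halves together, here is a minimal sketch of the usual pipeline for Chinese text, assuming we segment with jieba first, rejoin the words with spaces, and then reuse everything from section 2 unchanged (the two sample sentences are just illustrative):

import jieba
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Segment each Chinese sentence with jieba and rejoin the words with spaces,
# so the Keras Tokenizer can split on spaces exactly as it does for English.
raw_corpus = ["我喜欢猫", "我也喜欢狗"]
corpus = [" ".join(jieba.cut(text)) for text in raw_corpus]

tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(corpus)
sequences = pad_sequences(tokenizer.texts_to_sequences(corpus), padding='post')
print(tokenizer.word_index)
print(sequences)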