“This is the sixth day of my participation in the November Gengwen Challenge. Check out the details: The Last Gengwen Challenge of 2021.”

Reference notes: Nuggets - NLP preprocessing techniques

Following that framework and my own study, I have expanded the corresponding feature extraction content.

1. Feature extraction

To train models better, we need to transform the raw features of text into concrete features, mainly in two ways: statistics and Embedding.

Raw features: need to be transformed by humans or machines, e.g. text and images. Concrete features: already organized and analyzed by humans and directly usable, e.g. an object's importance or size.

1.1 Statistics

  • Term frequency (TF): the frequency with which a given word appears in a document; it needs to be normalized to avoid a bias toward long documents
  • Inverse document frequency (IDF): a measure of a word's general importance, obtained by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the quotient

Then, each word will get a TF-IDF value to measure its importance. The calculation formula is as follows:
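
The original formula image is not reproduced here; for reference, the classic definition (with N the total number of documents and n_t the number of documents containing term t) is:

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\times \mathrm{idf}(t),\qquad \mathrm{idf}(t)=\log\frac{N}{n_t}$$

Note that sklearn's TfidfVectorizer with smooth_idf=True uses the smoothed variant $\mathrm{idf}(t)=\ln\frac{1+N}{1+n_t}+1$, which avoids division by zero for unseen terms.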

See sklearn's TfidfVectorizer for details.

1) The following code uses sklearn's TF-IDF implementation for feature extraction

# Use TFIDF algorithm of SkLearn for feature extraction
import jieba
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer,CountVectorizer

corpus = ["Term frequency, which is the frequency with which a given word appears in a document, needs to be normalized to avoid bias towards long documents.",
          "Inverse document frequency, a measure of the general importance of a word, is obtained by dividing the total number of documents by the number of documents containing the word."]
corpus_list = []
for corpu in corpus:
    corpus_list.append(" ".join(jieba.cut_for_search(corpu)))
print("\n corpus size: {}\n{}".format(len(corpus_list),corpus_list))

vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, norm=None)
tfidf = vectorizer.fit_transform(corpus_list)
weight = tfidf.toarray()
vocab = vectorizer.get_feature_names()
print("\n Vocabulary size: {}\n{}".format(len(vocab),vocab))
print("\n Weight shape: {}\n{}".format(weight.shape,weight))
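
As a quick sanity check (my own follow-up, assuming the vocab and weight arrays produced above), you can rank each document's terms by their TF-IDF weight:

# Rank each document's terms by TF-IDF weight (highest first) - illustrative follow-up
for doc_id, doc_weights in enumerate(weight):
    ranked = sorted(zip(vocab, doc_weights), key=lambda x: x[1], reverse=True)
    print("Document {} top terms: {}".format(doc_id, ranked[:5]))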

2) Use the TF-IDF algorithm in jieba to extract keywords

import jieba
import jieba.analyse

sentences = ['Cerana chinensis is native to China and is an indigenous bee species. It adapts to the climate and nectar source conditions in all parts of China and is suitable for fixed breeding and stable production, especially in the southern mountains, where its position cannot be replaced by other bee species.', 'The Oriental honeybee, or "eastern bee" for short, is a medium-sized honeybee species distributed in China, Iran, Japan, Korea and other Asian countries as well as the Far East of Russia. This variety has strong individual cold resistance and adapts to the southern winter nectar source.']
seg_list = []
for sentence in sentences:
    seg_list.append(" ".join(jieba.cut(sentence, cut_all=True)))
print("\n Full-mode segmentation list size: {}\n{}".format(len(seg_list), seg_list))

keywords = jieba.analyse.extract_tags(sentences[0], topK=20, withWeight=True, allowPOS=('n', 'nr', 'ns'))
print("\n Keywords size: {}\n{}".format(len(keywords), keywords))
keywords = jieba.analyse.extract_tags(sentences[1], topK=20, withWeight=True, allowPOS=('n', 'nr', 'ns'))
print("\n Keywords size: {}\n{}".format(len(keywords), keywords))

1.2 Embedding – Word2vec practice

Embedding means that words are embedded into a space formed by the hidden-layer weights of a neural network, so that words with similar semantics are close to each other in that space. Word2vec is the representative method in this field. The general network structure is as follows:

The input layer is the one-hot encoded word, the hidden layer has the Embedding dimension we want, and the output layer is the prediction over the corpus. The network is trained iteratively so that the predictions get closer and closer to the real results, until convergence. The hidden-layer weights then give us the word encodings: dense word vectors that contain semantic information and can be used as input to downstream models.
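
To make the idea concrete, here is a minimal numpy sketch of that forward pass (a skip-gram style toy of my own, not gensim's actual implementation; the tiny vocabulary and random weights are made up):

# Toy forward pass: one-hot word -> hidden embedding -> softmax over the vocabulary
import numpy as np

vocab = ["bee", "honey", "china", "winter", "nectar"]
V, d = len(vocab), 3                  # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))        # input->hidden weights: one row per word = the embedding
W_out = rng.normal(size=(d, V))       # hidden->output weights

def forward(word_idx):
    x = np.zeros(V)
    x[word_idx] = 1.0                 # one-hot input
    h = x @ W_in                      # hidden layer activation = the word's embedding row
    scores = h @ W_out                # scores for every word in the vocabulary
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax: predicted context distribution
    return h, probs

h, probs = forward(vocab.index("bee"))
print("embedding of 'bee':", h)
print("predicted context distribution:", probs.round(3))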

Note: some of the referenced material was written for older gensim versions and its code no longer runs; follow the gensim version used in this tutorial to make sure the code runs end to end.

[1] : getting-started-with-word2vec-and-glove-in-python

[2] : Python | Gensim training word2vec and related functions

[3] : Word2vec used in Gensim

[4] : Word2vec used in Gensim

1.2.1 Creating and training Word2vec on a self-built dataset

import gensim
print("Gensim version:",gensim.__version__)

# Gensim version: 3.8.3

Gensim is a powerful natural language processing tool that includes many common models:

Basic corpus processing tools, LSI, LDA, HDP, DTM, DIM, TF-IDF, word2vec, and paragraph2vec
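
As a quick taste of the basic corpus processing tools and TF-IDF in gensim (a toy two-document corpus of my own; Dictionary, doc2bow and TfidfModel are standard gensim APIs):

# Minimal gensim TF-IDF sketch on a toy corpus
from gensim import corpora, models

texts = [["bee", "honey", "china"], ["bee", "winter", "nectar"]]
dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]    # bag-of-words counts
tfidf_model = models.TfidfModel(bow_corpus)            # fits IDF statistics
print(tfidf_model[bow_corpus[0]])                      # TF-IDF weights for document 0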

Method 1: the simplest way to train (fast)

# The simplest way to train - one-step training
# Import word2vec
from gensim.models import word2vec

# Import the dataset
sentences = ['Cerana chinensis is native to China. It is adapted to the climate and nectar source conditions in all parts of China. It is suitable for fixed breeding and stable production, especially in the southern mountainous areas, and has an irreplaceable position among bee species.', 'Oriental honeybee is a medium species of honeybee, distributed in China, Iran, Japan, Korea and other Asian countries as well as the Far East of Russia. This variety has strong cold resistance and adapts to the southern winter nectar source.']
seg_list = []
for sentence in sentences:
    seg_list.append(" ".join(jieba.cut(sentence, cut_all=True)))

# Split into word lists
sentences = [s.split() for s in seg_list]

# Build the model
model = word2vec.Word2Vec(sentences, min_count=1, size=100)
"""Word2Vec arguments
min_count: corpora of different sizes need different minimum word frequencies. For example, in a larger
    corpus we may want to ignore words that occur only once or twice, which we control with min_count.
    A reasonable value is usually between 0 and 100.
size: sets the dimensionality of the word vectors; the default in Word2Vec is 100. Larger values need
    more training data but can improve overall accuracy; reasonable settings range from tens to hundreds.
workers: sets the number of threads used for training, effective only if Cython is installed.
"""
# Compare the similarity of two words
model.wv.similarity('Oriental', 'China')

Method 2: staged training (flexible)

# Import the dataset
sentences = ['Cerana chinensis is native to China. It is adapted to the climate and nectar source conditions in all parts of China. It is suitable for fixed breeding and stable production, especially in the southern mountainous areas, and has an irreplaceable position among bee species.', 'Oriental honeybee is a medium species of honeybee, distributed in China, Iran, Japan, Korea and other Asian countries as well as the Far East of Russia. This variety has strong cold resistance and adapts to the southern winter nectar source.']
seg_list = []
for sentence in sentences:
    seg_list.append(" ".join(jieba.cut(sentence, cut_all=True)))

# Split into word lists
sentences = [s.split() for s in seg_list]

# Start with an empty model
new_model = gensim.models.Word2Vec(min_count=1)

# Build the vocabulary
new_model.build_vocab(sentences)

# Train the word2vec model
new_model.train(sentences, total_examples=new_model.corpus_count, epochs=new_model.epochs)

# Compare the similarity of two words
new_model.wv.similarity('Oriental', 'China')

Another benefit of staged training is that it supports incremental training.

# Incremental training
# temp_path is the path of a previously saved full model
old_model = gensim.models.Word2Vec.load(temp_path)
# old_model = new_model
more_sentences = [['Northeast', 'black bee', 'distributed', 'in', 'China', 'Heilongjiang Province', 'Raohe County', ',',
                   'it', 'is', 'the', 'only', 'fine', 'local', 'bee species', 'in', 'China', ',', 'bred', 'through',
                   'natural selection', 'and', 'artificial selection', 'in', 'a', 'closed', 'and', 'superior', 'natural environment', '.']]
old_model.build_vocab(more_sentences, update=True)
old_model.train(more_sentences, total_examples=old_model.corpus_count, epochs=old_model.epochs)
# Compare the similarity of two words
old_model.wv.similarity('Oriental', 'China')

1.2.2 Training Word2vec from an external corpus

Text8 download address
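
If you prefer to fetch it from code, here is a minimal download-and-unzip sketch using only the standard library (assuming network access; the zip is roughly 30 MB):

# Download and unpack text8 if it is not already present
import os
import urllib.request
import zipfile

if not os.path.exists("text8"):
    urllib.request.urlretrieve("http://mattmahoney.net/dc/text8.zip", "text8.zip")
    with zipfile.ZipFile("text8.zip") as zf:
        zf.extractall(".")  # produces the plain-text file ./text8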

Method 1: load the corpus

# Import the external corpus [text8]: http://mattmahoney.net/dc/text8.zip
sentences = word2vec.Text8Corpus('./text8')
model = word2vec.Word2Vec(sentences, size=200)
# For reference, this is roughly how gensim's Text8Corpus streams the corpus
from gensim import utils
MAX_WORDS_IN_BATCH = 10000  # gensim's default maximum words per batch

flag = False
if flag:
    class Text8Corpus(object):
        """Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip ."""
        def __init__(self, fname, max_sentence_length=MAX_WORDS_IN_BATCH):
            self.fname = fname
            self.max_sentence_length = max_sentence_length

        def __iter__(self):
            # the entire corpus is one gigantic line -- there are no sentence marks at all
            # so just split the sequence of tokens arbitrarily: 1 sentence = 1000 tokens
            sentence, rest = [], b''
            with utils.smart_open(self.fname) as fin:
                while True:
                    text = rest + fin.read(8192)  # avoid loading the entire file (=1 line) into RAM
                    if text == rest:  # EOF
                        words = utils.to_unicode(text).split()
                        sentence.extend(words)  # return the last chunk of words, too (may be shorter/longer)
                        if sentence:
                            yield sentence
                        break
                    last_token = text.rfind(b' ')  # last token may have been split in two... keep for next iteration
                    words, rest = (utils.to_unicode(text[:last_token]).split(),
                                   text[last_token:].strip()) if last_token >= 0 else ([], text)
                    sentence.extend(words)
                    while len(sentence) >= self.max_sentence_length:
                        yield sentence[:self.max_sentence_length]
                        sentence = sentence[self.max_sentence_length:]
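
The Text8Corpus class above exists because text8 is one giant line with no sentence boundaries. For an ordinary corpus with one whitespace-tokenized sentence per line, gensim's built-in LineSentence iterator streams it in the same way (the file name here is hypothetical):

# './my_corpus.txt' is a hypothetical file with one tokenized sentence per line
from gensim.models import word2vec

sentences = word2vec.LineSentence('./my_corpus.txt')
model_line = word2vec.Word2Vec(sentences, size=200, min_count=5)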

Method 2: load a saved model file

# Make sure the .npy files holding the vectors are present alongside the model file
model_normal = gensim.models.KeyedVectors.load('text.model')
model_binary = gensim.models.KeyedVectors.load_word2vec_format('text.model.bin', binary=True)

1.2.3 Saving and loading word2vec in two formats

# Normal save
model.wv.save('text.model')
# model = Word2Vec.load('text8.model')
model_normal = gensim.models.KeyedVectors.load('text.model')

# binary save
model.wv.save_word2vec_format('text.model.bin', binary=True)
# model = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True)
model_binary = gensim.models.KeyedVectors.load_word2vec_format('text.model.bin', binary=True)
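
Note that model.wv.save and save_word2vec_format keep only the word vectors and drop the training state, so the result can be queried but not trained further. To continue training later (as in the incremental example above), save and load the full model; a short sketch:

# Save the full model, including vocabulary and training state
model.save('text8.full.model')
full_model = gensim.models.Word2Vec.load('text8.full.model')
# full_model.train(...) can then continue training on new sentences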

1.2.4 How to use word2vec? Train it for a thousand days, use it in one moment

# Similarity comparison
model.wv.similarity('Xiao Zhan', 'Wang Yibo'), model.wv.similarity('Xiao Zhan', 'Zhang Yixing'), model.wv.similarity('Wang Yibo', 'Zhang Yixing')

# List the most similar words
model.wv.most_similar(positive=['company'], topn=10)

# Find the word that does not belong
model.wv.doesnt_match("breakfast lunch supper potato".split())

# Compare the similarity between two lists of words
model.wv.n_similarity(['emperor', 'king', 'zhen', 'son of heaven'], ['your majesty'])

# Get the word vector
model.wv["Chongqing"]

# Get the vocabulary
model.wv.vocab.keys()
vocab = model.wv.index2word[:100]

1.2.5 Word2vec and deep learning frameworks

How does word2vec combine with neural networks?

  • TensorFlow version
  • PyTorch version

Resources: zhuanlan.zhihu.com/p/210808209

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# Take a trained word2vec model and preprocess the text for TensorFlow
def w2v_model_preprocessing(content, w2v_model, embedding_dim, max_len=32):
    # Initialize the [word: index] dictionary
    word2idx = {"_PAD": 0}
    # Build the vocabulary from the training data
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(content)
    vocab_size = len(tokenizer.word_index)  # Thesaurus size
    print(tokenizer.word_index)
    error_count = 0
    # Embedding matrix holding all word2vec vectors, with one extra all-zero row reserved for padding
    embedding_matrix = np.zeros((vocab_size + 1, w2v_model.vector_size))
    print(embedding_matrix.shape)
    for word, i in tokenizer.word_index.items():
        if word in w2v_model.wv:
            embedding_matrix[i] = w2v_model.wv[word]
        else:
            error_count += 1
    # Convert the texts to index sequences and truncate/pad them to max_len
    seq = tokenizer.texts_to_sequences(content)
    trainseq = pad_sequences(seq, maxlen=max_len, padding='post')
    return embedding_matrix,trainseq


# From raw text to word2vec inputs usable by TensorFlow
sentences = ['Cerana chinensis is native to China. It is adapted to the climate and nectar source conditions in all parts of China. It is suitable for fixed breeding and stable production, especially in the southern mountainous areas, and has an irreplaceable position among bee species.', 'Oriental honeybee is a medium species of honeybee, distributed in China, Iran, Japan, Korea and other Asian countries as well as the Far East of Russia. This variety has strong cold resistance and adapts to the southern winter nectar source.']
seg_list = []
for sentence in sentences:
    seg_list.append(" ".join(jieba.cut(sentence, cut_all=True)))
sentences = [s.split() for s in seg_list]

# Some hyperparameters
max_len = 64
embedding_dim = model.vector_size

embedding_matrix,train_data = w2v_model_preprocessing(sentences,model,embedding_dim,max_len)
embedding_matrix.shape,train_data.shape
from tensorflow.keras.models import Sequential,Model
from tensorflow.keras.models import load_model
from tensorflow.keras.layers import Dense,Dropout,Activation,Input, Lambda, Reshape,concatenate
from tensorflow.keras.layers import Embedding,Conv1D,MaxPooling1D,GlobalMaxPooling1D,Flatten,BatchNormalization
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

def build_textcnn(max_len, embeddings_dim, embeddings_matrix):
    # Build the TextCNN model
    main_input = Input(shape=(max_len,), dtype='float64')
    # Word embedding (using the pre-trained word vectors)
    embedder = Embedding(
                         len(embeddings_matrix),       # number of words kept from the corpus (input dimension)
                         embeddings_dim,               # size of the vector space the words are embedded into
                         input_length=max_len,         # length of the input sequences, i.e. words per sample
                         weights=[embeddings_matrix],  # pre-trained embedding matrix used as initial weights
                         trainable=False               # keep the word vectors frozen during training
                         )
    embed = embedder(main_input)
    flat = Flatten()(embed)
    dense01 = Dense(5096, activation='relu')(flat)
    dense02 = Dense(1024, activation='relu')(dense01)
    main_output = Dense(2, activation='softmax')(dense02)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model


TextCNN = build_textcnn(64,embedding_dim,embedding_matrix)
# Data set load
X_train, y_train = train_data, to_categorical([0, 1], num_classes=2)
# Rough model training
history = TextCNN.fit(X_train, y_train,
                      batch_size=2,
                      epochs=3,
                      verbose=1)
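
Strictly speaking, build_textcnn above only stacks Dense layers on the flattened embeddings. A sketch closer to a classic TextCNN, reusing the already-imported Conv1D / GlobalMaxPooling1D / concatenate layers and the same frozen embedding matrix (my own variant, not part of the original article):

def build_textcnn_conv(max_len, embeddings_dim, embeddings_matrix):
    main_input = Input(shape=(max_len,), dtype='float64')
    embed = Embedding(len(embeddings_matrix), embeddings_dim,
                      input_length=max_len,
                      weights=[embeddings_matrix],
                      trainable=False)(main_input)
    # Convolutions over several kernel sizes (2/3/4 chosen for illustration), each followed by global max pooling
    convs = []
    for kernel_size in (2, 3, 4):
        c = Conv1D(128, kernel_size, activation='relu')(embed)
        convs.append(GlobalMaxPooling1D()(c))
    merged = concatenate(convs)
    dense = Dense(128, activation='relu')(merged)
    main_output = Dense(2, activation='softmax')(dense)
    model = Model(inputs=main_input, outputs=main_output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# TextCNN_conv = build_textcnn_conv(64, embedding_dim, embedding_matrix)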

1.2.6 Visualizing word2vec

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def wv_visualizer(model, word):

    # Find the ten most similar words
    words = [wp[0] for wp in model.wv.most_similar(word, topn=10)]
    # Extract the word vectors corresponding to those words
    wordsInVector = [model.wv[word] for word in words]
    # PCA dimensionality reduction
    pca = PCA(n_components=2)
    pca.fit(wordsInVector)
    X = pca.transform(wordsInVector)
    # Draw the graph
    xs = X[:, 0]
    ys = X[:, 1]
    plt.figure(figsize=(12, 8))
    plt.scatter(xs, ys, marker='o')
    for i, w in enumerate(words):
        plt.annotate(
            w,
            xy=(xs[i], ys[i]), xytext=(6, 6),
            textcoords='offset points', ha='left', va='top', **dict(fontsize=10)
        )

    plt.show()

# Pass in the target words
wv_visualizer(model, ["man", "king"])

I'm an NLP newbie with limited knowledge; if there are mistakes or things that could be improved, please point them out!!