
Overview

Word vectors and word embeddings can be used to represent words.

But if we want to represent the entire document, we need to use document vectors.

By a document, we mean a collection of words that conveys some meaning to the reader.

  • A document can be a sentence or a group of sentences.
  • A document can consist of product reviews, tweets, or movie dialogue, ranging in length from a few words to thousands.
  • A document can serve as a text sample in machine learning (deep learning) projects, from which a representation-learning algorithm can learn.

We can use different techniques to represent a document:

  • The simplest way is to calculate the average of all the component word vectors of a document and represent the document with the average vector.
  • Another approach is Doc2Vec.
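The averaging approach can be sketched in a few lines. This is a minimal illustration using made-up 3-dimensional word vectors; in practice the vectors would come from a trained Word2Vec model:

```python
# Made-up word vectors for illustration (real ones come from Word2Vec)
word_vectors = {
    "i":     [0.2, 0.1, 0.4],
    "drink": [0.6, 0.3, 0.2],
    "water": [0.4, 0.5, 0.0],
}

def document_vector(tokens, vectors):
    """Average the vectors of all tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

doc_vec = document_vector(["i", "drink", "water"], word_vectors)
# doc_vec is approximately [0.4, 0.3, 0.2]
```

The obvious limitation is that averaging discards word order, which is one motivation for Doc2Vec.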

Doc2Vec, or Paragraph2vec, is an unsupervised algorithm that provides a vector expression for Sentences/paragraphs/Documents, an extension of Word2vec.

Doc2Vec vectors can measure the similarity between sentences/paragraphs/documents by computing the distance between vectors, which makes them useful for text clustering. For labeled data, the vectors can also be used with supervised learning for text classification, for example the classic sentiment analysis task of labeling a document as positive, neutral, or negative.
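The similarity comparison typically uses cosine similarity between document vectors. A minimal sketch, assuming we already have (hypothetical) Doc2Vec vectors for three documents:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc_a = [0.9, 0.1, 0.0]   # e.g. a positive review
doc_b = [0.8, 0.2, 0.1]   # another positive review
doc_c = [0.0, 0.1, 0.9]   # a negative review

# Similar documents have a higher cosine similarity
assert cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c)
```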

Doc2Vec basic principles

The way to train sentence vectors is very similar to the way to train word vectors.

The core idea of Word2Vec is to predict each word Wi from its context. In other words, the context of a word influences the generation of Wi, so the same method can be used to train Doc2Vec. Consider a sentence S:

i want to drink water

If we want to predict the word "want" in this sentence, features can be generated not only from the other words, but from the other words together with the sentence S itself.
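The idea can be illustrated by listing the training pairs this produces: each target word is predicted from its surrounding context words plus the document's tag. This is only a conceptual sketch of how the inputs are assembled; the actual prediction happens inside Doc2Vec's neural network:

```python
def training_pairs(tokens, doc_tag, window=1):
    """For each target word, pair it with its context words plus the doc tag."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context + [doc_tag], target))
    return pairs

sentence = ["i", "want", "to", "drink", "water"]
pairs = training_pairs(sentence, doc_tag="S")
# The word "want" is predicted from ["i", "to", "S"]:
# pairs[1] == (["i", "to", "S"], "want")
```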

The application of Doc2Vec

  • Document similarity: You can use document vectors to compare text similarity
  • Document recommendation: Recommend similar documents based on what users have read
  • Document prediction: build a supervised model on top of document vectors to predict document topics.
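As a toy illustration of the prediction use case, a classifier can assign a new document vector to the class whose centroid is nearest. The 2-D vectors and labels below are made up for the sketch:

```python
import math

# Made-up document vectors with sentiment labels
labeled_docs = {
    "pos": [[0.9, 0.1], [0.8, 0.3]],
    "neg": [[0.1, 0.9], [0.2, 0.8]],
}

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

centroids = {label: centroid(vecs) for label, vecs in labeled_docs.items()}

def predict(doc_vec):
    """Label a document vector by its nearest class centroid."""
    return min(centroids, key=lambda label: math.dist(doc_vec, centroids[label]))

print(predict([0.85, 0.2]))  # prints "pos"
```

In practice one would train a proper classifier (e.g. logistic regression) on Doc2Vec vectors, but the principle is the same: document vectors become the feature input.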

Practice

The Gensim library is used to convert news headlines into Doc2Vec vectors.

Gensim – Doc2Vec vector

Importing a dependency library

import pandas as pd
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords
import random
import warnings
warnings.filterwarnings("ignore")

Load the document

## Load the document

sample_news_dir = "./data/sample_news_data.csv"
df = pd.read_csv(sample_news_dir)

Defining preprocessing classes

Define a document-processing class that removes stop words from documents.

  • preprocess_string: applies the given filters to the input string
  • remove_stopwords: removes stop words from the given document

Because Doc2Vec requires each sample to be a TaggedDocument instance, we create one TaggedDocument per document, tagged with its row index.

# Define the preprocessor class
class DocumentDataset(object):
    def __init__(self, data: pd.DataFrame, column):
        document = data[column].apply(self.preprocess)
        # Each sample must be a TaggedDocument; use the row index as its tag
        self.documents = [TaggedDocument(text, [index])
                          for index, text in document.items()]

    def preprocess(self, document):
        return preprocess_string(remove_stopwords(document))

    def __iter__(self):
        for document in self.documents:
            yield document

    def tagged_documents(self, shuffle=False):
        if shuffle:
            random.shuffle(self.documents)
        return self.documents

Call the class

documents_dataset = DocumentDataset(df, "news")

Create the Doc2Vec model

Similar to Word2Vec, the Doc2Vec class accepts parameters such as min_count, window, vector_size, sample, negative, and workers. Among them:

  • min_count: ignores all words with a total frequency lower than this value
  • window: the maximum distance between the current word and the predicted word within a sentence
  • vector_size: the dimensionality of each vector
  • sample: the threshold for randomly downsampling high-frequency words; the useful range is (0, 1e-5)
  • negative: if > 0, negative sampling is used, and the value specifies how many noise words to draw (usually 5 to 20); if set to 0, negative sampling is not used
  • workers: number of worker threads used to train the model (multicore machines train faster)

To build a vocabulary from the document sequence, Doc2Vec provides the build_vocab method; note that its argument must be an iterable of TaggedDocument instances.

docVecModel = Doc2Vec(min_count=1,
                      window=5,
                      vector_size=100,
                      sample=1e-4,
                      negative=5,
                      workers=2)

docVecModel.build_vocab(documents_dataset.tagged_documents())

Train and save Doc2Vec

docVecModel.train(documents_dataset.tagged_documents(shuffle=True),
                  total_examples=docVecModel.corpus_count,
                  epochs=30)
docVecModel.save("./docVecModel.d2v")

Check the Doc2Vec vectors

# Look up the learned vector for the document tagged 123
# (on gensim < 4.0, use docVecModel.docvecs[123] instead)
docVecModel.dv[123]