“This is the 7th day of my participation in the November More Text Challenge. See the event details: The Last More Text Challenge of 2021.”
Overview
Word vectors and word embeddings represent individual words.
To represent an entire document, we need a document vector.
When we talk about a document, we mean a collection of words that carries some meaning for the reader.
- A document can be a single sentence or a group of sentences.
- A document can be a product review, a tweet, or a line of movie dialogue, ranging in length from a few words to thousands of words.
- A document can serve as a text sample that a representation-learning algorithm learns from in a machine learning (deep learning) project.
We can use different techniques to represent a document:
- The simplest approach is to average all the component word vectors of a document and use that mean vector to represent the document.
- Another approach is Doc2Vec.
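The first approach, averaging, can be sketched as follows. The tiny word vectors here are hypothetical toy values for illustration; a real project would look them up in a trained Word2Vec model.

```python
import numpy as np

# Hypothetical toy word vectors (a real project would use trained embeddings).
word_vectors = {
    "i":     np.array([0.1, 0.3]),
    "drink": np.array([0.8, 0.2]),
    "water": np.array([0.7, 0.1]),
}

def document_vector(words, vectors):
    """Represent a document as the mean of its known word vectors."""
    return np.mean([vectors[w] for w in words if w in vectors], axis=0)

doc_vec = document_vector(["i", "drink", "water"], word_vectors)
print(doc_vec)  # element-wise mean of the three word vectors
```

This is cheap and surprisingly effective for short texts, but it discards word order, which is one motivation for Doc2Vec.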
Doc2Vec, also known as Paragraph2Vec, is an unsupervised algorithm that produces vector representations for sentences/paragraphs/documents; it is an extension of Word2Vec.
Doc2Vec vectors can measure the similarity between sentences/paragraphs/documents by computing the distance between their vectors, which makes them useful for text clustering. With labeled data, they can also feed supervised learning for text classification, for example the classic sentiment analysis task of labeling a document as positive/neutral/negative.
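The distance between two document vectors is commonly measured with cosine similarity. A minimal sketch with hypothetical vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = np.array([0.5, 0.2, 0.1])
doc_b = np.array([0.4, 0.25, 0.05])
print(cosine_similarity(doc_a, doc_b))  # close to 1.0: similar documents
```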
Doc2Vec basic principles
The way we train sentence vectors is very similar to the way we train word vectors.
The core idea of Word2Vec is to predict each word Wi from its context; in other words, the context of a word shapes how Wi is generated. The same idea can be used to train Doc2Vec. Consider a sentence S:
i want to drink water
To predict the word want in this sentence, features can be generated not only from the surrounding words but also from the sentence (paragraph) vector of S itself, which acts as an extra piece of context.
Applications of Doc2Vec
- Document similarity: compare the similarity of texts through their document vectors.
- Document recommendation: recommend documents similar to those a user has already read.
- Document prediction: train a supervised model on document vectors to predict, for example, a document's topic.
Practice
We use the Gensim library to convert news headlines into Doc2Vec vectors.
Gensim – Doc2Vec vector
Import the dependency libraries
import pandas as pd
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
from gensim.parsing.preprocessing import preprocess_string,remove_stopwords
import random
import warnings
warnings.filterwarnings("ignore")
Load the document
## Load the document
sample_news_dir = "./data/sample_news_data.csv"
df = pd.read_csv(sample_news_dir)
Defining preprocessing classes
Define a document-processing class that removes stop words from the documents.
- preprocess_string applies a given list of filters to the input text; by default it lowercases, strips tags, punctuation, and numbers, removes stop words, drops short words, and stems.
- remove_stopwords removes the stop words from a given document.
Because Doc2Vec requires each sample to be a TaggedDocument instance, we create a list with one TaggedDocument per document.
Define the preprocessor class
class DocumentDataset(object):
    def __init__(self, data: pd.DataFrame, column):
        # Preprocess every document in the given column.
        document = data[column].apply(self.preprocess)
        # Doc2Vec expects TaggedDocument instances; tag each one with its row index.
        self.documents = [TaggedDocument(text, [index]) for index, text in document.items()]

    def preprocess(self, document):
        # Remove stop words first, then apply Gensim's default filters.
        return preprocess_string(remove_stopwords(document))

    def __iter__(self):
        for document in self.documents:
            yield document

    def tagged_documents(self, shuffle=None):
        if shuffle:
            random.shuffle(self.documents)
        return self.documents
Call the class
documents_dataset = DocumentDataset(df, "news")
Create the Doc2Vec model
Similar to Word2Vec, the Doc2Vec class takes parameters such as min_count, window, vector_size, sample, negative, and workers. Among them:
- min_count: ignore all words with a total frequency lower than this value.
- window: the maximum distance between the current word and the predicted word within a sentence.
- vector_size: the dimensionality of each vector.
- sample: the threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).
- negative: if > 0, negative sampling is used, and this value specifies how many noise words are drawn (usually between 5 and 20); if 0, negative sampling is not used.
- workers: the number of worker threads used to train the model (multi-core machines train faster).
To build the vocabulary from a sequence of documents, Doc2Vec provides the build_vocab method; note that its argument must be an iterable of TaggedDocument instances.
docVecModel = Doc2Vec(min_count=1,
                      window=5,
                      vector_size=100,
                      sample=1e-4,
                      negative=5,
                      workers=2)
docVecModel.build_vocab(documents_dataset.tagged_documents())
Train and save the Doc2Vec model
docVecModel.train(documents_dataset.tagged_documents(shuffle=True),
                  total_examples=docVecModel.corpus_count,
                  epochs=30)
docVecModel.save("./docVecModel.d2v")
Inspect the Doc2Vec vectors
docVecModel.dv[123]  # vector of the document tagged 123 (use docVecModel.docvecs[123] in Gensim 3.x)