1. Tell me something about GloVe

As the title of its paper suggests, **GloVe** stands for **Global Vectors for Word Representation**. It is a word representation tool based on counts and overall corpus statistics, which represents a word as a vector of real numbers that captures semantic relationships between words, such as similarity and analogy. We can then measure the semantic similarity of two words with simple vector operations, such as Euclidean distance or cosine similarity.
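For example, here is a minimal NumPy sketch of both similarity measures (the vectors are made up for illustration, not real GloVe embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: values near 1 mean similar direction.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional "word vectors" (real GloVe vectors are typically 50-300 dimensional).
king = np.array([0.8, 0.3, 0.1, 0.9])
queen = np.array([0.7, 0.4, 0.2, 0.8])

print(cosine_similarity(king, queen))  # close to 1: semantically similar
print(np.linalg.norm(king - queen))    # Euclidean distance: smaller = more similar
```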

2. Implementation steps of GloVe

2.1 Construction of co-occurrence matrix

What is a co-occurrence matrix?

A co-occurrence matrix, as its name implies, counts how often items occur together. Word-document co-occurrence matrices are mainly used to discover topics and are the basis of topic models such as LSA.

A word-word co-occurrence matrix built over a local window can capture syntactic and semantic information. For example, take the corpus:

  • I like deep learning.
  • I like NLP.
  • I enjoy flying.

With a window size of 1, the co-occurring word pairs are: {"I like", "like deep", "deep learning", "like NLP", "I enjoy", "enjoy flying", "I like"}.

We can obtain the following co-occurrence matrix (a symmetric matrix):

| counts | I | like | enjoy | deep | learning | NLP | flying |
| --- | --- | --- | --- | --- | --- | --- | --- |
| I | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 | 1 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 |

Each cell records how many times the row word and the column word appear together in the corpus within the window; this count is the co-occurrence feature.
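To make the counting concrete, here is a minimal Python sketch (assuming window size 1, whitespace tokenization, and ignoring the sentence-final periods) that reproduces the matrix above:

```python
from collections import defaultdict

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
window = 1  # only immediate neighbours count as co-occurrences

counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        # Look at every token within the window on each side of position i.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(word, tokens[j])] += 1

print(counts[("I", "like")])    # 2
print(counts[("like", "NLP")])  # 1
```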

The co-occurrence matrix of GloVe

A co-occurrence matrix $X$ is constructed from the corpus: each element $X_{ij}$ represents the number of times word $i$ and context word $j$ appear together within a context window of a specific size. In general, each such co-occurrence contributes a count of 1. GloVe, however, does not do this: it proposes a decay function based on the distance $d$ between the two words in the context window, $\text{decay} = 1/d$, to compute the weight, meaning that the further apart two words are, the less weight they contribute to the total count.
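Here is a minimal sketch of this distance-weighted counting (the function name weighted_cooccurrence, the toy corpus, and the default window size are illustrative):

```python
from collections import defaultdict

def weighted_cooccurrence(corpus, window=5):
    # X[(w, c)] accumulates 1/d for each co-occurrence of w and c at distance d.
    X = defaultdict(float)
    for sentence in corpus:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    X[(word, tokens[j])] += 1.0 / abs(i - j)
    return X

X = weighted_cooccurrence(["I like deep learning", "I like NLP", "I enjoy flying"])
print(X[("I", "deep")])  # distance 2, so it contributes 0.5 instead of 1
```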

2.2 Approximate relationship between word vector and co-occurrence matrix

To construct an approximate relationship between the word vectors and the co-occurrence matrix, the authors of the paper propose the following formula:

$$w_i^T \tilde{w}_j + b_i + \tilde{b}_j = \log(X_{ij})$$

Here, $w_i$ and $\tilde{w}_j$ are the word vectors we ultimately want to solve for, and $b_i$ and $\tilde{b}_j$ are the bias terms of the two word vectors. Of course, you may have many questions about this formula: where does it come from, why use this particular form, and why construct two sets of word vectors $w_i$ and $\tilde{w}_j$? Please refer to the references at the end of this article.

2.3 Construction of loss function

With the formula from 2.2, we can construct the loss function:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2$$

The basic form of this loss function is the simplest mean squared error, but with a weighting function $f(X_{ij})$ added on top. So what does this function do, and why add it? We know that many word pairs co-occur frequently in a corpus, so we want the following:

  • Frequently co-occurring word pairs should carry more weight in the loss than rare co-occurrences, so $f$ should be non-decreasing.
  • However, we do not want them to be overweighted: beyond a certain level, $f$ should stop increasing.
  • If two words never appear together, that is, $X_{ij} = 0$, they should not participate in the loss at all, so $f$ must satisfy $f(0) = 0$.

There are many functions that satisfy these three conditions, and the authors of the paper adopt the following piecewise function:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

In the paper's experiments, $x_{\max} = 100$ and $\alpha = 3/4$.

Its graph rises as $(x / x_{\max})^{\alpha}$ from $f(0) = 0$ and plateaus at 1 once $x$ reaches $x_{\max}$.
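A direct Python translation of this weighting function, using the paper's reported values as defaults:

```python
def weight(x, x_max=100.0, alpha=0.75):
    # Weighting function f(X_ij) from the GloVe loss:
    # grows as (x / x_max)^alpha, saturates at 1, and is 0 at x = 0.
    return (x / x_max) ** alpha if x < x_max else 1.0

print(weight(0))    # 0.0  -> pairs that never co-occur are ignored
print(weight(10))   # ~0.18
print(weight(100))  # 1.0  -> frequent pairs are capped at weight 1
```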

2.4 Training the GloVe model

Although many people describe GloVe as an unsupervised learning method (because it requires no manually annotated labels), it still has a label: the $\log(X_{ij})$ in the formula above. The vectors $w$ and $\tilde{w}$ are the parameters that are constantly updated/learned, so in essence the training is no different from supervised learning: it is all gradient descent.

Specifically, here is what the experiments in the paper did: they use the AdaGrad gradient descent algorithm, randomly sampling all non-zero elements in matrix $X$, with the learning rate set to 0.05. Vectors smaller than 300 dimensions are iterated over 50 times, and vectors of other sizes 100 times, until convergence. The final result is two sets of vectors, $w$ and $\tilde{w}$. Because $X$ is symmetric, $w$ and $\tilde{w}$ should in principle also be equivalent; the only difference is that their initial values differ, so their final values differ.
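To make the procedure concrete, here is a heavily simplified NumPy sketch of such a training loop (the function train_glove and its tiny default hyperparameters are illustrative, not the paper's released implementation): AdaGrad updates over randomly shuffled non-zero entries of $X$, returning the sum $w + \tilde{w}$:

```python
import numpy as np

def train_glove(X, vocab_size, dim=50, lr=0.05, epochs=50, x_max=100.0, alpha=0.75):
    # X: dict mapping (i, j) index pairs to (possibly decay-weighted) co-occurrence counts.
    W = np.random.uniform(-0.5, 0.5, (vocab_size, dim)) / dim
    W_tilde = np.random.uniform(-0.5, 0.5, (vocab_size, dim)) / dim
    b = np.zeros(vocab_size)
    b_tilde = np.zeros(vocab_size)
    # AdaGrad accumulators, one per parameter (initialized to 1 to avoid division by zero).
    gW, gW_tilde = np.ones_like(W), np.ones_like(W_tilde)
    gb, gb_tilde = np.ones_like(b), np.ones_like(b_tilde)

    pairs = list(X.items())
    for _ in range(epochs):
        np.random.shuffle(pairs)  # stochastically sample the non-zero entries
        for (i, j), x in pairs:
            f = (x / x_max) ** alpha if x < x_max else 1.0
            # Inner term of the loss: w_i . w~_j + b_i + b~_j - log(X_ij)
            diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x)
            # Gradients of f * diff^2 (the constant factor 2 is folded into lr).
            grad_wi = f * diff * W_tilde[j]
            grad_wj = f * diff * W[i]
            gW[i] += grad_wi ** 2
            gW_tilde[j] += grad_wj ** 2
            gb[i] += (f * diff) ** 2
            gb_tilde[j] += (f * diff) ** 2
            W[i] -= lr * grad_wi / np.sqrt(gW[i])
            W_tilde[j] -= lr * grad_wj / np.sqrt(gW_tilde[j])
            b[i] -= lr * f * diff / np.sqrt(gb[i])
            b_tilde[j] -= lr * f * diff / np.sqrt(gb_tilde[j])
    return W + W_tilde  # the sum of both vector sets, as discussed below
```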

So these two sets of vectors are in fact equivalent, and either could be used as the final result. For robustness, however, the sum of the two, $w + \tilde{w}$, is chosen as the final vector (the different initializations amount to different random noise, which improves robustness). After training on a corpus of 40 billion tokens, the experimental results obtained in the paper are summarized below.

The evaluation uses three metrics: semantic accuracy, syntactic accuracy, and overall accuracy. It is not hard to see that a vector dimension of 300 is optimal, while the best context window size is roughly between 6 and 10.

3. Comparison between GloVe, LSA and Word2Vec

LSA (Latent Semantic Analysis) is an early count-based word vector representation tool. It is also based on a co-occurrence matrix, but it uses SVD matrix factorization to reduce the dimensionality of the large matrix. As we know, the complexity of SVD is high, so its computational cost is large. It also gives the same statistical weight to every word. These shortcomings are overcome one by one in GloVe.
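To make the contrast concrete, here is a minimal sketch of the LSA-style step: a truncated SVD of the toy co-occurrence matrix from section 2.1 (plain NumPy; the choice of k is illustrative):

```python
import numpy as np

# Toy co-occurrence matrix from section 2.1
# (row/column order: I, like, enjoy, deep, learning, NLP, flying).
X = np.array([
    [0, 2, 1, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(X)
k = 2                             # keep only the top-k singular values
word_vectors = U[:, :k] * s[:k]   # each row is now a k-dimensional word vector
print(word_vectors.shape)         # (7, 2)
```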

The biggest disadvantage of Word2Vec is that it does not make full use of the global statistics of the corpus, since it trains only on local context windows; GloVe, in effect, combines the advantages of the two. According to the experimental results given in the paper, GloVe performs far better than both LSA and Word2Vec, though some people online report that GloVe and Word2Vec actually perform similarly.

4. Code implementation

Generating word vectors

Download the GitHub project: github.com/stanfordnlp…

Decompress the package and run

make

in the directory to compile.

Then run sh demo.sh to train and generate the word vector files: vectors.txt and vectors.bin.
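Once training finishes, here is a minimal Python sketch (assuming the standard vectors.txt layout of one word followed by its float components per line) for loading the vectors and querying similar words:

```python
import numpy as np

def load_glove(path="vectors.txt"):
    # Each line: word followed by its vector components, space-separated.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def most_similar(word, vectors, topn=5):
    # Rank every other word by cosine similarity to the query word.
    v = vectors[word]
    sims = {w: np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))
            for w, u in vectors.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

# vectors = load_glove("vectors.txt")
# print(most_similar("king", vectors))
```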

GloVe code implementation


5. References

  • Pennington J., Socher R., Manning C. GloVe: Global Vectors for Word Representation. EMNLP 2014.
  • NLP text representation: from bag-of-words to Word2Vec

Author: @mantchs

GitHub: github.com/NLP-LOVE/ML…

Welcome to join the discussion and work together to improve this project! Group number: [541954936]