Word2vec

0x00 Abstract

This article tries to explain things in an easy-to-understand way, avoiding mathematical formulas as far as possible, and instead works from the overall picture, using intuition to help you sort out the concepts related to Word2vec.

0x01 Overview

1. Why

Originally I only planned to write about Word2vec, but as I combed through the prerequisite knowledge point by point, Word2vec itself ended up occupying only a small part. So this article focuses on teasing out those concepts so that you can understand Word2vec better.

In order to discuss Word2vec, we need to know (or take it as a given) the following prerequisites:

  • One-hot encoding / distributed representation
  • Score function / loss function
  • Self-information / entropy / cross entropy / KL divergence / maximum likelihood estimation
  • Softmax regression

We know that there are four key parts to machine learning:

  • A numerical representation of the input.
  • A numerical representation of the output.
  • A method for mapping input to output.
  • Evaluation criteria for the mapping method.

So our basic path through the concepts runs from input to output:

  • Because of the neural network's input, we sort out one-hot encoding / distributed representation.
  • Because of the neural network's training / loss function, we sort out a series of concepts: self-information / entropy / cross entropy / KL divergence / maximum likelihood estimation.
  • Because of the neural network's output, we sort out Softmax regression.

Machine learning is full of deep concepts with mathematical and physical roots, and the same concept often has multiple interpretations. Find one interpretation that you can understand, and go with it.

2. Word2vec in brief

The Word2vec model is a simplified neural network. However, the network is trained not to accurately predict the correct center/surrounding words, but to obtain the word → vector mapping. Its specific breakdown is as follows:

  • Construct data: Construct word pairs from raw data in the form of [input word, output word], that is, [data x, label y].

  • Input layer: All words are one-hot encoded as input; the input is an n-dimensional vector (n is the number of words in the vocabulary).

  • Hidden layer: There is only one hidden layer in the middle (no activation function, just linear units). The hidden layer actually stores the word vectors of all the words in the vocabulary: it is a [vocabulary size x embedding size] matrix, and each row of the matrix corresponds to the word vector of a particular word.

  • Output layer: The output is also an n-dimensional vector, with the same dimension as the input layer, and the values of its dimensions add up to 1. Softmax regression is used; Softmax guarantees that the output vector is a probability distribution. Once converted to probabilities, we can apply maximum likelihood estimation (cross entropy) to maximize the likelihood or minimize the cross entropy.

  • Define the loss function: used to measure prediction error and optimize the model.

    The value of our label Y is a probability distribution, and the output layer, after Softmax processing, is also a probability distribution. Cross entropy can therefore be used to measure the difference between the network's output and the label Y, which defines the loss.

  • Iterative training: The gradient descent algorithm is adopted; each iteration compares the loss between the prediction and the label Y and then optimizes accordingly, so that in the end similar words have similar vectors.

  • Final matrix: The output layer is discarded; only the hidden layer's weights (namely the weights between the input layer and the hidden layer) are kept, forming the look-up table.

After the model is trained, we will not use the trained model to handle new tasks. What we really need is the parameters that the model has learned from the training data, such as the weight matrix of the hidden layer.

Let’s break down the concepts one by one.

0x02 Data/input layer related concepts

Computer systems can only recognize numbers, so they need to digitally represent what they observe in order to achieve their predictions.

1. One-hot encoding

One-hot encoding ensures that, for each sample, exactly one bit of a single feature is in state 1 and all the others are 0. A concrete example follows: in the corpus, each of Hangzhou, Shanghai, Ningbo, and Beijing corresponds to a vector in which exactly one value is 1 and the rest are 0.

```
Hangzhou  [0,0,0,0,0,0,0,1,0,...,0]
Shanghai  [0,0,0,0,1,0,0,0,0,...,0]
Ningbo    [0,0,0,1,0,0,0,0,0,...,0]
Beijing   [0,0,0,0,0,0,0,0,0,...,1,...,0]
```
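As a minimal sketch (the city list and index positions here are illustrative, not from any real corpus), one-hot encoding can be produced like this:

```python
import numpy as np

# Illustrative vocabulary; a real corpus would have far more entries.
vocab = ["Ningbo", "Shanghai", "Hangzhou", "Beijing"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("Hangzhou", vocab))  # [0. 0. 1. 0.]
```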

Its disadvantages are:

  • The dimensionality of the vector grows with the number of distinct words. If the vectors for the names of all the cities in the world were stacked into a single matrix, the matrix would be far too sparse, causing the curse of dimensionality.
  • The city encoding is arbitrary, and the vectors are independent of each other, so they cannot represent semantic relationships between words.

Therefore, people want to improve on one-hot encoding as follows:

  • Change each element of the vector from an integer to a floating-point value over the full range of real numbers;
  • Compress the original sparse, high-dimensional vectors into a lower-dimensional continuous space, producing dense vectors. Words with similar meanings are then mapped to nearby positions in the vector space.

Simply put, we are looking for a spatial mapping, embedding a higher-dimensional word vector into a lower-dimensional space. And then you can move on.

2. Distributed representation

In fact, distributed representation was first proposed by Hinton in 1986. The basic idea is to express each word as an n-dimensional dense, continuous vector of real numbers. The relationships between these real vectors can then represent similarity between words, for example through vector angles or Euclidean distance.

There is a special term for the distributed representation technique for word vectors: word embedding. As the names suggest, one-hot encoding merely assigns each word a code, while a distributed representation compresses words from a sparse, high-dimensional space into a lower-dimensional vector space.

The biggest contribution of Distributed representation is that related or similar words are closer in distance.

Compared with one-hot, another difference of distributed representation is that the dimensionality drops dramatically. For a vocabulary of 1,000,000 words, we can use a 100-dimensional real-valued vector to represent a word, while one-hot requires 1,000,000 dimensions.

Why can a word vector mapped into a vector space represent a specific word, and how can it tell us how similar two words are?

  • On why it can represent words: a distributed representation essentially finds a mapping function that projects each word's original one-hot representation into a lower-dimensional space, one to one. So a distributed representation can stand for a specific word.

  • On why it also captures the relationships between words: we must understand the distributional hypothesis, which assumes that words appearing in the same contexts have similar meanings. All methods of learning word embeddings use mathematical models of the relationship between words and their contexts.

The core idea of distributed representation of word vector consists of two parts:

  • Choose a way to describe the context;
  • Choose a model to describe the relationship between the target word and its context.

In fact, both the hidden layer of a neural network and the probabilistic topic model with multiple latent variables are applications of distributed representation.

0x03 Training/hidden layer related concepts

The emphasis here is on cross entropy as a loss function, which pulls out a series of concepts.

  • KL divergence (relative entropy) measures how close two distributions are, so in principle it should be used as the cost in the loss calculation.
  • But minimizing KL divergence is, in this setting, equivalent to minimizing cross entropy, and cross entropy is easier to compute, so cross entropy is used as the cost.
  • Minimizing cross entropy gives the same result as maximizing the likelihood function, so either can be used as the loss function.

1. score function

The linear classification method contains two important components: Score function and loss function.

Assume the training samples fall into K classes.

The score function of a linear classifier is the linear function f(x_i) = W x_i + b, where W (a [K x D] matrix) and b (a [K x 1] vector) are collectively called the parameters of the score function.

The output of the function is a K x 1 vector, where the i-th entry is the score for the i-th class. The higher the score, the more likely x_i belongs to that class.

The i-th rows of W and b determine the score of the i-th class, so they are called the classifier of the i-th class.

Since x_i is a D-dimensional vector (each dimension is a feature), we can think of it as a point in D-dimensional space (the sample space).

For each classifier there is a line (more precisely, a decision boundary, which in higher dimensions is a hyperplane). By checking which side of this boundary x_i falls on, we can classify it.
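Here is a minimal numpy sketch of this score function, with made-up shapes (K = 3 classes, D = 4 features):

```python
import numpy as np

K, D = 3, 4                       # number of classes, feature dimension (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(K, D))       # [K x D] weight matrix
b = rng.normal(size=(K, 1))       # [K x 1] bias vector
xi = rng.normal(size=(D, 1))      # one sample: a point in D-dimensional space

scores = W @ xi + b               # [K x 1]; the i-th entry is the score for class i
print("predicted class:", int(np.argmax(scores)))
```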

2. loss function

Sometimes it is called the cost function or the objective. It measures how well the model's predictions (determined by W and b) fit the true labels on the input set.

“More accurate prediction” is equivalent to “less loss”.

3. Self-information

When an event x drawn from distribution P occurs, the minimum length of information required to convey this fact is the self-information, expressed as


$$I(x) = \log \frac{1}{P(x)}$$

Explanation:

This is about coding. Transmitting the state value of a random variable requires some number of storage units; the longer the code, the more information it can carry. From this we get the equation p = 1/a^n, where:

  • p is the probability of getting something.
  • a is the number of states a single storage unit can represent. If the storage unit is a bit, a is 2.
  • n is the encoding length.

So the total number of states of the random variable is a^n, and the probability of picking any single value is p = 1/a^n. Solving this formula for n yields the code length, which gives the formula for I(x).

4. Entropy

When an event is randomly drawn from distribution P, the optimal average length of information required to convey it is the Shannon entropy, expressed as


$$H(P) = \sum_{x} P(x)\log \frac{1}{P(x)}$$

Explanation:

If the probability of an event is P (x), its self-information is I(x).

The optimal average information length is the weighted average (i.e., expectation) of the self-information of each possible event: the sum over events of {length of information} times {probability of that event}.


$$\text{Information entropy} = \sum_{x} \left(\text{probability of occurrence of information } x\right) \times \left(\text{amount of information needed to convey } x\right)$$

For example:

In a horse race, there are four horses {A, B, C, D} with winning probabilities {1/2, 1/4, 1/8, 1/8}. Then


$$H(P) = \frac{1}{2} \log(2) + \frac{1}{4} \log(4) + \frac{1}{8} \log(8) + \frac{1}{8} \log(8) = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{3}{8} = \frac{7}{4} \text{ bits}$$

In a binary computer, one bit (a 0 or a 1) represents the answer to one yes/no question. This means that the average code length required to encode which horse wins is 1.75 bits.
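A quick check of this arithmetic in Python (a sketch, using the probabilities above):

```python
import math

p = [1/2, 1/4, 1/8, 1/8]                      # winning probabilities of the four horses
H = sum(pi * math.log2(1 / pi) for pi in p)   # Shannon entropy in bits
print(H)                                      # 1.75
```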

5. Cross entropy

Consider two probability distributions p(x) and q(x) over the same sample set, where p(x) is the unknown real distribution and q(x) is an approximate, non-real distribution. The average coding length obtained when samples from the real distribution p(x) are encoded using the non-real distribution q(x) is called the cross entropy.


$$H(p,q) = \sum_{x} p(x)\log \frac{1}{q(x)} = -\sum_{x} p(x)\log q(x)$$

Explanation:

  • Which distribution does the information come from? The answer is the real distribution P.
  • The way information is transmitted (that is, how it is encoded) depends on which distribution, and the answer is an approximate distribution Q.
  • Cross entropy uses q(x) to model p(x): a coding system is built from q(x) and used to transmit the values of x to the receiver.

Now, the confusing part is: in the cross entropy formula, which one goes inside the log, p or q? It can be roughly understood as follows:

  • In the cross entropy formula, the log term computes self-information, i.e., the code length.
  • The purpose of q(x) is to simulate and approximate the real distribution p(x) when encoding and transmitting information. So q(x) is what goes inside the log.

It might be clearer to think of it this way:

When the optimal coding scheme of distribution Q is used to convey an event randomly drawn from the real distribution P, the average information length required is the cross entropy, expressed as


$$H_P(Q) = \sum_{x} P(x)\log \frac{1}{Q(x)}$$

Distribution Q is the approximate, non-real distribution used to encode a random event x drawn from P: the code length assigned to the event is determined by Q(x), while the event x itself occurs with probability P(x).

For example:

Again with the four horses {A, B, C, D}: suppose Q predicts their winning probabilities as {1/4, 1/4, 1/4, 1/4}. Then


$$H(P,Q) = \frac{1}{2} \log(4) + \frac{1}{4} \log(4) + \frac{1}{8} \log(4) + \frac{1}{8} \log(4) = \frac{2}{2} + \frac{2}{4} + \frac{2}{8} + \frac{2}{8} = 2 \text{ bits}$$

On average it takes 2 bits to transmit events from P using the code built from Q.

6. KL divergence (relative entropy)

In information theory, relative entropy measures the number of additional bits required, on average, to encode samples from P using a Q-based code. Typically, P represents the real distribution of the data, and Q represents a theoretical, model, or approximate distribution of P. Given the probability distribution of a character set, we can design an encoding that minimizes the average number of bits needed to represent strings over that character set.

KL divergence tells us how close Q is to P, or how similar they are, via cross entropy minus entropy. That is, conveying distribution P with distribution Q's optimal coding scheme costs extra average information length compared with P's own optimal coding scheme; this extra cost is the KL divergence, and it measures the difference between the two distributions.

KL divergence is expressed as follows:


$$D_{KL}(P\|Q) = D_Q(P) = H(P,Q) - H(P) = \sum_{x} P(x)\log \frac{1}{Q(x)} - \sum_{x} P(x)\log \frac{1}{P(x)} = \sum_{x} P(x)\log \frac{P(x)}{Q(x)}$$

Because KL divergence is not symmetric, it cannot be understood as a "distance". What it measures is not the spatial distance between two distributions, but the information loss of approximating one distribution with another.

With base-2 logarithms, the KL divergence is exactly how many bits of information we lose.

Note:

  • In the notation D_Q(P), the distribution of the information to be conveyed is in the parentheses;
  • In the notation D_KL(P||Q), the distribution of the information to be conveyed comes first.

There are two distributions involved in the formula D_Q(P):

  • Which distribution does the information come from? The answer is the real distribution P
  • The way information is transmitted depends on which distribution, and the answer is an approximate distribution Q

Let’s go back to the previous example.

The KL divergence is 2 bits − 7/4 bits = 1/4 bit. In information-theoretic terms, fitting P with Q loses 1/4 bit of efficiency.
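The same horse-race numbers, checked in Python (entropy, cross entropy, and their difference, the KL divergence):

```python
import math

p = [1/2, 1/4, 1/8, 1/8]   # real distribution P
q = [1/4, 1/4, 1/4, 1/4]   # approximate distribution Q

H_p  = sum(pi * math.log2(1 / pi) for pi in p)              # entropy of P: 1.75 bits
H_pq = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))  # cross entropy: 2 bits
print(H_pq - H_p)                                           # KL divergence: 0.25 bits
```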

KL divergence could be used in machine learning to evaluate the difference between two distributions. But since the distribution P is given, the KL divergence and the cross entropy define the same optimization objective, and the cross entropy has one term fewer (relative entropy = cross entropy − information entropy) and is simpler to compute, so cross entropy is the better choice.

7. Cross entropy as loss function

Cross entropy describes the distance between the actual output distribution (probabilities) and the expected output distribution (probabilities).

The smaller the cross entropy, the closer the two probability distributions are, i.e., the better the fit. If P(x) is the original distribution of the data, then the distribution Q(x) that minimizes the cross entropy is the distribution closest to P(x).

The complete cross entropy formula in information theory is:


$$\text{CrossEntropy}: H(p,q) = H(p) + D_{KL}(p\|q)$$

$$H_p(q) = -\sum_{x} p(x)\log q(x)$$

The cross-entropy loss is obtained by substituting the p and q distributions into this formula.

When the distribution P is known, the entropy of the real data is constant, and Q does not affect H(P). So, as objectives, the cross entropy and the KL divergence are equivalent: minimizing the cross entropy with respect to Q is equivalent to minimizing the KL divergence.

Minimizing cross entropy is essentially minimizing the “distance” between the predicted distribution and the actual distribution. In other words, the cross entropy loss function “wants” all probability densities of the predicted distribution to be on the correct classification.

In other words, when the output is a probability distribution, the cross entropy function can serve as the measure between the ideal and the reality. This is why it can be used as the loss function for neural networks activated by the Softmax function.

8. Maximum likelihood estimation

Idea

The mathematical basis of cross entropy function as loss function comes from Maximum Likelihood Estimation in statistics.

Maximum likelihood estimation is a parameter estimation, or more specifically, a point estimation.

The idea of maximum likelihood estimation: the parameters that maximize the probability of the observed data (the sample) are the best parameters. In plain terms, it is the "most likely" estimation method: the event with the highest probability is the one most likely to have happened.

Maximum likelihood estimation is often used to take known sample results and infer the parameter values most likely to have produced them. Usually the form of the model is already determined, and we infer its parameters. The process of determining parameter values is finding the set of parameters that maximizes the likelihood that the model produces the real observations. Since the result is known, the parameter that maximizes the probability of that result is the optimal parameter.

For example: suppose an old hunter goes hunting with a young apprentice. We know one of them shot a rabbit; if you had to guess who, most people would guess the old hunter. That is maximum likelihood: the sample is the dead rabbit, and the parameter is "the old hunter".

Derivation

In the context of maximum likelihood estimation, since the population distribution parameters are unknown, θ is used to represent one or more unknown parameters of the population.

Suppose the distribution is p = p(x; θ), where x is the observed sample and θ is the parameter to be estimated; p(x; θ) denotes the probability that x occurs when the parameter is θ.

What we want to compute is the probability of observing all the samples at once, i.e., the joint probability distribution of all observed data points. Since we assume the samples are independent and all obey the distribution with parameter θ, the probability of any sample x_i can be written as p(x_i|θ), and the total probability of the whole sample set is the product of the probabilities of observing each data point separately (i.e., the product of the marginal probabilities):

```
P(X) = p(x1|θ) · p(x2|θ) · p(x3|θ) · … · p(xn|θ)
```

Since the sample set is an observation obtained by sampling from the population, its probability is known once a specific sample set is fixed. The function above can then be regarded as the probability of obtaining the current sample set under different values of the parameter θ, i.e., its likelihood. It is therefore called the likelihood function of θ relative to the sample set X, denoted L(θ):

```
L(θ) = L(x1, x2, …, xn; θ) = p(x1|θ) · … · p(xn|θ)   (a continued product)
```

Sampling has already produced the sample set X as a fait accompli. We can assume this is the most likely outcome, the one with the highest probability among all possible outcomes; that is how maximum likelihood estimation gets its name. We then look for the value of θ that maximizes the likelihood function:

```
argmax L(θ)
```

Here argmax f(x) is the set of arguments x at which f(x) attains its maximum value, so the problem we study once again becomes an extremum problem.

The most straightforward way to find an extremum of a function is to take the derivative, set it to zero, and solve the resulting equation for θ (assuming, of course, that L(θ) is continuously differentiable). But what if θ is a vector of multiple parameters? Then we take the partial derivatives of L(θ) with respect to all of them, i.e., the gradient. For n unknown parameters this gives n equations, and the solution of the system is the extremum point of the likelihood function, yielding the values of all n parameters.
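A tiny numerical sketch of this procedure, with a made-up Bernoulli sample (here a grid search stands in for solving the derivative equation):

```python
import numpy as np

# Made-up sample: 7 ones and 3 zeros drawn from an unknown Bernoulli(theta).
x = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])

def log_likelihood(theta):
    """Log-likelihood of the whole sample for a candidate theta."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t) for t in thetas])]
print(best)   # ~0.7, matching the closed-form MLE: x.mean()
```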

Usage

Two questions give us an idea of how to use maximum likelihood.

  • Q: How does our estimate or model best fit the sample?

    A: Find the maximum likelihood function.

  • Q: What is the best set of metrics we are looking for?

    A: The parameter θ at which the likelihood function attains its maximum.

Let's relate this to the Word2vec example briefly and informally.

  1. Train the model, and pass the output layer through Softmax to obtain a probability distribution.
  2. Construct the likelihood function L(w) from that probability distribution (the probability values). Here w, the concrete values of the hidden matrix, serves as the parameter of the likelihood function L.
  3. For ease of calculation, take the logarithm of the likelihood function, giving log L(w).
  4. Maximize log L(w). The w attaining the maximum is the optimal parameter of L.
  5. The resulting w is exactly the hidden matrix we need.

9. Minimize cross entropy vs. maximum likelihood estimation

For estimating the parameters of a model, minimizing cross entropy is consistent with maximum likelihood estimation, which can be derived mathematically as follows:

Maximum likelihood estimation maximizes, for a model with parameter Θ̂, the probability that the predicted values match the real data:


$$\hat\Theta = \arg\max_\Theta \prod_{i=1}^N p(x_i|\Theta)$$

In practice, a product of many probabilities easily overflows or underflows, causing numerical instability. Owing to the monotonicity of the log function, we take the negative logarithm of the product: minimizing the negative log-likelihood (NLL) gives the same result as the original formula, so maximizing the log-likelihood function is equivalent to minimizing the negative log-likelihood function.


$$\hat\Theta = \arg\min_\Theta -\sum_{i=1}^N \log p(x_i|\Theta)$$

Performing maximum likelihood estimation on the model's predicted distribution q:

$$\hat\Theta = \arg\min_\Theta -\sum_{i=1}^N \log q(x_i|\Theta) = \arg\min_\Theta -\sum_{x\in X} p(x)\log q(x|\Theta) = \arg\min_\Theta H(p, q)$$

So minimizing the NLL is the same as minimizing the cross entropy. This is also why so many models adopt maximum likelihood estimation as the basis of their loss function.

In the world of machine learning you can roughly treat cross entropy and maximum likelihood estimation as the same thing; whenever you see one of these two terms, you should associate it with the other.

For the multi-class classification problem, the likelihood function measures, for a single observation, how plausible the current multinomial distribution model (with the prediction as its parameter) makes the sample's label. This is the likelihood function of a single sample; what we ultimately maximize is the likelihood function of the entire sample set (or a batch).

It can be said that cross entropy directly measures the difference between two distributions or two models, while the likelihood function measures the degree to which a distribution model parameterized by the model's output explains the sample set. The two have different origins but come to the same destination.

10. Training

The training process of probability model is the process of parameter estimation.

The goal of machine learning is for the distribution P(model) that the model learns from training data to be as close as possible to the distribution of the real data, P(real). How do we minimize the difference between the two distributions? The default method is to minimize their KL divergence.

However, we do not have the real data distribution, so we settle for second best: we hope that the distribution learned by the model matches the distribution of the training data, P(training), as closely as possible. That is, the training data acts as a proxy between the model and the real data.

Assuming the training data is sampled independently and identically distributed (i.i.d.) from the population, we minimize the empirical error on the training data in order to reduce the model's generalization error.

Simply put:

  • The ultimate goal is to expect the distribution of the learned model to be consistent with the real distribution: P(model) = P(real)
  • However, the real distribution is unknowable, so we have to assume that the training data is independently sampled from the real data with the same distribution: P(training) = P(real)
  • The next best thing is to learn a model distribution that is at least consistent with the distribution of training data: P(model) = P(training)

Idealizing, then: if the model can learn the distribution of the training data, it should approximately learn the distribution of the real data: P(model) ≈ P(training) ≈ P(real).

0x04 Output layer related concepts

The main concept in the output layer is Softmax.

An important property of the Softmax function is to normalize the output of the full connection layer to the probability of each corresponding classification. Once converted to probability, we can use the method of maximum likelihood estimation (cross entropy) to find the maximum likelihood or minimum cross entropy.

1. Normalization

Why normalize? In each iteration we increase the weight associated with the word to be predicted while, through normalization, reducing the weights associated with all the other words. This is not hard to understand: the predicted probabilities sum to 1, so increasing the probability of one word necessarily suppresses the probabilities of the others. Could we just increase the probability of one word alone? We could, but convergence would be slow.

2. Softmax regression

Softmax is a regression model used in machine learning for classification problems that satisfy the MECE principle: the classes are mutually exclusive and collectively exhaustive. "Male" and "female", for example, satisfy the MECE principle.

The Softmax function takes an n-dimensional vector as input and converts each entry into a real number in (0,1). The goal of training is to make a sample that belongs to the k-th class receive as high a probability as possible after Softmax. This makes the classification problem easier to explain with statistical methods.

max vs soft max

Max

If you have two numbers a and b with a > b, taking the max always selects a; there is no other possibility.

But sometimes we don't want that, because it starves the item with the lower score. We would like the item with the high score to be picked frequently and the item with the low score to be picked occasionally; for that, we use softmax.

Soft max

With softmax, for the same a and b with a > b, if we compute the probability of picking each according to softmax, then since a's softmax value is larger, a is picked frequently and b occasionally, with probabilities depending on their original magnitudes. So it is a soft max, not a hard max.

As its name suggests, the Softmax function is a “soft” maximum function, which does not take the maximum output category directly as a result of classification, but also takes into account the output of other relatively small categories.

Softmax gets the probability

So what are the respective probabilities of a and b (relative probabilities here)? Look at the definition below:

Suppose we have an array V, and let V_i denote its i-th element. The softmax value of element i is then


$$S_i = \frac{e^{V_i}}{\sum_{j} e^{V_j}}$$

That is, the ratio of the exponent of this element to the sum of the exponents of all elements.

In a nutshell, you calculate the proportion of each value in a set of values.

Putting it colloquially again: a hard max only ever picks the single favorite, while softmax gives every candidate a score and then normalizes the scores.

For example:

Suppose the original outputs are 5, 1, and -3. After the Softmax function takes effect, they are mapped into (0,1) as 0.9817, 0.0180, 0.0003, and these values sum to 1 (satisfying the property of a probability distribution), so we can interpret them as probabilities. When finally selecting the output node, we can pick the node with the largest probability (i.e., the largest value) as our prediction target!
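A minimal sketch reproducing those numbers:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([5.0, 1.0, -3.0])))   # ~[0.9817 0.0180 0.0003]
```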

Important properties of Softmax

To sum up, the Softmax classifier further transforms the score vector produced by the linear operation: through the Softmax function, [the current sample's score for each class] is converted into [the current sample's probability distribution over the classes], which is the real output of the model. This is where the name of the Softmax classifier comes from.

Note: the score vector fed into the Softmax classifier is regarded as unnormalized log probabilities. After the appropriate transformation, the output is normalized into the probability of each corresponding class. Once converted to probabilities, we can apply maximum likelihood estimation (cross entropy) to maximize the likelihood or minimize the cross entropy.

To convert unnormalized log probabilities into normalized probabilities: first raise e to the power of each log probability to turn it into an ordinary (non-negative) value, then normalize these values across all classes so that they add up to 1.

0x05 Word2vec (popular learning version)

After many hardships, I finally arrived at Word2vec…..

Word2vec, through training, maps each word from the sparse space of one-hot vectors to a shorter word vector. All these word vectors constitute a vector space, and we can then use ordinary statistical methods to study the relationships between words. That is, processing text content can be reduced to vector operations in this vector space, and similarity in the vector space can represent semantic similarity of the text.

For example, the word King might be expressed in four dimensions, "Royalty", "Masculinity", "Femininity", and "Age", with a corresponding word vector of perhaps (0.99, 0.99, 0.05, 0.7). Of course, in practice we cannot give such a clean interpretation to every dimension of a word vector.

1. Training schemes

Word2vec has two training schemes: CBOW and Skip-gram.

  • CBOW infers the target word from its context: the training input is the word vectors of the context words around a particular word, and the output is the word vector of that particular word.
  • Skip-gram does the opposite, inferring the context from the target word: the input is the word vector of a particular word, and the output is the word vectors of its context.

In layman's terms:

  • CBOW: "the surrounding words together predict the current word" (P(wt|Context))
  • Skip-gram: "the current word predicts each of the surrounding words" (P(wothers|wt))

For example, take "Hangzhou is a nice city". We need to construct the mapping between context and target word, i.e., the relationship between input and label. Assume the sliding window size is 1.

  • CBOW can produce mappings such as: [Hangzhou, a] → is, [is, nice] → a, [a, city] → nice
  • Skip-gram can produce mappings such as: (is, Hangzhou), (is, a), (a, is), (a, nice), (nice, a), (nice, city)

2. Basic structure

The Word2vec model is a simplified neural network. But the goal of this network is not to accurately predict the correct center/surrounding words; it is to obtain the word → vector mapping. Again, the specific breakdown is as follows:

  • Construct data: Construct word pairs from raw data in the form of [input word, output word], that is, [data x, label y].

  • Input layer: One-hot encoding of all words as input; the input is an n-dimensional vector (n is the number of words in the vocabulary).

  • Hidden layer: There is only one hidden layer in the middle (no activation function, just linear units). The hidden layer actually stores the word vectors of all the words in the vocabulary: it is a [vocabulary size x embedding size] matrix, and each row of the matrix corresponds to the word vector of a particular word.

  • Output layer: The output is also an n-dimensional vector, with the same dimension as the input layer, and the values of its dimensions add up to 1. Softmax regression is used; Softmax guarantees that the output vector is a probability distribution. Once converted to probabilities, we can apply maximum likelihood estimation (cross entropy) to maximize the likelihood or minimize the cross entropy.

  • Define the loss function: used to measure prediction error and optimize the model.

    Our label Y is a probability distribution, and the output layer after Softmax is also a probability distribution. Cross entropy can then measure the difference between the network's output and the label Y, which defines the loss.

  • Iterative training: A gradient descent optimization algorithm is generally used to minimize the cost function. Each iteration compares the loss between the prediction and the label Y and optimizes accordingly, so that in the end similar words have similar vectors.

  • Final matrix: The output layer is discarded; only the hidden layer's weights (namely the weights between the input layer and the hidden layer) are kept, forming the look-up table.

After the model is trained, we will not use the trained model to handle new tasks. What we really need is the parameters that the model has learned from the training data, such as the weight matrix of the hidden layer.

3. Break it down

3.1 Building Data

Take "The quick brown fox jumps over the lazy dog." as an example. There are eight distinct words in this sentence (here "The" and "the" count as the same word).

(x, y) is just a word pair. For example, (the, quick) is a word pair: "the" is the sample datum and "quick" is the sample's label.

For each scanned word, take the two words on its left and the two words on its right. This 2 is called the window size and can be adjusted. The four words taken this way, two on each side, each form a word pair with the scanned word, giving our training data. When words at the beginning or end of a sentence are scanned, a few word pairs are missing.

Finally, the training data were obtained:

```
(the, quick), (the, brown)
(quick, the), (quick, brown), (quick, fox)
(brown, the), (brown, quick), (brown, fox), (brown, jumps)
(fox, quick), (fox, brown), (fox, jumps), (fox, over)
...
```
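A minimal sketch of this pair-generation step (window size 2, sentence hard-coded for illustration):

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    # Take up to `window` words on each side of the scanned word.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```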

Take the word pairs (fox, jumps) and (fox, brown) for example: "jumps" and "brown" can both be interpreted as context of "fox". We hope that when "fox" is input, the neural network tells us which words are more likely to appear around it. If (fox, jumps) appears more often than (fox, brown) in the training set, the network will assign "jumps" a higher probability when predicting which words appear near "fox".

3.2 Input layer

The words are converted into one-hot encoded numeric form that can be used as input to the neural network. For example, (fox, jumps) might become [(00000001), (00000010)], depending on the encoding. The input layer is an n-dimensional vector, where n is the number of words in the vocabulary.

The input to the neural network is the one-hot encoding of the word pairs (x, y) in the training data, and the model learns the statistics from the number of occurrences of each pair. As mentioned above, if the model sees more training pairs like (fox, jumps) and fewer combinations like (fox, brown), then after training, given the word "fox" as input, the model assigns a higher probability to "jumps" than to "brown" in the output.

3.3 Hidden layer

The hidden layer actually stores the word vectors of all the words in the vocabulary: it is a [vocabulary size x embedding size] matrix, and each row of the matrix corresponds to the word vector of a particular word.

How many neurons should the hidden layer have? That depends on how many dimensions we want the word vectors to have: the number of hidden neurons equals the word-vector dimension. For example, if there are eight words in the vocabulary, each hidden neuron receives an 8-dimensional input vector. If we choose three hidden neurons, then the hidden-layer weights can be represented by a matrix with eight rows and three columns.

3.4 Cross entropy (for the principle; the following is reference code)

Reference example

Take TensorFlow's cross entropy function `tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None)` as a reference.

  • The logits parameter is the output of the last layer of the neural network (the W * X matrix). With a batch, its shape is [batchsize, num_classes]; for a single sample, it is [num_classes].
  • The labels parameter holds the actual labels, with the same shape as above.

The specific execution process is roughly divided into two steps:

  • First, apply softmax to the output of the last layer of the network. This computes the probability of the output falling into each category. For a single sample, the output is a vector of size num_classes, [Y1, Y2, Y3, …], where Y1, Y2, Y3, … are the probabilities of the respective classes.
  • Second, compute the cross entropy between the softmax output vector [Y1, Y2, Y3, …] and the sample's actual label.
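A numpy sketch of those two steps (an illustration of the math, not TensorFlow's actual kernel; the logits and labels are made up):

```python
import numpy as np

def softmax_cross_entropy_with_logits(logits, labels):
    # Step 1: softmax over the last layer's raw outputs.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    # Step 2: cross entropy between the softmax output and the label distribution.
    return -np.sum(labels * np.log(probs), axis=-1)

logits = np.array([[2.0, 1.0, 0.1]])   # made-up last-layer output for one sample
labels = np.array([[1.0, 0.0, 0.0]])   # one-hot label
print(softmax_cross_entropy_with_logits(logits, labels))   # ~[0.417]
```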
A further note on Logits:

So what exactly are logits? To understand them, first understand the odds.


$$Odds(A) = \frac{\text{number of times } A \text{ occurs}}{\text{number of times } A \text{ does not occur}}$$

In other words, Odds of event A equals the ratio of the number of occurrences of event A to the number of occurrences of other (non-A) events;

By contrast, the probability of event A is equal to the ratio of the number of occurrences of event A to the number of all events.

It is easy to see that probability P(A) and Odds(A) have different ranges: the former is locked to [0, 1], while the latter is [0, +∞).

What does this have to do with the logit?

Notice the decomposition of the word "logit": the log of it, i.e., the log of the odds. Here we can define the logit:


$$Logit(Odds) = \log\left(\frac{P}{1-P}\right)$$

This formula is actually called the Logit transformation.

One important feature of the logit, unlike probability, is that it has no upper or lower bound, which makes modeling easier and regression fitting straightforward!

In general, the logarithm in the logit uses the natural base e, and the log-odds is denoted by the symbol Θ:


$$\Theta = \ln\left(\frac{P}{1-P}\right)$$

Obviously, given Θ, we can easily recover the probability:


$$p = \frac{e^\Theta}{1+e^\Theta}$$

So we start with the logit transformation, which lets us fit the data easily (logistic regression), and then convert back to our familiar probabilities. This round trip facilitates data analysis; in a way, these transformations are very similar to catalysts in chemistry.
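A sketch of this round trip:

```python
import numpy as np

def logit(p):
    """Probability -> log-odds."""
    return np.log(p / (1 - p))

def inv_logit(theta):
    """Log-odds -> probability (the sigmoid)."""
    return np.exp(theta) / (1 + np.exp(theta))

p = 0.8
theta = logit(p)
print(theta, inv_logit(theta))   # 1.386... 0.8 -- the transform round-trips
```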

Application

After SoftMax processing, the logits become "normalized" probabilities (call them P1). The newly produced probabilities P1 and the probability distribution represented by the labels (call it P2) are then used as the arguments for computing the cross entropy.

This difference information serves as the basis for tuning the network parameters. Ideally, the two distributions should be as close as possible; if there is a difference (also called an error signal), we adjust the parameters to make it smaller. That is what the loss (error) function does.

Finally, through continuous tuning, the logits are locked to optimal values (optimal meaning that the cross entropy is minimized and the network parameters are optimal as well).

3.5 Training

We use maximum likelihood to construct an objective (loss) function; once the objective is constructed, we need an optimization method to find its parameters. Stochastic gradient descent (SGD) is the default standard method for minimizing the loss function f(W, b) (i.e., maximizing the likelihood function) and can be used to find the weights and biases in deep learning.

The purpose of gradient descent, frankly speaking, is to reduce the distance between the real values and the predicted values, and the loss function measures exactly that distance; so the purpose of gradient descent is to reduce the value of the loss function. How do we do that? We keep adjusting the hidden-layer weights so that the loss function becomes smaller and smaller.

SGD minimizes the loss function simply by subtracting the estimated gradient ∇f(W_k, b_k) at the k-th iteration update:


$$\text{parameter} = \text{parameter} - \text{learning rate} \times \frac{\partial\, \text{loss}}{\partial\, \text{parameter}}$$
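A generic SGD sketch on a made-up one-parameter problem, just to show the update rule above in code:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)   # made-up data with true weight 3.0

w, lr = 0.0, 0.05                          # parameter and learning rate
for _ in range(200):
    i = rng.integers(len(x))               # sample one point: the "stochastic" part
    grad = 2 * (w * x[i] - y[i]) * x[i]    # partial derivative of squared loss w.r.t. w
    w -= lr * grad                         # parameter -= learning rate * gradient
print(w)                                   # close to 3.0
```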

3.6 Output layer

Based on the training data, the neural network outputs a probability distribution representing, for each word in our dictionary, how likely it is to be the output word. The output layer has as many neurons as there are words in the corpus. Each neuron can be thought of as corresponding to the output weights of one word; multiplying the word vector by those output weights yields a number that represents how likely that output word is to appear near the input word. By applying a softmax over the outputs of all output-layer neurons, the output layer's output is structured into a probability distribution.

Softmax, frankly, means that original outputs such as 3, 1, and -3 are mapped through the softmax function into values in (0,1) whose sum is 1 (satisfying the property of a probability), so we can interpret them as probabilities.

One thing to note: when we say the output is the probability of a word appearing around the input word, this "around" includes words both before and after it.

For example, suppose we train on the novel Water Margin:

If we input the word "Song Gongming", then among the final model's output probabilities, related words like "brother" and "timely rain" will receive much higher probabilities than unrelated words like "this" and "wuyi", because "brother" and "timely rain" are more likely to appear within the window around "Song Gongming" in the text.

3.7 Final Result

Ultimately, what we need is the trained weight matrix W, not the probability values of the output layer. Multiplying each word's input-layer vector by the matrix W yields the word vector we want.

This matrix (the word embeddings of all the words) is also called the look-up table: the one-hot vector of any word multiplied by this matrix yields that word's own vector. With the look-up table, word vectors can be fetched directly from the table, without the training process.

For example, suppose the raw data is:

The brown fox jumps.

The results of one-hot encoding are as follows:

```
The      [1,0,0,0]
brown    [0,1,0,0]
fox      [0,0,1,0]
jumps    [0,0,0,1]
```

Suppose the final hidden-layer parameter matrix is


$$\left[ \begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{matrix} \right]$$

This matrix stores all the word vectors; the actual word vector of each word is

```
The      [1, 2, 3]
brown    [4, 5, 6]
fox      [7, 8, 9]
jumps    [10,11,12]
```

If, at application time, you input the word fox,


$$\left[\begin{matrix} 0 & 0 & 1 & 0 \end{matrix} \right]$$

The final word vector is:


$$\left[\begin{matrix} 7 & 8 & 9 \end{matrix} \right]$$

The calculation process is as follows:


$$\left[\begin{matrix} 0 & 0 & 1 & 0 \end{matrix} \right] \times \left[ \begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \end{matrix} \right] = \left[\begin{matrix} 7 & 8 & 9 \end{matrix} \right]$$
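The same calculation in numpy:

```python
import numpy as np

W = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 7,  8,  9],
              [10, 11, 12]])        # hidden-layer weights: [vocab size x embedding size]

fox = np.array([0, 0, 1, 0])        # one-hot vector for "fox"
print(fox @ W)                      # [7 8 9] -- the product just selects row 2
```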

3.8 Usage

Once you have the dense word vector, you can do the correlation calculation.

For example, the cosine of the angle between two vectors is used to represent the closeness of word meanings; the cosine similarity lies between -1 and 1, and the larger the cosine value, the closer the meanings of the two words.
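A sketch of this similarity calculation, with made-up word vectors:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up vectors purely for illustration.
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.2])
apple = np.array([0.1, 0.1, 0.9])

print(cosine(king, queen))   # larger: king and queen are semantically closer
print(cosine(king, apple))   # smaller: king and apple are less related
```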

0x06 Word2vec (actual production version)

The word vectors, which serve as the input of a neural probabilistic language model, are themselves actually a by-product of that model: an intermediate result of learning a language model through a neural network.

Specifically, "a language model" here refers to CBOW and Skip-gram. The learning process uses one of two approximation techniques, Hierarchical Softmax or Negative Sampling, to reduce complexity. Two models times two techniques gives four implementations.

1. Basic structure

In principle, the basic network structure should have four layers: an input layer, a mapping (projection) layer, a hidden layer, and an output layer.

But in practice, whichever model is used, the basic network structure omits the hidden layer. Why remove this layer? Reportedly because the matrix computations between the hidden layer and the output layer were too time-consuming for the authors of Word2vec.

So in the end, there are three layers, namely the input layer, the mapping layer and the output layer.

2. CBOW

CBOW is a language model that estimates the current word given its context.

2.1 Likelihood function

In the case of hierarchical Softmax (over a Huffman tree), computing the probability of the center word given the context vector amounts to a series of binary decisions, because going from the root node to the leaf node of the center word requires deciding several times whether to branch to the left or the right child. Each leaf node represents a word in the corpus, so every word can be uniquely encoded as a 0/1 string, and its code sequence corresponds to a sequence of events. If w denotes any word in corpus C, we can then compute w's conditional probability:


$$p(w|Context(w))$$

Then, for the center word w, the total probability along the path from the root node to w's leaf node is the product of the probabilities at those decision nodes.

Finally, its learning objective, maximizing the log-likelihood function, is obtained as:


$$L = \sum_{w \in C} \log p(w|Context(w))$$

2.2 Basic Structure

The input layer is the word vector of the context’s words.

In HS (Hierarchical Softmax) mode and Negative Sampling mode, the processing from the input layer to the mapping layer is the same; only the processing from the mapping layer to the output layer differs. The operation from input layer to mapping layer is: sum the word vectors within the context window and then average them, obtaining a vector with the same dimension as a word vector, say the context vector. This is the mapping-layer vector.

The output layer outputs the most likely w.

Because the vocabulary of the corpus is fixed at |C| words, the process can actually be treated as a classification problem: given the features, pick one class out of |C|.
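A sketch of the input-to-mapping-layer step (vocabulary size and dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 8, 3
W_in = rng.normal(size=(vocab_size, embed_size))   # input-side word vectors

context_ids = [0, 1, 3, 4]                         # indices of the context words
projection = W_in[context_ids].mean(axis=0)        # sum and average the context vectors
print(projection.shape)                            # (3,) -- same dimension as one word vector
```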

3. Skip-gram

Skip-gram simply reverses CBOW’s causality, that is, knowing the current word and predicting the context.

There are two differences with CBOW:

  • The input layer is no longer multiple word vectors, but a single word vector
  • The projection layer does nothing but pass the word vector from the input layer through to the output layer

In Hierarchical Softmax, let u denote a word in the context of w. Each u can be encoded as a 0/1 path, and writing them all together gives the objective function.

Thus, the probability function of the language model can be written as:


$$P(Context(w)|w) = \prod_{u \in Context(w)} p(u|w)$$

Notice that this is a bag-of-words model, so the words u are unordered, or in other words, treated as independent of each other.

Finally, its learning objective, maximizing the log-likelihood function, is obtained as:


$$L = \sum_{u \in Context(w)} \log p(u|w)$$

0xEE Personal information

★★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

If you want timely news of new articles, or want to see the technical materials I recommend, please follow the account.
