0x00 Abstract
Alink is a new-generation machine learning algorithm platform developed by Alibaba on top of the real-time computing engine Flink. It is the first machine learning platform in the industry that supports both batch and streaming algorithms. This article, together with the previous one, walks through the implementation of Word2Vec in Alink.
Since there is little public information about Alink, much of what follows is my own deduction, and there will certainly be omissions and mistakes. I hope you will point them out, and I will update the article accordingly.
0x01 Previous review
In the previous article we covered the concept of Word2Vec, its overall architecture in Alink, the input processing, and the construction of the dictionary and the Huffman binary tree.
Now that we have a constructed Huffman tree and initialized vectors, we can start to input text for training.
1.1 Overall flow chart
First, here is the overall flow chart of the training process:
1.2 Review of Huffman trees
1.2.1 Variable Definition
Now define the variables as follows:
- n: the number of words in a word's context, with the same meaning as the n in n-gram
- m: the length of a word vector, usually between 10 and 100
- h: the size of the hidden layer, generally on the order of 100
- N: the size of the dictionary, usually between 10,000 and 100,000
- T: the number of words in the training text
1.2.2 Why Huffman Trees
Word2vec also uses CBOW and Skip-Gram to train the model and obtain word vectors, but it does not use a traditional DNN directly. The first data-structure optimization is to replace the neurons of the hidden layer and the output layer with a Huffman tree: the leaf nodes of the Huffman tree play the role of the output-layer neurons, the number of leaf nodes equals the size of the vocabulary, and the internal nodes play the role of the hidden-layer neurons.
Take CBOW as an example: the input layer consists of the word vectors of n-1 context words with total length m(n-1), the hidden layer has size h, and the output layer has size N. The forward pass therefore takes O(m(n-1)h + hN) = O(hN), and that is only the time needed to process a single word. Processing all of the text would require O(hNT) time, which is unacceptable.
We also notice that in O(hNT) the values of h and T are relatively fixed, so optimization has to target N. The output layer has size N because the network performs an N-way (choose 1 of N) classification. Can N be reduced? Yes: decompose the single classification into a sequence of smaller classifications. This is the core idea of Hierarchical Softmax.
For example, suppose the label set A is [1,2,3,4,5,6,7,8]. If a sample is first judged to belong to [1,2,3,4], we then decide whether it belongs to [1,2] or [3,4], and so on. This reduces the time complexity per word from O(hN) to O(h log N) and, more importantly, reduces the memory overhead.
From input to output there is now a tree structure in the middle, and each node of the tree solves a binary (logistic) classification problem. The remaining question is how to construct the tree. A Huffman tree is used, so that words with higher frequency get shorter paths, which minimizes the average path length over all words.
Given that the path of the word "soccer" is 1001, the model does not output the code 1001 directly; it performs a binary classification at each node on the path. This turns the final N-way output into a sequence of binary tasks. Every non-leaf node on the path carries a vector to be learned: the model therefore learns not only the word vectors of the input words but also a vector for each non-leaf node of the binary tree, and the latter are only auxiliary parameters.
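To make the chain of binary classifications concrete, the standard Hierarchical Softmax formulation (written here for reference, not copied from Alink's code) expresses the probability of a word w given a projection vector x as a product of sigmoids along the Huffman path of w:

$$
p(w \mid x) = \prod_{j=2}^{l_w} p\left(d_j^w \mid x, \theta_{j-1}^w\right),
\qquad
p\left(d_j^w \mid x, \theta_{j-1}^w\right) =
\begin{cases}
\sigma\left(x^{\top}\theta_{j-1}^w\right), & d_j^w = 0,\\
1 - \sigma\left(x^{\top}\theta_{j-1}^w\right), & d_j^w = 1,
\end{cases}
$$

where l_w is the path length, d_j^w ∈ {0,1} is the Huffman code of the j-th node on the path, and θ_{j-1}^w is the vector stored in the corresponding non-leaf node. Each factor is one logistic classification, which is why only about log N of them are needed per word.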
0x02 Training
2.1 Training Process
Alink implemented the Skip-Gram model based on Hierarchical Softmax.
Now let's first look at how Hierarchical Softmax works in the Skip-Gram model. The input is the center word w, and the outputs are the word vectors of the 2c words in its context, Context(w).
The training process involves three layers: the input layer, the projection layer and the output layer.
- For each word in a training sample, the word itself is the input, and the c words before it and the c words after it are the outputs of the Skip-Gram model; we want the softmax probabilities of these context words to be higher than those of all other words.
- The vocabulary first has to be built into a Huffman tree (this was done in the previous article).
- For the input layer to the hidden (projection) layer, this step is simpler than in CBOW: since there is only one input word, x_w is simply the word vector of the word w.
- Gradient ascent is used to update θ_{j-1}^w and x_w. Note that x_w here actually involves the 2c surrounding word vectors: we want P(x_i | x_w), i = 1, 2, ..., 2c to be maximal. Because the context relation is symmetric, while maximizing P(x_i | x_w) we also expect P(x_w | x_i), i = 1, 2, ..., 2c to be maximal. So should P(x_i | x_w) or P(x_w | x_i) be used? word2vec uses the latter. The advantage is that within one window we update not just the single vector x_w but all 2c vectors x_i, i = 1, 2, ..., 2c, which makes the overall iteration more balanced. For this reason the Skip-Gram model does not iteratively update the input the way CBOW does, but instead iteratively updates the 2c output words.
- Starting from the root node, the value of the projection layer is repeatedly classified logistically along the Huffman tree, and both the intermediate vectors and the word vectors are corrected along the way.
- Assume the input of the projection layer is pro(t) and the word is "football", i.e. w(t) = "football", and assume its Huffman code is d(t) = "1001". According to this code, the path from the root node to the leaf node is left, right, right, left: starting from the root node, first go left, then go right twice, and finally go left.
- Since the path is known, the intermediate vectors of the nodes along the path are corrected from top to bottom. At the first node, a logistic classification is performed using the node's intermediate vector θ(t,1) and pro(t). If the classification result is 0, the classification is wrong (it should have gone left, i.e. produced 1), so θ(t,1) is corrected and the amount of error is recorded.
- Next, the second node is processed in the same way: θ(t,2) is corrected and the error is accumulated, and so on for the remaining nodes on the path.
- After all nodes on the path have been processed and the leaf node is reached, the word vector v(w(t)) is corrected according to the accumulated error. The learning rate η is introduced here: the larger the learning rate, the larger the penalty for a wrong decision and the larger the correction applied to the intermediate vectors.
This completes the processing of one word w(t). If the text contains T words, the above process is repeated T times, from w(0) to w(T-1).
The algorithm flow of the Skip-Gram model based on Hierarchical Softmax is summarized below; the gradient iteration uses stochastic gradient ascent (the per-node update is spelled out after the list):
- Input: the Skip-Gram training corpus, the word vector dimension m, the Skip-Gram context size 2c, and the step size (learning rate) η
- Output: the model parameters θ of the Huffman tree internal nodes, and all the word vectors w
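Written out, the per-node update that the stochastic gradient ascent performs is the following (a standard Hierarchical Softmax Skip-Gram update, given here for reference; it corresponds to the variables f, g, neu1e, output and input in CalcModel.update later in this article):

$$
f = \sigma\left(x_i^{\top}\theta_{j-1}^w\right),
\qquad
g = \eta\left(1 - d_j^w - f\right),
$$
$$
e \leftarrow e + g\,\theta_{j-1}^w,
\qquad
\theta_{j-1}^w \leftarrow \theta_{j-1}^w + g\,x_i,
\qquad
x_i \leftarrow x_i + e \ \text{(after the whole path has been processed)},
$$

where x_i is the word vector of the context word being updated, d_j^w is the Huffman code of the j-th node on the path of the center word w, η is the learning rate, and e is the accumulated error (neu1e in the code).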
2.2 Generating the training model
The intermediate vectors stored in the non-leaf nodes of the Huffman tree are initialized to the zero vector, while the word vectors of the leaf-node words are initialized randomly. In the pipeline below we can see that AllReduce is applied to both input and output, i.e. the computation results of all tasks are aggregated.
DataSet <Row> model = new IterativeComQueue()
.initWithPartitionedData("trainData", trainData)
.initWithBroadcastData("vocSize", vocSize)
.initWithBroadcastData("initialModel", initialModel)
.initWithBroadcastData("vocabWithoutWordStr", vocabWithoutWordStr)
.initWithBroadcastData("syncNum", syncNum)
.add(new InitialVocabAndBuffer(getParams()))
.add(new UpdateModel(getParams()))
.add(new AllReduce("input"))
.add(new AllReduce("output"))
.add(new AvgInputOutput())
.setCompareCriterionOfNode0(new Criterion(getParams()))
.closeWith(new SerializeModel(getParams()))
.exec();
2.3 Initialize dictionary & buffer
The InitialVocabAndBuffer class initializes the parameters and loads the dictionary into the model memory. It only runs in the first iteration.
- The input array holds the word vectors of all vocabulary words, i.e. the word vectors of all Huffman tree leaf nodes. Its size is |V| * m, and each element is initialized uniformly in the range [-0.5/m, 0.5/m], an empirical rule (a sketch of these initialization rules follows the list).
- The output array holds the parameters of Hierarchical Softmax, i.e. the parameter vectors of the non-leaf nodes of the Huffman tree (the weights between the projection layer and the output layer). There are only |V| - 1 non-leaf nodes, so its size is (|V| - 1) * m, and it is initialized to 0, also an empirical rule.
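For reference, here is a minimal sketch of the empirical initialization rules just described. The actual random initialization of initialModel happens in an earlier stage of the pipeline, so this hypothetical InitSketch class is only an illustration, not Alink's code:

```java
import java.util.Random;

// Sketch of word2vec-style initialization, assuming vectorSize = m and vocSize = |V|.
public class InitSketch {
    // Leaf-node word vectors: uniform in [-0.5/m, 0.5/m]
    public static double[] initInput(int vocSize, int vectorSize, long seed) {
        Random random = new Random(seed);
        double[] input = new double[vocSize * vectorSize];
        for (int i = 0; i < input.length; ++i) {
            input[i] = (random.nextDouble() - 0.5) / vectorSize;
        }
        return input;
    }

    // Non-leaf-node parameter vectors: |V| - 1 of them, all zeros
    public static double[] initOutput(int vocSize, int vectorSize) {
        return new double[(vocSize - 1) * vectorSize];
    }
}
```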
private static class InitialVocabAndBuffer extends ComputeFunction {
Params params;
public InitialVocabAndBuffer(Params params) {
this.params = params;
}
@Override
public void calc(ComContext context) {
if (context.getStepNo() == 1) { // Only the first iteration will run
int vectorSize = params.get(Word2VecTrainParams.VECTOR_SIZE);
List <Long> vocSizeList = context.getObj("vocSize");
List <Tuple2 <Integer, double[]>> initialModel = context.getObj("initialModel");
List <Tuple2 <Integer, Word>> vocabWithoutWordStr = context.getObj("vocabWithoutWordStr");
int vocSize = vocSizeList.get(0).intValue();
// Allocate the input array of size vectorSize * vocSize (100 x 12 here); the final word vectors are iterated in this array
double[] input = new double[vectorSize * vocSize];
Word[] vocab = new Word[vocSize];
for (int i = 0; i < vocSize; ++i) {
Tuple2 <Integer, double[]> item = initialModel.get(i);
System.arraycopy(item.f1, 0, input,
item.f0 * vectorSize, vectorSize); // Initialize the word vector
Tuple2 <Integer, Word> vocabItem = vocabWithoutWordStr.get(i);
vocab[vocabItem.f0] = vocabItem.f1;
}
context.putObj("input", input); // Put the word vector into the system context
// Allocate the output array of size vectorSize * (vocSize - 1) (100 x 11 here), which holds the Hierarchical Softmax parameters
context.putObj("output", new double[vectorSize * (vocSize - 1)]);
context.putObj("vocab", vocab);
context.removeObj("initialModel");
context.removeObj("vocabWithoutWordStr");
}
}
}
2.4 Updating the model: UpdateModel
This is where the work of the "distributed computing" is divided up: deciding how much data goes to which task and where each slice starts is handled by DefaultDistributedInfo, and the analysis needs to be read together with the Pieces function. The AllReduce communication model was introduced in detail in [Alink Ramble III]. The actual computation is done in CalcModel.update; a sketch of the slicing arithmetic is given right after the code below.
private static class UpdateModel extends ComputeFunction {
Params params;
public UpdateModel(Params params) {
this.params = params;
}
@Override
public void calc(ComContext context) {
List <int[]> trainData = context.getObj("trainData");
int syncNum = ((List <Integer>) context.getObj("syncNum")).get(0);
DistributedInfo distributedInfo = new DefaultDistributedInfo();
long startPos = distributedInfo.startPos(
(context.getStepNo() - 1) % syncNum,
syncNum,
trainData.size()
);
long localRowCnt = distributedInfo.localRowCnt( // Calculate the partition information
(context.getStepNo() - 1) % syncNum,
syncNum,
trainData.size()
);
new CalcModel( // Update the model
params.get(Word2VecTrainParams.VECTOR_SIZE),
System.currentTimeMillis(),
Boolean.parseBoolean(params.get(Word2VecTrainParams.RANDOM_WINDOW)),
params.get(Word2VecTrainParams.WINDOW),
params.get(Word2VecTrainParams.ALPHA),
context.getTaskId(),
context.getObj("vocab"),
context.getObj("input"),
context.getObj("output")
).update(trainData.subList((int) startPos, (int) (startPos + localRowCnt)));
}
}
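The exact slicing rule lives in DefaultDistributedInfo. The following is a minimal sketch of an even split in which the remainder goes to the lowest-numbered slices; it mirrors how (stepNo - 1) % syncNum, syncNum and trainData.size() are combined above, but the hypothetical SplitSketch class is an assumption made for illustration, not Alink's actual implementation:

```java
// Sketch: split `total` rows into `pieces` slices; slice `id` gets rows [startPos, startPos + cnt).
public class SplitSketch {
    public static long startPos(int id, int pieces, long total) {
        long base = total / pieces;
        long remainder = total % pieces;
        // the first `remainder` slices each get one extra row
        return id * base + Math.min(id, remainder);
    }

    public static long localRowCnt(int id, int pieces, long total) {
        long base = total / pieces;
        return base + (id < total % pieces ? 1 : 0);
    }

    public static void main(String[] args) {
        // 10 training rows split over syncNum = 3 rounds: the rounds get 4, 3 and 3 rows
        for (int round = 0; round < 3; ++round) {
            System.out.println("round " + round + ": start=" + startPos(round, 3, 10)
                    + ", cnt=" + localRowCnt(round, 3, 10));
        }
    }
}
```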
2.5 Computing Updates
CalcModel.update performs the actual computation and update.
2.5.1 Approximate calculation of the sigmoid function
When the neural network predicts a sample, the sigmoid function is used: σ(x) = 1 / (1 + e^(-x)).
σ(x) changes sharply around x = 0 and flattens out further away: as x rises above 6, σ(x) approaches 1, and as x falls below -6, σ(x) approaches 0.
Computing the exact sigmoid value every time would hurt performance, and the precision requirement here is not strict, so an approximation can be used. In word2vec, the interval [-6, 6] (MAX_EXP is set to 6) is divided into EXP_TABLE_SIZE equal parts; the sigmoid value of each sub-interval is precomputed and stored in the array expTable, and later evaluations simply look up the array (a sketch of how such a table is generated follows the Alink snippet below).
The implementation in Alink is as follows:
public class ExpTableArray {
public final static float[] sigmoidTable = {
0.002473f, 0.002502f, 0.002532f, 0.002562f, 0.002593f, 0.002624f, 0.002655f, 0.002687f, 0.002719f, 0.002751f, ... // table truncated
};
}
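Alink ships this table as precomputed constants. For reference, here is a sketch of how such a table is typically generated and queried in word2vec; EXP_TABLE_SIZE = 1000 and MAX_EXP = 6 are the usual word2vec constants and are assumed here, not read from Alink's source:

```java
public class ExpTableSketch {
    static final int EXP_TABLE_SIZE = 1000; // number of buckets over [-MAX_EXP, MAX_EXP]
    static final int MAX_EXP = 6;

    // Precompute sigmoid values: expTable[i] ~ sigmoid(x) for x = (2i / EXP_TABLE_SIZE - 1) * MAX_EXP
    static float[] buildTable() {
        float[] expTable = new float[EXP_TABLE_SIZE];
        for (int i = 0; i < EXP_TABLE_SIZE; ++i) {
            double e = Math.exp((i / (double) EXP_TABLE_SIZE * 2 - 1) * MAX_EXP);
            expTable[i] = (float) (e / (e + 1.0));
        }
        return expTable;
    }

    // Lookup: map x in (-MAX_EXP, MAX_EXP) to a bucket index; this is the same idea as the
    // `(f + 6.0) * 84.0` indexing used in CalcModel below (84 is roughly 1000 / 12)
    static float sigmoid(float[] expTable, double x) {
        int idx = (int) ((x + MAX_EXP) * (EXP_TABLE_SIZE / (double) (MAX_EXP * 2)));
        return expTable[idx];
    }
}
```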
2.5.2 Window and Context
In word2vec the context window is randomized: for each position a number c ∈ [1, window] is drawn, and the c words before w and the c words after w form Context(w). In Alink this is implemented by drawing b = random.nextInt(window), which gives c = window - b:
if (randomWindow) {
b = random.nextInt(window);
} else {
b = 0;
}
int bound = window * 2 + 1 - b;
for (int a = b; a < bound; ++a) {
.....
}
2.5.3 Training
2.5.3.1 Data Structure
In the original C word2vec code:
- The syn0 array holds the word vectors of all vocabulary words, i.e. the word vectors of all Huffman tree leaf nodes: the input -> hidden weights. Its size is |V| * m, initialized in the range [-0.5/m, 0.5/m] (an empirical rule). It is a one-dimensional array in the code but should be understood as two-dimensional: syn0[i, j] is the j-th component of the vector of the i-th word.
- The syn1 array holds the parameters of Hierarchical Softmax, i.e. the parameter vectors of the non-leaf nodes of the Huffman tree: the hidden -> output weights. Its size is |V| * m, initialized to 0 (also an empirical rule); in fact only |V| - 1 of the vectors are used.
The original K-way softmax problem is thus converted into a decision tree in which each prediction requires only about log(K) logistic regressions.
The K parameter groups θ of the softmax become K - 1 groups, one per non-leaf node of the binary tree; in word2vec they are stored in the syn1 array.
In the Alink code:
- input corresponds to syn0, i.e. the word vectors v.
- output corresponds to syn1, i.e. the node parameters θ.
2.5.3.2 Specific code
The code is as follows (maximum likelihood is used to learn the word vectors of all words as well as the parameters θ of all internal nodes):
private static class CalcModel {
public void update(List <int[]> values) {
double[] neu1e = new double[vectorSize];
double f, g;
int b, c, lastWord, l1, l2;
for (int[] val : values) {
for (int i = 0; i < val.length; ++i) {
if (randomWindow) {
b = random.nextInt(window);
} else {
b = 0;
}
// In the Skip-Gram model, each context word is predicted separately from the current center word, so this is a loop
// Every word in Context(w) has to be predicted, so the loop covers 2 * window + 1 - 2b positions of the window (the center position is skipped below)
int bound = window * 2 + 1 - b;
for (int a = b; a < bound; ++a) {
if (a != window) { // Skip the center word
c = i - window + a;
if (c < 0 || c >= val.length) {
continue;
}
lastWord = val[c]; //last_word is the current context word to be predicted
l1 = lastWord * vectorSize; //l1 is the starting position of the current word's word vector in syn0
Arrays.fill(neu1e, 0.f); // Initialize the cumulative error
Word w = vocab[val[i]];
int codeLen = w.code.length;
// Walk through all intermediate nodes of the Huffman tree on the path from the root node to the leaf of the current center word
for (int d = 0; d < codeLen; ++d) {
f = 0.f;
//l2 is the starting position of the vector of the currently traversed intermediate node in SYN1
l2 = w.point[d] * vectorSize;
// Forward propagation, get the corresponding output value f of the encoding unit
// Note: model symmetry is used here: P(u|w) = P(w|u), where w is the center word and u is a word in Context(w).
// Although Skip-Gram conceptually predicts the context from the center word, the actual training uses the context word to predict the center word.
// Unlike CBOW, u here is the word vector of a single word, not the sum of the vectors in the window.
// Walk along the path and accumulate the inner product of the input word vector and the intermediate node vector
// f = x . theta (the sigmoid is applied below via the lookup table)
for (int t = 0; t < vectorSize; ++t) {
// This is X times Y
// The mapping layer is the input layer
f += input[l1 + t] * output[l2 + t];
}
if (f > -6.0f && f < 6.0f) {
// Query the corresponding value from ExpTableArray.
f = ExpTableArray.sigmoidTable[(int) ((f + 6.0) * 84.0)];
// Log-likelihood at this node: L = x * log(p) + (1 - x) * log(1 - p)
// where p = sigmoid(input[l1..] . output[l2..]), i.e. the value f above,
// and the label is defined as x = 1 - code (it could equally well be defined as code)
// Note: word2vec treats a node with Huffman code 1 as the negative class and code 0 as the positive class, i.e. label = 1 - d
// The derivative of L with respect to the inner product is x - p = (1 - code - f),
// so g below is the gradient multiplied by the learning rate alpha
g = (1.f - w.code[d] - f) * alpha;
// Propagate errors output -> hidden
// According to the calculated correction g and the vector update cumulative error of the intermediate node
for (int t = 0; t < vectorSize; ++t) {
neu1e[t] += g * output[l2 + t]; // Modify the mapping result
}
// Learn weights hidden -> output
for (int t = 0; t < vectorSize; ++t) {
output[l2 + t] += g * input[l1 + t]; // Update the weights between the projection layer and the output layer
}
}
}
for (int t = 0; t < vectorSize; ++t) {
input[l1 + t] += neu1e[t]; // Update the word vector of the context word with the accumulated error
}
}
}
}
}
}
}
2.6 Averaging
The AvgInputOutput class averages input and output over the tasks.
.add(new AllReduce("input"))
.add(new AllReduce("output"))
.add(new AvgInputOutput())
The reason is that AllReduce simply sums the arrays of all tasks. If context.getNumTask() tasks run at the same time, the summed result is context.getNumTask() times too large, so each element is divided by the number of tasks to obtain the average.
private static class AvgInputOutput extends ComputeFunction {
@Override
public void calc(ComContext context) {
double[] input = context.getObj("input");
for (int i = 0; i < input.length; ++i) {
input[i] /= context.getNumTask(); // average
}
double[] output = context.getObj("output");
for (int i = 0; i < output.length; ++i) {
output[i] /= context.getNumTask(); // average
}
}
}
2.7 Judging convergence
You can see that the convergence criterion is simply the number of iterations: the loop stops after syncNum * numIter steps.
private static class Criterion extends CompareCriterionFunction {
@Override
public boolean calc(ComContext context) {
return (context.getStepNo() - 1)
== ((List <Integer>) context.getObj("syncNum")).get(0) * params.get(Word2VecTrainParams.NUM_ITER);
}
}
2.8 Serializing the model
Serialization is done only in the task whose context.getTaskId() is 0; the other tasks return directly. The computation results of all tasks are collected here.
private static class SerializeModel extends CompleteResultFunction {
@Override
public List <Row> calc(ComContext context) {
// Serialize only in the task whose getTaskId() is 0
if (context.getTaskId() != 0) {
return null; // Other tasks return directly
}
int vocSize = ((List <Long>) context.getObj("vocSize")).get(0).intValue();
int vectorSize = params.get(Word2VecTrainParams.VECTOR_SIZE);
List <Row> ret = new ArrayList <>(vocSize);
double[] input = context.getObj("input");
for (int i = 0; i < vocSize; ++i) {
// Complete serialization
DenseVector dv = new DenseVector(vectorSize);
System.arraycopy(input, i * vectorSize, dv.getData(), 0, vectorSize);
ret.add(Row.of(i, dv));
}
return ret;
}
}
0x03 Output model
The code that outputs the model is as follows; it does three things:
- Associate the dictionary with the calculated vector
- Split the model into rows by partition
- Send the model
model = model
.map(new MapFunction <Row, Tuple2 <Integer, DenseVector>>() {
@Override
public Tuple2 <Integer, DenseVector> map(Row value) throws Exception {
return Tuple2.of((Integer) value.getField(0), (DenseVector) value.getField(1));
}
})
.join(vocab)
.where(0)
.equalTo(0) // Associate the dictionary with the calculated vector
.with(new JoinFunction <Tuple2 <Integer, DenseVector>, Tuple3 <Integer, String, Word>, Row>() {
@Override
public Row join(Tuple2 <Integer, DenseVector> first, Tuple3 <Integer, String, Word> second)
throws Exception {
return Row.of(second.f1, first.f1);
}
})
.mapPartition(new MapPartitionFunction <Row, Row>() {
@Override
public void mapPartition(Iterable <Row> values, Collector <Row> out) throws Exception {
Word2VecModelDataConverter model = new Word2VecModelDataConverter();
model.modelRows = StreamSupport
.stream(values.spliterator(), false)
.collect(Collectors.toList());
model.save(model, out);
}
});

setOutput(model, new Word2VecModelDataConverter().getModelSchema());
3.1 Joining the dictionary with the vectors
.join(vocab).where(0).equalTo(0) joins the dictionary with the computed vectors. The two join inputs look like this:
// Source 1, the computed vector
first = {Tuple2@11501}
f0 = {Integer@11509} 9
f1 = {DenseVector@11502} "0.9371751984171548 0.33341686580829943 0.6472255126130384 0.36692156358000316 0.1187895685629788 0.9223451469664975 0.763874142430857 0.1330720374498615 0.9631811135902764 0.9283700030050634..."
// Source 2, dictionary
second = {Tuple3@11499} "(9, 我们, com.alibaba.alink.operator.batch.nlp.Word2VecTrainBatchOp$Word@1ffa469)"
f0 = {Integer@11509} 9
f1 = "我们"
f2 = {Word2VecTrainBatchOp$Word@11510}
3.2 Splitting the model into rows by partition
Within each partition, the model is collected into rows. This uses the Java 8 features StreamSupport and Spliterator.
Note that only a sequential stream is used here (the second argument of StreamSupport.stream is false), not a parallel one; parallelism here may be explored in a future article.
model.modelRows = StreamSupport
.stream(values.spliterator(), false)
.collect(Collectors.toList());
For example, a partition yields:
model = {Word2VecModelDataConverter@11561}
modelRows = {ArrayList@11562} size = 3
0 = {Row@11567} "Fat,0.4345151137723066 0.4923534386513069 0.49497589358976174 0.10917632806760409 0.7007392318076214 0.6468149904858065 0.3804865818632239 0.4348997489483902 0.03362685646645655 0.29769437681180916 0.04287936035337748..."
1 = {Row@11568} ",0.4347763498886036 0.6852891840621573 0.9862851622413142 0.7061202166493431 0.9896492612656784 0.46525497532250026 0.03379287230189395 0.809333161215095 0.9230387687661015 0.5100444513892355 0.02436724648194081..."
2 = {Row@11569} "Lao Wang,0.4337285110643647 0.7605192699353084 0.6638406386520266 0.909594031681524 0.26995654043189604 0.3732722125930673 0.16171135697228312 0.9759668223869069 0.40331291071231623 0.22651841541002585 0.7150087001048662..."
3.3 Sending Data
Then the model rows are sent out through the collector:
public class Word2VecModelDataConverter implements ModelDataConverter<Word2VecModelDataConverter, Word2VecModelDataConverter> {
public List <Row> modelRows;
@Override
public void save(Word2VecModelDataConverter modelData, Collector<Row> collector) {
modelData.modelRows.forEach(collector::collect); // Send data
}
@Override
public TableSchema getModelSchema() {
return new TableSchema( // return the schema
new String[] {"word", "vec"},
new TypeInformation[] {Types.STRING, VectorTypes.VECTOR}
);
}
}
0x04 Answers to the questions
We raised several questions earlier; here are the answers one by one:
- Which modules make use of Alink’s distributed processing capabilities? The answer is:
- Word segmentation, counting (to eliminate low-frequency words, sorting);
- Word sorting;
- Training;
- Which word2vec model does Alink implement, CBOW or Skip-Gram? The answer is:
- The Skip-Gram model
- Which optimization method does Alink use, Hierarchical Softmax or Negative Sampling? The answer is:
- Hierarchical Softmax
- Are stop words removed in this algorithm? Stop words are words that appear so frequently (commas, periods, and so on) that they carry little distinguishing information. The answer is:
- Stop words are not removed in this implementation
- Is an adaptive learning rate used? The answer is:
- No
0xEE Personal information
★★★★ Thoughts on life and technology ★★★★★
Wechat official account: Rosie’s Thoughts
If you want timely notifications of new articles, or want to see the technical material I recommend, please follow this account.
0xFF Reference
Word2vec principle derivation and code analysis
Deep text representation model Word2Vec
Word2vec principles (II): the model based on Hierarchical Softmax
Word2vec principles (I): CBOW and Skip-Gram model basics
Word2vec principles (III): the model based on Negative Sampling
Word2vec overview
Understanding Word2Vec
Write word2vec yourself (I): main concepts and process
Write word2vec yourself (II): counting word frequency
Write word2vec yourself (III): building the Huffman tree
Write word2vec yourself (IV): CBOW and Skip-Gram models
Detailed explanation of the mathematics in word2vec (I): contents and preface
The model based on Hierarchical Softmax
The model based on Negative Sampling
Machine learning algorithm implementation analysis: word2vec source code analysis
Word2Vec source code analysis
Word2vec source code ideas and key variables
Word2Vec source code detailed analysis (Part 2)