Training word vectors with GloVe

Input: corpus

Output: word vectors

The steps are: download the code, prepare the corpus, and train the model.

1. Download the code

git clone https://github.com/stanfordnlp/GloVe.git

After the clone completes you get a GloVe folder; its layout is roughly as follows (exact contents may vary by repository version):
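GloVe/                 # layout of the upstream stanfordnlp/GloVe repository (approximate)
├── demo.sh            # end-to-end script: build, fetch sample data, train, evaluate
├── Makefile           # builds the training binaries into build/
├── README.md
├── LICENSE
├── src/               # C sources for vocab_count, cooccur, shuffle, glove
└── eval/              # evaluation scripts (python / matlab / octave)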

2. Prepare the corpus

The corpus is plain text that has already been word-segmented, with words separated by spaces. The segmentation tool is up to you (jieba, for example). A sample looks like this:

Urgent notice Notice 7:50 on time Wuling Building meeting open early break up eleven eleven eleven thirty one one thirty on time break up time meeting party service members 12.30 tomorrow noon first first regular meeting place to be determined please allow time to receive Please reply to all members about the time of the meeting. Inform the Publicity department that the department will have a copy of this department regular meeting at 1230 noon tomorrow, Wednesday j4101. Please bring your pen with you Member time session session session session session
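If your corpus is Chinese, one way to produce such a space-separated file is jieba's command-line mode. This is only an illustration: jieba is not part of GloVe and must be installed separately, and raw_corpus.txt is a placeholder filename.

# Illustrative only: segment raw Chinese text into space-separated tokens with jieba
# pip install jieba   (assumes jieba's CLI is available as: python -m jieba)
python -m jieba -d ' ' raw_corpus.txt > counts.txt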

Put the corpus (counts.txt) in the GloVe directory and point the CORPUS variable in demo.sh to it. The relevant part of the original demo.sh is as follows:

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

make

if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi

CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

The modified demo.sh is as follows (the make/download section is commented out, CORPUS points to counts.txt, and VECTOR_SIZE is set to 300):

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

#make
#if [ ! -e text8 ]; then
#  if hash wget 2>/dev/null; then
#    wget http://mattmahoney.net/dc/text8.zip
#  else
#    curl -O http://mattmahoney.net/dc/text8.zip
#  fi
#  unzip text8.zip
#  rm text8.zip
#fi

CORPUS=counts.txt
#CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=300
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

3. Train the model

Start by executing make in the GloVe directory (on Windows you may need to install MinGW if make is not available):

make

When make finishes, a build folder is created containing the executables needed for training (vocab_count, cooccur, shuffle, and glove).
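For reference, demo.sh drives those executables roughly as follows. This is a simplified sketch of the stock script; the variables are the ones set at the top of demo.sh.

# Sketch of the pipeline inside demo.sh (simplified)
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX \
    -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE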

Then run demo.sh:

bash demo.sh    # or: sh demo.sh

When training finishes, we get the output vector files:

Each line of vocab.txt is a word followed by its frequency in the corpus.

vectors.txt contains the trained word vectors: each line is a word followed by its vector values.
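A quick way to sanity-check the outputs from the GloVe directory is shown below ("meeting" is just an example token; substitute any word from your corpus):

head -n 5 vocab.txt                        # each line: word <space> frequency
awk 'NR==1{print NF-1; exit}' vectors.txt  # vector dimensionality, should equal VECTOR_SIZE
grep -m 1 '^meeting ' vectors.txt | cut -d' ' -f1-6   # a word and its first five dimensions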