Training word vectors with GloVe
Input: corpus
Output: word vectors
The steps are: download the code, prepare the corpus, and train the model.
1. Download the code
git clone https://github.com/stanfordnlp/GloVe.git
The download produces a GloVe folder containing, among other things, the source code (src), the Makefile, and the demo.sh training script used below.
2. Prepare the corpus
The corpus is plain text that has already been word-segmented, with words separated by spaces. The segmentation method is up to you, for example jieba (a minimal segmentation sketch follows the sample below). A sample of the segmented corpus looks like this:
Urgent notice Notice 7:50 on time Wuling Building meeting open early break up eleven eleven eleven thirty one one thirty on time break up time meeting party service members 12.30 tomorrow noon first first regular meeting place to be determined please allow time to receive Please reply to all members about the time of the meeting. Inform the Publicity department that the department will have a copy of this department regular meeting at 1230 noon tomorrow, Wednesday j4101. Please bring your pen with you Member time session session
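A minimal segmentation sketch in Python, assuming jieba is installed (pip install jieba) and the raw, unsegmented text sits in a hypothetical file raw.txt; the file names are just examples:

# Segment raw text with jieba and write a space-separated corpus file.
import jieba

with open("raw.txt", encoding="utf-8") as fin, \
     open("counts.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        # jieba.cut returns a generator of tokens; join them with spaces
        tokens = jieba.cut(line.strip())
        fout.write(" ".join(tokens) + "\n")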
Put the corpus (counts.txt) in the GloVe directory and point demo.sh at it. Part of the original demo.sh is as follows:
#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

make

if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi

CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10
The modified demo.sh (the make call and the text8 download commented out, CORPUS pointed at counts.txt, VECTOR_SIZE raised to 300) is as follows:

#!/bin/bash
set -e

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

#make

#if [ ! -e text8 ]; then
#  if hash wget 2>/dev/null; then
#    wget http://mattmahoney.net/dc/text8.zip
#  else
#    curl -O http://mattmahoney.net/dc/text8.zip
#  fi
#  unzip text8.zip
#  rm text8.zip
#fi

CORPUS=counts.txt        # was: CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=300          # was: VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10
3. Train the model
Start by executing make in the GloVe directory (on Windows you may need to install MinGW if make is not available):
make
When the command finishes, a build folder is generated containing the compiled executables (vocab_count, cooccur, shuffle, and glove) needed for training.
Then run demo.sh:
bash demo.sh    # or: sh demo.sh
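For reference, demo.sh chains four stages: counting the vocabulary, building the co-occurrence matrix, shuffling it, and training the vectors. Below is a rough Python sketch of those stages driven via subprocess, assuming make has produced the binaries under build/ and counts.txt is in the GloVe directory; the binary names and flags follow the stock demo.sh in stanfordnlp/GloVe, so check your copy of the script if they differ.

import subprocess

def run(cmd, stdin_path=None, stdout_path=None):
    # Run one GloVe tool, redirecting stdin/stdout to files when given.
    fin = open(stdin_path, "rb") if stdin_path else None
    fout = open(stdout_path, "wb") if stdout_path else None
    try:
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)
    finally:
        if fin:
            fin.close()
        if fout:
            fout.close()

CORPUS = "counts.txt"

# 1. vocab_count: count word frequencies -> vocab.txt
run(["build/vocab_count", "-min-count", "5", "-verbose", "2"],
    stdin_path=CORPUS, stdout_path="vocab.txt")

# 2. cooccur: build the word-word co-occurrence counts -> cooccurrence.bin
run(["build/cooccur", "-memory", "4.0", "-vocab-file", "vocab.txt",
     "-verbose", "2", "-window-size", "15"],
    stdin_path=CORPUS, stdout_path="cooccurrence.bin")

# 3. shuffle: shuffle the co-occurrence records -> cooccurrence.shuf.bin
run(["build/shuffle", "-memory", "4.0", "-verbose", "2"],
    stdin_path="cooccurrence.bin", stdout_path="cooccurrence.shuf.bin")

# 4. glove: train the vectors -> vectors.txt / vectors.bin
run(["build/glove", "-save-file", "vectors", "-threads", "8",
     "-input-file", "cooccurrence.shuf.bin", "-x-max", "10", "-iter", "15",
     "-vector-size", "300", "-binary", "2", "-vocab-file", "vocab.txt",
     "-verbose", "2"])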
When training finishes, three output files are produced:
vocab.txt: each line is a word and its frequency in the corpus.
vectors.txt: the trained word vectors we need, one word per line followed by its vector values.
vectors.bin: the same vectors saved in binary form (because BINARY=2).
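A minimal sketch of loading vectors.txt in Python, assuming the usual GloVe text format (one word per line followed by its space-separated vector components); the word pair at the end is hypothetical and should be replaced with words from your own vocabulary:

import numpy as np

def load_glove_vectors(path="vectors.txt"):
    # Parse the text output into a dict of word -> numpy vector.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = load_glove_vectors()
print(len(vecs), "words loaded")

w1, w2 = "meeting", "notice"   # hypothetical words; use words that exist in your corpus
if w1 in vecs and w2 in vecs:
    print(w1, w2, cosine(vecs[w1], vecs[w2]))

If you prefer gensim, recent versions (4.x) can also read this file directly, e.g. KeyedVectors.load_word2vec_format("vectors.txt", binary=False, no_header=True).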