To put this huge corpus to further use, we plan to train an LSTM-RNN on it. The first step is to transform the Chinese corpus into a vector form that the algorithm can work with. The most widely used word embedding tool is word2vec. In this section I will show how I generated word vectors from a corpus of 30 million movie and TV series subtitle lines.
Segmenting the corpus
The input to word2vec must be a text file of space-separated words, but our movie and TV subtitle corpus consists of complete sentences separated by line breaks, so we first have to segment it. For the Chinese word segmentation method, see “Full Stack Developer 34 — Efficient Chinese Word Segmentation based on Python”. To segment the subtitle corpus, create a word_segment.py file with the following contents:
# coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import jieba


def segment(input, output):
    input_file = open(input, "r")
    output_file = open(output, "w")
    while True:
        line = input_file.readline()
        if line:
            line = line.strip()
            # cut the sentence into words with jieba
            seg_list = jieba.cut(line)
            segments = ""
            for word in seg_list:
                segments = segments + " " + word
            # write out the words, separated by spaces
            output_file.write(segments)
        else:
            break
    input_file.close()
    output_file.close()


if __name__ == '__main__':
    if 3 != len(sys.argv):
        print "Usage: ", sys.argv[0], "input output"
        sys.exit(-1)
    segment(sys.argv[1], sys.argv[2])
Usage:
python word_segment.py subtitle/raw_subtitles/subtitle.corpus segment_result
The content of the generated segment_result file looks like this:
... I'm sorry I'm not forthright I can only say it in dreams Before I lose my mind Now I just wanna see you crying in the moonlight at night I can't call you in the night what do I do with my heart like a kaleidoscope like the moon's light Led me to meet you countless times the twinkling stars of the moment to tell the direction of love are also born on earth miracle romance I believe in miracle romance where is the bunny where did you find it nowhere to see it By the enemy... The birth of Black Lady The unpleasant memories of the past will leave scars deep in your heart. Think of your wicked mother and your cruel father. The frog will fall Mom can cry the most hate dad pull me up stand up don't reach out to you parents is not love you evidence is you fall down not good stand up quickly recall those more hateful things how a person standing in a daze today is me What's your birthday? Yeah, but Dad...
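To see concretely what the segmenter produces for a single sentence, here is a one-line illustration using jieba's own demo sentence (not taken from our corpus):

# coding:utf-8
import jieba

# jieba.cut() returns a generator of words; join them with spaces
print(" ".join(jieba.cut(u"我来到北京清华大学")))
# -> 我 来到 北京 清华大学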
Let's take a look at the file:
[root@centos $] ls -lh segment_result
-rw-r--r--  1 lichuang  staff  1.1G 10  9 18:59 segment_result
[root@centos $] wc segment_result
0 191925623 1093268485 segment_result
wc reports 0 lines because the file has no trailing line break; it contains 191,925,623 words and 1,093,268,485 bytes.
Generating word vectors with word2vec
For background on word2vec, see “Do it yourself chatbot 25 — Google text mining deep learning tool word2vec implementation principle”. If you cannot download word2vec because of the firewall, you can get a copy from github.com/warmheartli… and run make; the build produces several binaries, of which we will use word2vec:
./word2vec -train ../segment_result -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
Starting training using file ../segment_result
Vocab size: 260499
Words in train file: 191353657
Alpha: 0.039254  Progress: 21.50%  Words/thread/sec: 96.67k
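As an aside, the same training can also be driven from Python with the gensim library instead of the C binary. The following is only a rough sketch assuming a recent gensim (4.x parameter names) is installed; each parameter mirrors one of the flags above. Note that gensim's corpus_file reader expects one sentence per line, so word_segment.py would have to write a line break after each sentence for this to work:

# A minimal gensim sketch (assumes gensim 4.x; not part of the original workflow)
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="segment_result",  # space-separated words, one sentence per line
    vector_size=200,               # -size 200
    window=8,                      # -window 8
    negative=25,                   # -negative 25
    hs=0,                          # -hs 0
    sample=1e-4,                   # -sample 1e-4
    workers=20,                    # -threads 20
    sg=0,                          # CBOW, i.e. -cbow 1
    epochs=15,                     # -iter 15
)
# save in the same binary format the C tool produces
model.wv.save_word2vec_format("vectors.bin", binary=True)
# model.wv.most_similar(u"...") would replicate the distance tool shown below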
vectors.bin contains the word vectors we want, but in binary form; we can use word2vec's built-in distance tool to sanity-check them:
./distance vectors.bin
Enter word or sentence (EXIT to break): ...
Position in vocabulary: 722

        Word       Cosine distance
----------------------------------------
   beautiful       0.532610
       short       0.440603
      figure       0.430269
   Beautiful       0.413831
         bar       0.410241
    handsome       0.409414
         job       0.407550
        good       0.402978
        good       0.401329
        cute       0.399667
        kiwi       0.391512
        good       0.388109
       match       0.387999
     awesome       0.387924
     Awesome       0.384184
     awesome       0.377484
...
Word vector binary file format and loading
The binary format of the word vector file generated by word2vec looks like this:
<word count> <vector dimension>
<word 1> <vector 1: 200 * sizeof(float) bytes>
<word 2> <vector 2: 200 * sizeof(float) bytes>
...
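As a quick sanity check of this layout, a few lines of Python suffice to read just the text header (a minimal sketch of my own, assuming vectors.bin is in the current directory):

# read only the first line of vectors.bin: "<word count> <vector dimension>"
input_file = open("vectors.bin", "rb")
words_and_size = input_file.readline().strip()
print "words =", int(words_and_size.split(' ')[0])
print "size =", int(words_and_size.split(' ')[1])
input_file.close()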
So I wrote a Python script that loads the whole word vector binary file into memory; we will use it again later. Its contents are as follows:
# coding:utf-8
import sys
import struct
import numpy as np
reload(sys)
sys.setdefaultencoding("utf-8")

max_w = 50       # maximum word length in bytes
float_size = 4   # sizeof(float)


def load_vectors(input):
    print "begin load vectors"

    input_file = open(input, "rb")

    # read the vocabulary size and the vector dimension from the header
    words_and_size = input_file.readline()
    words_and_size = words_and_size.strip()
    words = long(words_and_size.split(' ')[0])
    size = long(words_and_size.split(' ')[1])
    print "words =", words
    print "size =", size

    word_vector = {}

    for b in range(0, words):
        a = 0
        word = ""
        # read one word, byte by byte, up to the separating space
        while True:
            c = input_file.read(1)
            word = word + c
            if c == "" or c == " ":
                break
            if a < max_w and c != "\n":
                a = a + 1
        word = word.strip()

        # read the word's vector: size floats of float_size bytes each
        vector = np.empty([size])
        for index in range(0, size):
            m = input_file.read(float_size)
            (weight,) = struct.unpack('f', m)
            vector[index] = weight

        # store the word and its vector in a dict
        word_vector[word.decode('utf-8')] = vector

    input_file.close()
    print "load vectors finish"
    return word_vector


if __name__ == '__main__':
    if 2 != len(sys.argv):
        print "Usage: ", sys.argv[0], "vectors.bin"
        sys.exit(-1)
    d = load_vectors(sys.argv[1])
    print d[u'真']  # print one word's vector as a check
Run it like this:
python word_vectors_loader.py vectors.bin
The output looks like this:
begin load vectors
words = 49804
size = 200
load vectors finish
[-1.09570336 2.03501272 0.3151325 0.17603125 0.30261561 0.15273243 -0.6409803 0.06317
 0.20631203 0.22687016 0.59229285 -1.10883808 1.12569952 0.16838464 1.27895844 -1.18480754
 1.6270808 -2.62790298 0.43835989 -0.21364243 0.05743926 -0.77541786 -0.19709823 0.33360079
 0.43415883 -1.28643405 -0.95402282 0.01350032 -0.20490573 0.80880177 -1.47243023 -0.09673293
 0.05514769 1.00915158 -0.11268988 0.68446255 0.08493964 0.27009442 0.33748865 -0.03105624
 -0.19079798 0.46264866 -0.53616458 -0.35288206 0.76765436 -1.0328685 0.92285776 -0.97560757
 0.5561474 -0.05574715 -0.1951212 0.5258466 -0.07396954 1.42198348 1.12321162 0.03646624
 -1.54316568 0.34798017 0.64197171 -0.57232529 0.14402699 1.75856864 -0.72602183 -1.37281013
 0.73600221 0.4458617 -1.32631493 0.25921029 -0.97459841 -1.4394536 0.18724895 -0.74114919
 1.50315142 0.56819481 0.37238419 -0.0501433 0.36490002 -0.14456141 -0.15503241 -0.04504468
 1.18127966 1.465729 -0.13834922 -0.1232961 -0.14927664 0.67862391 2.46567917 -1.10682511
 0.71275675 1.04118025 0.23883103 -1.99175942 0.40641201 0.73883104 -0.37824577 0.88882846
 0.87234962 0.71112823 0.33647302 -1.2701565 -1.15415645 1.41575384 -2.01556969 -0.85669023
 -0.0378141 -0.60975027 0.0738821 0.19649875 0.02519603 -0.78310513 0.40809572 0.55079561
 1.79861426 -0.01188554 0.14823757 -0.97098011 -2.75159121 1.52366722 -0.41585007 0.78664345
 0.43792239 1.03834045 1.18758595 0.18793568 -1.44434023 -1.55205989 0.24251698 1.05706048
 -1.52376628 -0.60226047 -0.41849345 -0.30082899 -1.32461691 0.29701442 0.36680841 -0.72046149
 0.16455257 -0.02307599 -0.74143982 0.10319671 -0.5436908 -0.85527682 -0.81110024 -1.14968359
 -1.45617366 0.57568634 -1.10673392 -0.48830599 1.38728273 -0.46238521 1.40288961 -0.92997569
 0.90154368 0.09381612 -0.61220604 -0.40820527 1.2660408 -1.02075434 0.98662543 0.81696391
 0.06962785 0.83282673 -0.12462004 1.16540051 0.10254569 1.03875697 0.05073663 1.50608146
 0.49252063 0.09693919 0.38897502 -0.0673333 -0.30629408 -2.1759603 0.5477249 -1.46633601
 1.54695141 -0.83080739 -0.49649978 1.05921662 -0.60124737 -0.72645563 -1.44115663 -0.6903789
 0.38817915 -0.11854757 0.18087701 -0.41152322 -0.98559368 -1.46712041 1.63777673 -0.64418262
 -0.56800991 1.79656076 -0.80431151 0.99533188 0.06813133 -0.73489577 -0.67567319 0.64855355]
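With the vectors loaded into a dict, replicating what the distance tool did takes only a few lines of numpy. This is a minimal sketch of my own, assuming load_vectors from the script above is importable and that the query word (here u'漂亮', a hypothetical example) exists in the vocabulary:

# coding:utf-8
import numpy as np
from word_vectors_loader import load_vectors  # the loader script above


def nearest(word_vector, query, topn=10):
    # rank every word by cosine similarity to the query word's vector
    q = word_vector[query]
    q = q / np.linalg.norm(q)
    scored = []
    for w, v in word_vector.items():
        if w == query:
            continue
        scored.append((float(np.dot(q, v / np.linalg.norm(v))), w))
    scored.sort(reverse=True)
    return scored[:topn]


if __name__ == '__main__':
    d = load_vectors("vectors.bin")
    for score, w in nearest(d, u'漂亮'):  # hypothetical query word
        print w, score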
Conclusion
In most of the examples above I only used part of the corpus to keep running time down, so the numbers may differ when you run this yourself. We have now successfully generated word vectors from the movie and TV series subtitle corpus and loaded them in Python. How to put them to use is the topic of the next article, so stay tuned.