I am taking part in the Mid-Autumn Festival Creative Submission Contest. For details, see: Mid-Autumn Festival Creative Submission Contest
Appreciating poetry with Word2Vec
Word2Vec is probably familiar to everyone by now, but it is usually trained on articles. Because I am fond of poetry, I trained Word2Vec on more than fifty thousand Tang poems and two hundred and sixty thousand Song ci (lyric poems) instead. The process is as follows:
- Keep only the body of each poem or ci, removing irrelevant content such as the title, author, and notes.
- Since many of the words in poems are poetic images, use jieba to segment every poem or ci into words.
- Write each poem or ci as one line, with the words separated by spaces, and append it to the song_tang.txt file.
- Train Word2Vec from Gensim for 100 epochs to obtain the model.
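The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the function name `write_corpus`, the `segment` parameter, and the sample input are assumptions made for the example. In practice `segment` would be a word segmenter such as `jieba.lcut`; the default here simply splits into characters so the sketch runs without extra dependencies.

```python
from typing import Callable, Iterable, List


def write_corpus(
    poems: Iterable[str],
    path: str,
    segment: Callable[[str], List[str]] = list,
) -> int:
    """Append one space-separated line per poem to `path`.

    `segment` turns a poem body into tokens. Passing `jieba.lcut` would give
    word-level tokens; the default `list` splits into single characters and
    is only a stand-in so that this sketch has no third-party dependency.
    Returns the number of lines written.
    """
    written = 0
    with open(path, "a", encoding="utf-8") as f:
        for body in poems:
            # Drop whitespace-only tokens so tokens are separated by one space.
            tokens = [t for t in segment(body.strip()) if t.strip()]
            if not tokens:
                continue  # skip empty bodies
            f.write(" ".join(tokens) + "\n")
            written += 1
    return written


if __name__ == "__main__":
    # Hypothetical sample input: poem bodies with title and author already removed.
    count = write_corpus(["白日依山尽，黄河入海流。"], "song_tang.txt")
    print(count, "line(s) appended")
```

The resulting file has the one-poem-per-line, space-separated layout that `LineSentence` expects.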
The training code is simple:
```python
# -*- coding: utf-8 -*-
import logging

from gensim.models import word2vec


def main():
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    # Each line of song_tang.txt is one poem, already split into space-separated words
    sentences = word2vec.LineSentence("song_tang.txt")
    model = word2vec.Word2Vec(sentences, vector_size=250, epochs=100)
    # Save the model for later use
    model.save("word2vec.model")


if __name__ == "__main__":
    main()
```
The test code is also simple:
```python
# -*- coding: utf-8 -*-
from gensim import models


def main():
    model = models.Word2Vec.load('word2vec.model')
    topn = 5
    try:
        # The query words are given here in English translation; the trained
        # model is keyed by the Chinese tokens of the corpus.
        for word in ['moon', 'toad', "Chang'e", 'rabbit', 'Mid-Autumn',
                     'moon-viewing', 'osmanthus', 'lantern', 'mooncake', 'osmanthus']:
            q_list = word.split()
            if len(q_list) == 1:
                print("similar words, top 5")
                res = model.wv.most_similar(q_list[0], topn=topn)
                for item in res:
                    print(word, '-->', item[0] + "," + str(item[1]))
    except Exception as e:
        # A word missing from the vocabulary raises KeyError and ends the loop
        print(repr(e))


if __name__ == "__main__":
    main()
```
I mainly picked common images associated with the Mid-Autumn Festival: the moon, toad, Chang'e, rabbit, Mid-Autumn, moon-viewing, osmanthus, and lanterns. I also wanted to test "moon cake", but segmenting this large poetry collection never produced the word "moon cake". Did the ancients call moon cakes by another name? If so, I could win a Nobel Prize for the discovery. The test output is below; judging from the results, the similar words are basically close in imagery:
```
moon --> shadow,0.5359682440757751
moon --> snow,0.5294367074966431
moon --> sun,0.5167578458786011
moon --> moon,0.5035041570663452

toad --> toad,0.4766398072242737
toad --> E,0.4235338866710663

Chang'e --> Chang e,0.5356565117835999
Chang'e --> er,0.44325411319732666
Chang'e --> chan Juan,0.44045501947402954
Chang'e --> chang,0.4265734851360321

similar words, top 5
rabbit --> deer,0.46078577637672424
rabbit --> mouse,0.42531102895736694
rabbit --> fox,0.4184684455394745
rabbit --> animal,0.4146558344364166
rabbit --> Rabbit,0.4029492483950

Mid-Autumn --> Chung Yang,0.5198861360549927
Mid-Autumn --> tonight,0.5187360048294067
Mid-Autumn --> Qing Autumn,0.5057112574577332
Mid-Autumn --> This evening,0.4786747694015503
Mid-Autumn --> Yuanxiao,0.478258341550827

similar words, top 5
moon-viewing --> Yin wind,0.42945602536201477
moon-viewing --> I do not know what evening,0.40893059968948364
moon-viewing --> Go,0.4009520709514618
moon-viewing --> ever,0.3729349374771118
moon-viewing --> three kingdoms,0.372023066139221

similar words, top 5
osmanthus --> Osmanthus,0.46107402443885803
osmanthus --> Osmanthus,0.44184908270835876
osmanthus --> Chrysanthemum,0.4242440164089203
osmanthus --> Osmanthus fragrans floating,0.40674737095832825
osmanthus --> Osmanthus branches,0.3999846577644348

similar words, top 5
lantern --> Ingenious,0.48819929361343384
lantern --> Xiyi,0.48614564538002014
lantern --> Secret note,0.450831800699234
lantern --> Weixiao,0.4482809007167816
lantern --> copper pot,0.4456891119480133
```
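The scores in this output are cosine similarities between word vectors, which is what gensim's `most_similar` ranks by. As a rough illustration of what such a query does, here is a small pure-Python sketch. The toy vectors and the helper names `cosine` and `top_similar` are made up for the example and are not part of gensim. It also shows one way to handle a word that is missing from the vocabulary, as happened with "moon cake".

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_similar(
    vectors: Dict[str, Sequence[float]], word: str, topn: int = 5
) -> List[Tuple[str, float]]:
    """Return up to `topn` (word, score) pairs, most similar first.

    Returns an empty list when `word` is not in the vocabulary.
    """
    if word not in vectors:
        return []
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-dimensional vectors, invented purely for illustration.
toy_vectors = {
    "moon": [1.0, 0.0],
    "shadow": [0.9, 0.1],
    "snow": [0.6, 0.8],
    "lantern": [0.0, 1.0],
}

for neighbour, score in top_similar(toy_vectors, "moon", topn=2):
    print("moon", "-->", f"{neighbour},{score}")
print(top_similar(toy_vectors, "moon cake"))  # a missing word gives an empty list
```

Checking for the word first, instead of wrapping the whole loop in one `try`, lets the remaining query words still run after a miss.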
Writing poems with an LSTM
Having artificial intelligence write poems is routine by now. An LSTM captures the semantic relationships in context well, which suits short, concise texts like poems. If you are not familiar with this architecture, you can read my earlier hardcore explanation of LSTM basics. Here I only briefly introduce the data processing; model building and training are basically standard and are not described here.
Text is data with strong sequential structure, and so is poetry. Each poem is made up of characters rich in meaning. Given the current character, the model has to predict the next one, and this determines the input format. For example:
```
^白日依山尽，黄河入海流。$
(The sun sets behind the mountains; the Yellow River flows into the sea.)

We set the maximum poem length to 20 and use "*" as the padding symbol to fill the string up to that length:
^白日依山尽，黄河入海流。$******

Because the input at the current step is used to predict the output at the next step, the target is the string above shifted by one position, that is, y[:-1] = x[1:]
target: 白日依山尽，黄河入海流。$*******

Finally, a dictionary maps each character of the strings above to a number, which completes the data preprocessing.
```
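The padding, shifting, and dictionary lookup described above might look like this in code. This is a sketch under assumptions: the helper names, the way the vocabulary is built, and the choice to fill the last target slot with one extra "*" are mine and are not taken from the original training code.

```python
from typing import Dict, List, Tuple

START, END, PAD = "^", "$", "*"
MAX_LEN = 20  # maximum poem length used in the text


def make_example(poem: str, max_len: int = MAX_LEN) -> Tuple[str, str]:
    """Build the (input, target) strings for one poem.

    The input is "^" + poem + "$", cut to max_len and padded with "*".
    The target is the input shifted by one position, so y[:-1] == x[1:];
    the last target slot is filled with the padding symbol.
    Note: a poem longer than max_len - 2 loses its tail, including "$".
    """
    x = (START + poem + END)[:max_len].ljust(max_len, PAD)
    y = x[1:] + PAD
    return x, y


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Map every distinct character to an integer id (sorted for stability)."""
    chars = sorted(set("".join(texts)))
    return {ch: i for i, ch in enumerate(chars)}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Replace each character by its id."""
    return [vocab[ch] for ch in text]


x, y = make_example("白日依山尽，黄河入海流。")
vocab = build_vocab([x, y])
x_ids, y_ids = encode(x, vocab), encode(y, vocab)
print(x)
print(y)
print(x_ids)
print(y_ids)
```

In a real pipeline the vocabulary would be built once over the whole corpus, not from a single example as done here.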
Once the model is trained, using it to write a poem works much like the above. At each step, the character and the state from the previous step are fed in to predict the current character, until a stopping condition is met, such as reaching the desired number of characters or predicting a period.
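A generation loop of this kind can be sketched as below. To keep the example self-contained, a tiny hand-written lookup table stands in for the trained LSTM: `toy_step` is a hypothetical placeholder for one forward step of the model, and the `state` it passes along is a dummy counter, not a real hidden state. The end marker "$" is used as the stopping symbol here, which is an assumption. Only the loop structure and the two stopping conditions reflect the description above.

```python
from typing import Callable, List, Tuple

START, END = "^", "$"

# Stand-in for a trained model: maps the previous character to the next one.
# A real LSTM would output a probability distribution to sample from.
TOY_TABLE = {"^": "明", "明": "月", "月": "照", "照": "人", "人": "。", "。": "$"}


def toy_step(prev_char: str, state: int) -> Tuple[str, int]:
    """Hypothetical single step: (previous char, state) -> (next char, new state)."""
    return TOY_TABLE.get(prev_char, END), state + 1


def generate(
    step: Callable[[str, int], Tuple[str, int]],
    max_chars: int = 20,
    init_state: int = 0,
) -> str:
    """Generate characters until the end marker appears or max_chars is reached."""
    chars: List[str] = []
    prev, state = START, init_state
    while len(chars) < max_chars:  # stopping condition: enough characters
        nxt, state = step(prev, state)
        if nxt == END:  # stopping condition: end marker predicted
            break
        chars.append(nxt)
        prev = nxt
    return "".join(chars)


print(generate(toy_step))
print(generate(toy_step, max_chars=3))
```

Swapping `toy_step` for a function that runs the real model and samples from its output distribution would leave the loop itself unchanged.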
Now let's try writing some poems, mainly using common characters and themes of the Mid-Autumn Festival. To be honest, the results feel more like artificial stupidity than artificial intelligence, but this is a first step, and the quality can be improved from here. After all, the model is rather simple. This is just to suit the occasion and add a bit of fun to the Mid-Autumn Festival.
```
Chang 'e is a star, lam calendar seven years.
Mercedes 1000 minutes, under the moon teng wings.

Generated acrostic poem --> Wu Gang cut the GUI
Wu turn tang Huang view, just town selection door.
Felling lack of life in the sky, GUI under the emperor.

Generated acrostic poem --> Mid Autumn reunion
atrium peep build by laying bricks or stones, autumn wind shade green.
Reunion fan failure, clanking auspicious rain.

Mid Autumn Festival reunion and family joy
In the Central Plains wind tightly moving Ming, autumn tree three rivers lead the way long.
Group warbler cover shadow looking for fairy book, round and bright moon according to the frost round.
He chapter close in only before and after, home will be qi retreat trapped float.
Joy dream diaphragm drum, music bird bi Yun empty.

Chang 'e run to the moon wide cold chengxian
Chang 'e how much worship, mofrost 500 branches.
Run lang to avoid the ground, the moon wine move danfan.
Broad Hao Qing cloud table, cold light wind.
Cheng Zi plain wing, xian Dian is still.
```
To be continued…
Acknowledgments
- The poems were collected from this repository. Thank you for your hard work: github.com/chinese-poe…
- For writing poems with an LSTM, see this repository. Its code is concise and clear, and I strongly recommend it: github.com/wzyonggege/…