There is a web novel that has caught fire all over the country: a huge cast of characters, three parallel story lines (the real world, the online game world, and the professional league) that interact richly and intersect closely. It is known in web-fiction circles as the "CP encyclopedia": "The Full-time Master" (also translated as "The King's Avatar")!

Of course, a plot this rich does not come short: a full 5,000,000 characters, enough to daunt many a reader. So, to help everyone sort out the character relationships, I decided to bring in some advanced technology: using text mining to extract the content of the novel and pick out the good stuff!! ~

Tool preparation: jieba segmentation

To extract the keywords from the novel, we need a tool: the jieba segmentation library ("jieba" is Chinese for "stutter", quite the name for a word splitter, huh? ~). Let's see how to use it ~
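jieba is published on PyPI, so in a typical Python environment it can be installed with pip; a minimal sanity check (printing the version is just to confirm the install worked):

# pip install jieba
import jieba
print(jieba.__version__)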


I. Import jieba, then write a sentence to use as material for the segmentation.

import jieba
txt = 'In high summer, air conditioner on, eating watermelon, scrolling WeChat, knocking code, how sour-cool it is!'

II. Use jieba's cut() to segment the sentence; it returns a generator. Anything that is a generator can be read by simple traversal.

txt_cut = jieba.cut(txt)
print(txt_cut)
# <generator object Tokenizer.cut at 0x...>

III. Join the segmented words with slashes and see what we get.

result = '/'.join(txt_cut)  # str.join() concatenates the elements of a sequence into a new string
print(result)

Er... it's not bad, but some words that should not have been split got split apart anyway. What can we do about that?

IV. Manually add words to the dictionary to improve segmentation accuracy. The method is jieba.add_word().

# add custom words to the dictionary
jieba.add_word(word='high summer')
jieba.add_word(word='air conditioner')
jieba.add_word(word='watermelon')
jieba.add_word(word='sour-cool')

V. Now that these words are in the dictionary, they can no longer be split. Let's see the effect:

txt_cut = jieba.cut(txt)
result2 = '/'.join(txt_cut)
print(result2)
# In/high summer/,/air conditioner/on/,/eating/watermelon/,/scrolling/WeChat/,/knocking/code/,/how/sour-cool/it/is/!

Much better: every word now comes out properly refined. However, the sentence still contains elements I don't need in the analysis, such as punctuation, prepositions, and particles, so those have to be eliminated.

VI. Write a conditional filter to delete the unwanted words. Works perfectly!

txt_cut = jieba.cut(txt)
result3 = [w for w in txt_cut if w not in ['In', 'on', 'how', 'it', 'is', ' ', ',', '!']]
print(result3)
# ['high summer', 'air conditioner', 'eating', 'watermelon', 'scrolling', 'WeChat', 'knocking', 'code', 'sour-cool']

OK, that covers the basic usage ~ now let's put it into practice.

Hands-on practice: preparing the analysis materials

"The Full-time Master" is one of the hottest IPs in web fiction!

I. The original text:

To remove unnecessary words from the analysis, I found this online:

II. A stop word list

In information retrieval, to save storage space and improve search efficiency, certain words are automatically filtered out before or after natural-language data (or text) is processed. These filtered words are called stop words.

[Figure: excerpt of the stop word list]
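A stop word file is usually just one token per line, mixing punctuation with common function words. A hypothetical excerpt (the actual contents depend on the list you download):

,
。
的
了
是
在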


Every novel also has its own special vocabulary, and to improve segmentation accuracy we need a special thesaurus as well:

III. A special thesaurus for the novel

The Sogou cell thesaurus library offers a downloadable thesaurus matching the novel.

Address: pinyin.sogou.com/dict/


IV. A thesaurus converter, to turn the cell thesaurus into TXT format for easier processing.

With the original text, the stop word list, and the special thesaurus all ready, it's time to write some code! ~

Writing the code: computing keyword frequencies

I. Import the modules we will use

import numpy as np
import pandas as pd
import jieba
import wordcloud
from scipy.misc import imread  # removed in SciPy >= 1.2; newer environments can use imageio.imread instead
import matplotlib.pyplot as plt
from pylab import mpl
import seaborn as sns

mpl.rcParams['font.sans-serif'] = ['SimHei']  # specify a default font that can render Chinese
mpl.rcParams['axes.unicode_minus'] = False    # keep minus signs rendering correctly with this font


II. Import the stop words and convert them to a list.

stop_list = pd.read_csv('./stopwords.txt', engine='python',
                        encoding='utf-8', names=['t'])['t'].tolist()

III. Import the original text of the novel (this takes about 30 seconds to run; at 5,000,000 characters, the novel really is that long)

f = open('./std.txt', encoding='utf-8').read()

IV. Import the thesaurus dictionary. It contains far too many words for jieba.add_word(), so use the load_userdict() method instead

jieba.load_userdict('dict.txt')

V. Define a segmentation function and apply it to the novel

def txt_cut(f):
    # keep words that are not stop words and are longer than one character
    return [w for w in jieba.cut(f) if w not in stop_list and len(w) > 1]

txtcut = txt_cut(f)

VI. With segmentation done, take the top 20 words by frequency, run some simple statistics, and draw a bar chart for easy viewing

word_count = pd.Series(txtcut).value_counts().sort_values(ascending=False)[0:20]

fig = plt.figure(figsize=(15, 8))
x = word_count.index.tolist()
y = word_count.values.tolist()
sns.barplot(x=x, y=y, palette='BuPu_r')
plt.title('Word frequency Top 20')
plt.ylabel('count')
sns.despine(bottom=True)
plt.savefig('./word_frequency_top20.png', dpi=400)
plt.show()

The chart and table of the statistical results look roughly like this:

[Figure: bar chart of the top 20 word frequencies]

Notice that the word "Ye Xiu" tops the list with a frequency above 20,000, so we can infer that he is the hero of this book. We are also told this is a novel about a game, so "Jun Mo Xiao", the word ranked second, should be the protagonist's in-game character.

\

As an experienced reader, with these twenty high-frequency words in front of me I can already infer the plot of the novel! ~! Let me give it to you in one sentence:

Visualization: the word cloud

Now, just for fun, let's draw a word cloud of the high-frequency words.

I. Instantiate a WordCloud class and feed it the segmented words.

fig = plt.figure(figsize=(15, 5))
cloud = wordcloud.WordCloud(font_path='./fzstk.ttf',
                            mask=imread('./background.png'),
                            mode='RGBA',
                            background_color=None
                            ).generate(' '.join(txtcut))

Class attributes:

font_path: the font used for the words in the cloud. Chinese words need a Chinese font, otherwise they will display incorrectly. For font files, look through the fonts folder on your computer for one you like.

mask: sets the shape of the word cloud. You can supply a picture for this; it should be simple and consist mostly of solid areas. Sharpness doesn't matter, but the resolution must be high, otherwise the word cloud will come out blurry. The body of the image should not be pure white, because white is treated as background and ignored.

Here I imported the Logo of "The Full-time Master".

 

mode: the color mode; here we choose RGBA.

background_color: the background color.

II. Beautify the colors of the words

Here scipy and numpy come in. We import an image that supplies the color scheme; its size should be at least the resolution of the mask image. The computer maps colors from this image onto the word cloud. So this color image doesn't need to be sharp either, but its color saturation should be high and its contrast strong. That's what makes it work.

 

img = imread('./color.png')
cloud_colors = wordcloud.ImageColorGenerator(np.array(img))
cloud.recolor(color_func=cloud_colors)  # recolor the cloud using the reference image's colors

I chose a group portrait of the Full-time Master cast.

III. Call the matplotlib interface and apply some basic settings.

plt.imshow(cloud)
plt.axis('off')
plt.savefig('./wordcloud.png', dpi=400)
plt.show()

Drawing complete! Witness the miracle!

Relationship mining: the CP graph

Remember how I said at the beginning that "The Full-time Master" (The King's Avatar) is a CP encyclopedia? Below, let's dig deeper into the relationships and pair everyone up into CPs.

The mining logic is as follows:

Walk through the text line by line and extract the character names appearing in each line. If two characters appear in the same line, the intimacy between them goes up by 1. For example, if "Ye Xiu" and "Chen Guo" show up in the same line, their mutual intimacy count increases by 1. In the end, the pair with the highest intimacy is the best CP.

First of all, we need to improve the thesaurus so we can tell which words are people's names.

The dictionary structure is one word per line, each line holding [word, frequency, part of speech].

We don't need to worry about word frequency for the moment, so I just write a 1 instead. As for part of speech, I only need to identify person names, so the names in the novel are tagged nr and every other word is tagged n (nr stands for person name, n stands for noun; the meanings of the other tags can be found in a part-of-speech reference table).
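A hypothetical excerpt of what dict.txt might look like under this scheme, using the characters' original Chinese names (e.g. 叶修 for Ye Xiu); the real entries depend on the thesaurus you downloaded:

叶修 1 nr
苏沐橙 1 nr
君莫笑 1 nr
荣耀 1 n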

I. Create a script and import the modules.

import jieba, codecs
import jieba.posseg as pseg
import pandas as pd

codecs: used to read the text while guarding against inconsistent encodings that would otherwise cause errors;

jieba.posseg: returns each segmented word together with its part of speech.
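A minimal sketch of what posseg yields: each item is a pair object with .word and .flag attributes (the sentence here is made up, and the exact tags depend on the loaded dictionary):

import jieba.posseg as pseg

for pair in pseg.cut('叶修在打游戏'):  # hypothetical sentence: "Ye Xiu is playing the game"
    print(pair.word, pair.flag)       # prints each word with its tag, e.g. '叶修 nr'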


II. Create three containers for the transformed data.

names = {}          # person names and how often each occurs
relationships = {}  # co-occurrence counts between pairs of names
lineNames = []      # cache of the names appearing in each line


III. Import the thesaurus and the text

Walk through the text line by line and segment each line, excluding every word that is not tagged nr or whose length is greater than 3 or less than 2 (characters' names are usually two or three characters long).

IV. Add the words that meet the requirements into their respective containers

jieba.load_userdict('dict.txt')
with codecs.open('./std.txt', 'r', 'utf8') as f:
    n = 0
    for line in f.readlines():
        n += 1
        print('Processing line {}'.format(n))  # progress indicator
        poss = pseg.cut(line)
        lineNames.append([])
        for w in poss:
            if w.flag != 'nr' or len(w.word) < 2 or len(w.word) > 3:
                continue
            lineNames[-1].append(w.word)  # save the names extracted from each line as one group
            if names.get(w.word) is None:
                names[w.word] = 0
                relationships[w.word] = {}  # open a relationship dict for a newly seen name
            names[w.word] += 1  # count the name's occurrences


V. Walk through lineNames, matching up the names in each line to build the character relationships

for line in lineNames:
    for name1 in line:
        for name2 in line:
            if name1 == name2:
                continue
            if relationships[name1].get(name2) is None:
                relationships[name1][name2] = 1   # a new relationship gets a fresh key-value pair
            else:
                relationships[name1][name2] += 1  # an existing relationship gains 1 intimacy
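Note that this triple loop counts every ordered pair, so each co-occurrence is recorded twice, once in each direction. If you only want undirected counts, a variation with itertools.combinations (my own sketch, not the original script) does the same job in one pass:

from itertools import combinations

pair_counts = {}
for line in lineNames:
    # every unordered pair of names in a line gains 1 intimacy point
    for name1, name2 in combinations(sorted(set(line)), 2):
        pair_counts[(name1, name2)] = pair_counts.get((name1, name2), 0) + 1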


We will eventually visualize the relationships with Gephi, and Gephi requires data in a specific format to build a relationship network.

Gephi is an open-source, free, cross-platform, JVM-based network-analysis application. It is mainly used as an interactive tool for visualizing and exploring all kinds of networks and complex systems, including dynamic and hierarchical graphs.

VI. Construct two DataFrames, one for the node data and one for the edge data

node = pd.DataFrame(columns=['Id', 'Label', 'Weight'])
edge = pd.DataFrame(columns=['Source', 'Target', 'Weight'])

VII. Add the cleaned data into the DataFrames

for name, times in names.items():
    node.loc[len(node)] = [name, name, times]

for name, edges in relationships.items():
    for v, w in edges.items():
        if w > 3:  # keep only pairs that co-occur more than three times
            edge.loc[len(edge)] = [name, v, w]
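Appending row by row with .loc works, but it gets slow on large tables because each append reallocates. An equivalent, faster construction (my own variation) builds the rows first and creates each DataFrame in one go:

node_rows = [[name, name, times] for name, times in names.items()]
node = pd.DataFrame(node_rows, columns=['Id', 'Label', 'Weight'])

edge_rows = [[name, v, w] for name, edges in relationships.items()
             for v, w in edges.items() if w > 3]
edge = pd.DataFrame(edge_rows, columns=['Source', 'Target', 'Weight'])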

VIII. Export the data for later use

edge.to_csv('./edge_raw.csv', index=0)
node.to_csv('./node_raw.csv', index=0)
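For reference, the exported files are plain CSVs with exactly the headers Gephi's import expects; a hypothetical preview (the names and numbers here are made up):

# node_raw.csv
Id,Label,Weight
叶修,叶修,20000

# edge_raw.csv
Source,Target,Weight
叶修,陈果,500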


Once the script is written, run it. The novel has 206,436 lines, so the calculation takes some time.

We only want to analyze close CPs, so we keep only the edge data with a weight above 100. Filtered to this standard, the node data and edge data look like this:
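That filtering is easy to do in pandas before exporting; a small sketch (the threshold of 100 comes from the text, while dropping nodes that no longer appear in any edge is my own assumption):

edge_strong = edge[edge['Weight'] > 100]
keep = set(edge_strong['Source']) | set(edge_strong['Target'])
node_strong = node[node['Id'].isin(keep)]
edge_strong.to_csv('./edge.csv', index=0)
node_strong.to_csv('./node.csv', index=0)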


Finally, we import the data into Gephi.


Adjust the colors and the layout according to your own sense of beauty.

Okay, here’s the CP chart!


Uh huh! So much fun! The relationships are incredibly complicated! I can't resist spoiling a bit of the plot here!


Ye-Chen CP: Ye Xiu is the team captain, fully responsible for managing the team. Chen Guo is the club's boss lady, oozing dominance. Their relationship is rather like president and general manager. ~~ Ah, isn't this just "the domineering president falls in love with me"?

Ye-Tang CP: Tang Rou is the daughter of a wealthy family and a gifted girl, who unfortunately got led astray by Ye Xiu and became obsessed with the game. They are part mentor and apprentice, part teammates. Gee ~

Ye-Su CP: now this one is really something! Ye Xiu once had a sworn brother named Su Muqiu, and the two played the game together. Su Muqiu had a younger sister named Su Mucheng. Later, Su Muqiu died... What a tangled circle you people run in!

All right, that's enough spoilers. If you are interested, you can import more data and go on to study the colored clusters in the relationship network...

And that wraps up this share! I hope you all read fewer web novels and study more ~!
