
blmoistawinde, a graduate of a high school in the southwest, enjoys interesting data mining and analysis and hopes to bring some fresh air to the world.

Personal blog address: blog.csdn.net/blmoistawin…

Preface

I have always been interested in natural language processing, which helps us make discoveries from text, and in social network analysis, which helps us understand the pervasive web of connections between people and things. What happens when the two are combined?

As a fan of the Three Kingdoms, I had an idea: can I use text processing to extract the social network of the characters in Romance of the Three Kingdoms and then analyze it? Python has plenty of great tools for trying out curious ideas like this, so let’s get started.

Preparation

First, get the text of Romance of the Three Kingdoms.

from harvesttext.resources import get_sanguo

chapters = get_sanguo()        # list of texts, one element per chapter
print(chapters[0][:106])       # first 106 characters of Chapter 1

Chapter 1: Three heroes swear brotherhood at the Peach Garden feast; the heroes win their first merit slaying the Yellow Turbans. The rolling waters of the Yangtze flow east, their waves washing away the heroes. Right and wrong, success and failure, all turn to nothing in an instant. The green hills remain, while the sunset has glowed red so many times. The white-haired fisherman and woodcutter on the river islet are used to watching the autumn moon and the spring breeze. Over a pot of cloudy wine they rejoice at meeting. So many affairs of past and present, all end up in laughter and talk

Romance of the Three Kingdoms is not an easy text to process. It is written in a style close to classical Chinese, and we face a whole series of aliases, such as the courtesy names of historical figures. For example, how does the computer know that “Xuande” refers to “Liu Bei”? We have to give it some knowledge. We humans learn that Xuande is Liu Bei’s courtesy name, and a computer can make the connection in a similar way: by supplying a knowledge base, we tell it that “Liu Bei” is an entity (something like the standard name of an object) and that “Xuande” is a mention of “Liu Bei”.

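from harvesttext.resources import get_sanguo_entity_dict

entity_mention_dict, entity_type_dict = get_sanguo_entity_dict()
# The knowledge base is keyed by Chinese names; 刘备 is Liu Bei.
print("Liu Bei's mentions:", entity_mention_dict["刘备"])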

Liu Bei's mentions: ['Liu Bei', 'Liu Xuande', 'Xuande', 'Jian Jun']

Besides people and their mentions, the knowledge base can hold other kinds of entities, such as the factions of the Three Kingdoms: “Shu”, for example, is also called “Shu Han”. The knowledge base therefore also records the type of each entity so that they can be told apart.

print("Type of Liu Bei:", entity_type_dict["刘备"])
print("Type of Shu:", entity_type_dict["蜀"])          # 蜀 is Shu
print("Mentions of Shu:", entity_mention_dict["蜀"])

Type of Liu Bei: person name
Type of Shu: faction
Mentions of Shu: ['Shu', 'Shu Han']

Armed with this knowledge, we could in principle write a program that links aliases to entities. Doing so from scratch, however, takes a lot of work. HarvestText[1] is a text processing library that wraps these steps so that we can do this easily.

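from harvesttext import HarvestText

ht = HarvestText()
ht.add_entities(entity_mention_dict, entity_type_dict)   # load the knowledge base
# Input: "Having sworn the oath, they honored Xuande as elder brother, Guan Yu as second, and Zhang Fei as younger brother."
print(ht.seg("誓毕，拜玄德为兄，关羽次之，张飞为弟。", standard_name=True))

Output, with the tokens translated into English; “Xuande” has been replaced by the standard name “Liu Bei”:
['oath sworn', ',', 'honor', 'Liu Bei', 'as elder brother', ',', 'Guan Yu', 'comes next', ',', 'Zhang Fei', 'as younger brother', '.']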
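For intuition about what linking mentions to entities involves, here is a minimal sketch in plain Python. It is not how HarvestText works internally: the tiny knowledge base, the English names, and the helper functions `build_mention_index` and `normalize` are all made up for illustration, and the sketch ignores ambiguous mentions, which a real library has to resolve.

```python
import re

# Hypothetical mini knowledge base: standard entity name -> known mentions.
toy_mentions = {
    "Liu Bei": ["Liu Bei", "Liu Xuande", "Xuande"],
    "Guan Yu": ["Guan Yu", "Yunchang"],
}

def build_mention_index(mention_dict):
    """Invert the knowledge base into a mention -> standard name lookup."""
    return {
        mention: entity
        for entity, mentions in mention_dict.items()
        for mention in mentions
    }

def normalize(text, mention_dict):
    """Replace every known mention in `text` with its standard entity name."""
    index = build_mention_index(mention_dict)
    # Longer mentions come first so that "Liu Xuande" wins over "Xuande".
    ordered = sorted(index, key=len, reverse=True)
    pattern = "|".join(re.escape(mention) for mention in ordered)
    # A single regex pass avoids re-replacing text that was already normalized.
    return re.sub(pattern, lambda match: index[match.group(0)], text)

print(normalize("Xuande and Yunchang met Liu Xuande's envoy.", toy_mentions))
```

Even this toy version needs care with overlapping mentions, which hints at why starting from scratch means a lot of work.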

Social network

Having unified the mentions into standard entity names, we can start mining the social network of the Three Kingdoms. The approach is based on neighborhood co-occurrence: whenever a pair of entities appears together within two consecutive sentences, we add an edge between them. The whole network-building process looks like this:

HarvestText provides a function that does this directly. Let’s try it on the short text of Chapter 1:

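# Preparation
doc = chapters[0].replace("操", "曹操")       # the text sometimes abbreviates 曹操 (Cao Cao) to 操, so adjust for that
ch1_sentences = ht.cut_sentences(doc)        # split into sentences
# Join every pair of consecutive sentences
doc_ch01 = [ch1_sentences[i] + ch1_sentences[i + 1] for i in range(len(ch1_sentences) - 1)]
ht.set_linking_strategy("freq")

# Build the network over all person entities, i.e. the social network
G = ht.build_entity_graph(doc_ch01, used_types=["人名"])   # "人名" is the "person name" type

# Pick the main characters and draw them
important_nodes = [node for node in G.nodes if G.degree[node] >= 5]
G_sub = G.subgraph(important_nodes).copy()
# draw_graph is a plotting helper whose definition is not shown in this article
draw_graph(G_sub, alpha=0.5, node_scale=30, figsize=(6, 4))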
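To make the co-occurrence rule concrete, here is a small sketch in plain Python. It assumes the entities of each sentence have already been extracted and normalized; the function `cooccurrence_edges` and the toy data are hypothetical, and the exact weighting used by `build_entity_graph` may differ.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentence_entities):
    """Count entity pairs that co-occur within two consecutive sentences.

    `sentence_entities` holds, for each sentence, the standard names found in it.
    Returns a Counter mapping (name_a, name_b) pairs to edge weights.
    """
    weights = Counter()
    # Slide a window of two consecutive sentences over the text.
    for first, second in zip(sentence_entities, sentence_entities[1:]):
        window = sorted(set(first) | set(second))
        # Every pair of distinct entities in the window gets one more unit of weight.
        for u, v in combinations(window, 2):
            weights[(u, v)] += 1
    return weights

# Toy input (hypothetical): entities already extracted from four sentences.
toy_sentences = [
    ["Liu Bei", "Guan Yu"],
    ["Zhang Fei"],
    ["Liu Bei"],
    ["Cao Cao"],
]
toy_edges = cooccurrence_edges(toy_sentences)
for (u, v), weight in sorted(toy_edges.items()):
    print(u, "-", v, "weight", weight)
```

Pairs that keep appearing close together accumulate weight, so the edge weight serves as a rough measure of how closely two characters are tied in the text.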

What exactly are the relationships between them? We can use text summarization to get the gist of this chapter:

for i, doc in enumerate(ht.get_summary(doc_ch01, topK=3)):
    print(i, doc)

0 Song said, "Zhang Liang and Zhang Bao are at the end of their strength and will surely go to Guangzong to join Zhang Jiao.
1 At that time Zhang Jiao's rebels numbered 150,000 and Zhi's troops 50,000. Zhi told Xuande, "I have the rebels surrounded here, while the rebel leader's younger brothers Zhang Liang and Zhang Bao are in Yingchuan, facing Huangfu Song and Zhu Jun.
2 The next day, in the peach garden, they prepared sacrificial offerings of a black ox and a white horse, burned incense, bowed, and swore an oath, saying: "We, Liu Bei, Guan Yu, and Zhang Fei, though of different surnames, bind ourselves as brothers and will work together with one heart,

The main content of this chapter appears to be the Peach Garden oath of brotherhood sworn by Liu Bei, Guan Yu, and Zhang Fei, and their joint fight against the Yellow Turban rebels.

Drawing the network of the whole Three Kingdoms

Building on this small-scale exercise, we can apply the same method to every chapter, merge the results, and draw one big picture spanning the whole of the Three Kingdoms.

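import networkx as nx

G_chapters = []
for chapter in chapters:
    sentences = ht.cut_sentences(chapter)     # split into sentences
    docs = [sentences[i] + sentences[i + 1] for i in range(len(sentences) - 1)]
    G_chapters.append(ht.build_entity_graph(docs, used_types=["人名"]))   # "人名" is the "person name" type

# Merge the per-chapter graphs, summing the edge weights
G_global = nx.Graph()
for G0 in G_chapters:
    for (u, v) in G0.edges:
        if G_global.has_edge(u, v):
            G_global[u][v]["weight"] += G0[u][v]["weight"]
        else:
            G_global.add_edge(u, v, weight=G0[u][v]["weight"])

# Ignore small isolated branches and keep only the largest connected component
largest_comp = max(nx.connected_components(G_global), key=len)
G_global = G_global.subgraph(largest_comp).copy()
print(nx.info(G_global))   # nx.info was removed in NetworkX 3.0; print(G_global) gives a similar summary there

Name:
Type: Graph
Number of nodes: 1290
Number of edges: 10096
Average degree: 15.6527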

The whole social network has 1,290 characters and more than ten thousand edges! Drawing all of it is practically impossible, so let’s pick out the key figures and draw a subgraph.

important_nodes = [node for node in G_global.nodes if G_global.degree[node]>=30]
G_main = G_global.subgraph(important_nodes).copy()

Visualization with Pyecharts

from pyecharts import Graph   # pyecharts 0.5.x API; from 1.0 on the class lives in pyecharts.charts

nodes = [{"name": "node 1", "value": 0, "symbolSize": 10} for i in range(G_main.number_of_nodes())]
for i, name0 in enumerate(G_main.nodes):
    nodes[i]["name"] = name0
    nodes[i]["value"] = G_main.degree[name0]
    nodes[i]["symbolSize"] = G_main.degree[name0] / 10.0   # node size grows with degree
links = [{"source": "", "target": ""} for i in range(G_main.number_of_edges())]
for i, (u, v) in enumerate(G_main.edges):
    links[i]["source"] = u
    links[i]["target"] = v
    links[i]["value"] = G_main[u][v]["weight"]

graph = Graph()
graph.add("", nodes, links)
graph.render("./images/three_kingdoms_graph.html")
graph   # display the chart inline in a notebook

The blog cannot display interactive charts, so here is a screenshot showing Liu Bei’s neighboring nodes.

The whole network is intricate: behind the story of the Three Kingdoms lie countless campaigns north and south and endless intrigue. Still, with the power of the computer we can tease out some key threads, for example:

Character ranking: importance

To answer this question, we can use a ranking algorithm on the network. PageRank is the classic example: search engines use the links between websites to rank search results, and the same idea works for the links between people. Let’s get the top 20:

import pandas as pd
import matplotlib.pyplot as plt

page_ranks = pd.Series(nx.algorithms.pagerank(G_global)).sort_values()
page_ranks.tail(20).plot(kind="barh")
plt.show()

Even if you are not familiar with the Three Kingdoms, you have probably heard of these characters.
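For readers curious how PageRank arrives at such a ranking, here is a simplified sketch of the underlying power iteration on a tiny made-up network. It is an unweighted toy version written for illustration: the function `toy_pagerank` and the four-person graph are assumptions, whereas the ranking above comes from `networkx`, which also takes the edge weights into account.

```python
def toy_pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected, unweighted graph.

    `adjacency` maps each node to the set of its neighbours; every node is
    assumed to have at least one neighbour.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its current rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(adjacency[nb]) for nb in adjacency[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical toy network: Liu Bei is linked to everyone else.
toy_graph = {
    "Liu Bei": {"Guan Yu", "Zhang Fei", "Zhuge Liang"},
    "Guan Yu": {"Liu Bei", "Zhang Fei"},
    "Zhang Fei": {"Liu Bei", "Guan Yu"},
    "Zhuge Liang": {"Liu Bei"},
}
toy_ranks = toy_pagerank(toy_graph)
for name, score in sorted(toy_ranks.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

A character scores highly when many other well-connected characters link to them, which is why the protagonists float to the top.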

Character ranking: power

This question looks similar to the previous one, but there is a difference. Just as the most popular person is not necessarily the leader, the most powerful person is the one who acts as a unifying force at the center of the team and gets its members to work together. Betweenness centrality is one such indicator. Who are the most powerful people in the Three Kingdoms?

\

between = pd.Series(nx.betweenness_centrality(G_global)).sort_values()
between.tail(20).plot(kind="barh")
plt.show()
The result is indeed different from the previous ranking: lords such as Liu Bei, Cao Cao, Sun Quan, and Yuan Shao all rank near the top. Another interesting finding is that Sima Yi and his sons Sima Zhao and Sima Shi also make the list, while the other descendants of the Cao family are nowhere to be seen, which shows how the Sima clan’s power overshadowed the court. The Sima clan’s ambitions seem to have been revealed by big data just like that!
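The difference between being well connected and sitting on the paths between others can be seen on a toy example. The sketch below implements Brandes' algorithm for unweighted graphs in plain Python; the function `toy_betweenness` and the seven-node network (including the "Messenger" node) are invented for illustration, and unlike `nx.betweenness_centrality`, which normalizes by default, it returns raw path counts.

```python
from collections import deque

def toy_betweenness(adjacency):
    """Unnormalized betweenness centrality (Brandes' algorithm) for an
    undirected, unweighted graph given as node -> set of neighbours."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}     # number of shortest paths
        dist = {v: -1 for v in adjacency}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical network: two tight groups joined only through a messenger.
toy_edges_list = [
    ("Liu Bei", "Guan Yu"), ("Liu Bei", "Zhang Fei"), ("Guan Yu", "Zhang Fei"),
    ("Cao Cao", "Xiahou Dun"), ("Cao Cao", "Xu Chu"), ("Xiahou Dun", "Xu Chu"),
    ("Liu Bei", "Messenger"), ("Messenger", "Cao Cao"),
]
toy_adjacency = {}
for u, v in toy_edges_list:
    toy_adjacency.setdefault(u, set()).add(v)
    toy_adjacency.setdefault(v, set()).add(u)

toy_scores = toy_betweenness(toy_adjacency)
for name, value in sorted(toy_scores.items(), key=lambda item: -item[1]):
    print(f"{name}: betweenness {value:.1f}, degree {len(toy_adjacency[name])}")
```

In this toy network the messenger has fewer connections than Liu Bei or Cao Cao, yet every path between the two groups runs through it, so it receives the highest betweenness score.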

Community detection

People are connected to one another, so groups tend to form. Community detection algorithms from social network analysis let us discover these groups, so let’s use the algorithm provided by the community library[2] to uncover them.

from collections import defaultdict

import community                                  # python-louvain
partition = community.best_partition(G_main)      # partition into communities with the Louvain algorithm
comm_dict = defaultdict(list)
for person in partition:
    comm_dict[partition[person]].append(person)
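The Louvain algorithm searches greedily for a partition with high modularity, a score that rewards communities with many internal edges relative to what their total degree would suggest. For an unweighted graph with $m$ edges, where community $c$ has $L_c$ internal edges and total degree $d_c$, the score is $Q = \sum_c \left[ L_c/m - (d_c/2m)^2 \right]$. The sketch below only evaluates this formula on a toy graph; the function `toy_modularity` and the data are hypothetical, and the search itself is left to the library.

```python
def toy_modularity(edges, partition):
    """Modularity Q of a partition of an undirected, unweighted graph.

    `edges` is a list of (u, v) pairs; `partition` maps node -> community id.
    """
    m = len(edges)
    internal = {}   # community -> number of edges inside it
    degree = {}     # community -> total degree of its members
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )

# Hypothetical graph: two triangles joined by a single edge.
two_triangles = [
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
    ("a1", "b1"),
]
by_triangle = {"a1": 0, "a2": 0, "a3": 0, "b1": 1, "b2": 1, "b3": 1}
all_together = {node: 0 for node in by_triangle}

q_split = toy_modularity(two_triangles, by_triangle)
q_single = toy_modularity(two_triangles, all_together)
print(f"split by triangle: Q = {q_split:.3f}")
print(f"one big community: Q = {q_single:.3f}")
```

Splitting the graph along its natural groups scores higher than lumping everyone together, and that difference is the signal the algorithm follows.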

In the next three communities we mainly see the important ministers and generals of Wei, Shu, and Wu respectively. (There are only a few small “problems”. The interesting part is that the computer does not know which faction anyone belongs to; it found these groups through the algorithm alone.)

draw_community(2)

Community 2: Zhang Liao, Cao Ren, Xiahou Dun, Xu Huang, Cao Hong, Xiahou Yuan, Zhang He, Xu Chu, Yue Jin, Li Dian, Yu Jin, Xun Yu, Liu Ye, Guo Jia, Man Chong, Cheng Yu, Xun You, Lü Qian, Dian Wei, Wen Ping, Dong Zhao, Mao Jie

draw_community(4)

Community 4: Cao Zhuge Liang, Liu Bei, Guan Yu, Zhao Yun, Zhang Fei, Ma Chao, Huang Zhong, Xu Chang, Meng Da [Wei], Sun Gan, Cao Anmin, Liu Zhang, Guan Ping, Pang De, Fa Zheng, Yi Ji, Zhang Lu, Liu Feng, Meng Suo, Yan Ma, Liang Jian, Yong CAI, Tao Qian, Kong Rong, Liu Cong, Xia Hou, Liu Biaozi, Zhou Cang, Chen Deng

draw_community(3)

Community 3: Sun Quan, Sun Ce, Zhou Yu, Lu Xun, Lü Meng, Ding Feng, Zhou Tai, Cheng Pu, Han Dang, Xu Sheng, Zhang Zhao, Ma Xiang, Huang Gai [Wu], Pan Zhang, Gan Ning, Lu Su, Ling Tong, Taishi Ci, Zhuge Jin, Han Wujun, Jiang Qin, Huang Zu, Kan Ze, Zhu Huan, Chen Wu, Lü Fan

draw_community(0)

Community 0: Sun Jian, Guo Si, Chen Gong, Ma Teng, Yuan Shang, Han Sui, Gongsun Zan, Gao Shun, Xu You, Guo Tu, Yan Liang, Yang Feng, Zhang Xiu, Yuan Tan, Dong Cheng, Wen Chou, He Jin, Zhang Miao, Yuan Xi

There are other communities too. This one, for example, gathers the warlords of the early Three Kingdoms period, such as Sun Jian, Yuan Shao, and Dong Zhuo, who fought one another fiercely.

draw_community(1)

Community 1: Sima Yi, Wei Yan, Jiang Wei, Zhang Yi, Ma Dai, Liao Hua, Wu Yi, Sima Zhao, Guan Xing, Wu Ban, Wang Ping, Deng Zhi, Deng Ai, Zhang Bao, Qiao Zhou, Ma Shu, Cao Zhen, Li Hui, Huang Quan, Zhong Hui, Jiang Wan, Zhang Yi, Yang Hongxu, Fei Shi, Li Yan, Guo Huai, Cao Xiufan, Jian Qin, Mi, Xiahou Ba, Yang Yi, Gao Xiang, Zhang Nan [Wei], Hua Xin, Cao Shuang-Zhenzheng, Xu Yun [Wei], Wang Lang [Si Tu], Dong Jue, Du Qiong, Huo Jun, Hu Ji, Jia Chong, Peng Bi, Wu Lan, Zhuge Dan, Lei Tong, Sun 綝, Zhuo Ying, Fei Guan, Du Yi, Yan Yan, Shengbo, Liu Min-Liu, Du Qi, Shangguan, 雝, Ding Xian, Cuan, Fan Qi, Cao Fang, Zhou Qun

This community consists of the major figures of the late Three Kingdoms period. The story behind this network is that the three Simas, across two generations, defeated the Shu Han camp led by Jiang Wei, swept away the Cao clan’s forces in Wei, and finally reached the peak of power.

Dynamic network

Studying how a social network changes over time is an interesting task. Romance of the Three Kingdoms is narrated roughly in chronological order and covers a very long time span. How does the social network change as the storyline progresses?

Here I use a window of 10 chapters and record the social network within the current window every 5 chapters, which amounts to taking a snapshot. Stringing these snapshots together gives an animation of how the social network changes. We still use networkx for the snapshots, and moviepy for the animation.
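The windowing scheme can be sketched on its own, before any graphs are involved. The helper `chapter_windows` below is hypothetical, and the example assumes the text list contains all 120 chapters of the novel.

```python
def chapter_windows(num_chapters, width=10, step=5):
    """Return (start, end) chapter index pairs, end exclusive, one per snapshot."""
    return [
        (start, start + width)
        for start in range(0, num_chapters - width + 1, step)
    ]

# Assumption: the novel's 120 chapters are all present in the text list.
windows = chapter_windows(120)
print(len(windows), "snapshots")
for start, end in windows[:3]:
    # Indices are zero-based, so index `start` is chapter `start + 1`.
    print(f"chapters {start + 1} to {end}")
```

Because the step is half the window width, consecutive snapshots share five chapters, which keeps the animation from jumping too abruptly between frames.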


Let’s take a look at which group of people take center stage at each stage of the story.

import matplotlib.pyplot as plt
import moviepy.editor as mpy   # moviepy 1.x import path
from moviepy.video.io.bindings import mplfig_to_npimage

width, step = 10, 5                                      # window of 10 chapters, one snapshot every 5
range0 = range(0, len(G_chapters) - width + 1, step)
numFrame, fps = len(range0), 1
duration = numFrame / fps
pos_global = nx.spring_layout(G_main)                    # one fixed layout shared by all frames

def make_frame_mpl(t):
    i = step * int(t * fps)
    # Merge the chapter graphs inside the current window
    G_part = nx.Graph()
    for G0 in G_chapters[i:i + width]:
        for (u, v) in G0.edges:
            if G_part.has_edge(u, v):
                G_part[u][v]["weight"] += G0[u][v]["weight"]
            else:
                G_part.add_edge(u, v, weight=G0[u][v]["weight"])
    largest_comp = max(nx.connected_components(G_part), key=len)
    used_nodes = set(largest_comp) & set(G_main.nodes)
    G = G_part.subgraph(used_nodes)
    fig = plt.figure(figsize=(12, 8), dpi=100)
    nx.draw_networkx_nodes(G, pos_global, node_size=[G.degree[x] * 10 for x in G.nodes])
    # nx.draw_networkx_edges(G, pos_global)
    nx.draw_networkx_labels(G, pos_global)
    plt.xlim([-1, 1])
    plt.ylim([-1, 1])
    plt.axis("off")
    # The window G_chapters[i:i+width] covers chapters i+1 through i+width
    plt.title(f"Chapters {i + 1} to {i + width}")
    return mplfig_to_npimage(fig)

animation = mpy.VideoClip(make_frame_mpl, duration=duration)
animation.write_gif("./images/three_kingdoms_network_changes.gif", fps=fps)


For aesthetic reasons, the edges in the network are omitted from the animation.


As time goes by, people who once stood at the center of the stage of history will gradually leave. As it says at the beginning of Romance of The Three Kingdoms:

So many affairs of past and present, all end up in laughter and talk.

That’s it for today’s lighthearted talk about Python…


This article code address:

Github.com/blmoistawin…

Notes:

[1] HarvestText is my own work. It is open source on GitHub and can be installed directly with pip. It aims to make text data analysis like the one in this article easier. Besides the features used here, it also offers sentiment analysis, new word discovery, and more. If you think it is useful, why not give it a try and see whether you can make more interesting and useful findings on the texts you care about?

[2] The community library, whose package name is python-louvain, performs community detection with the same Louvain algorithm that is built into Gephi.

[3] Because classical Chinese is difficult to process, some obvious mistakes remain in this article. I hope you don’t mind.
