Natural language processing (NLP) has recently become a major branch of machine learning, and it comes with many challenges: how to segment words, recognize entities, identify the relationships between entities, display the relationship network, and so on.
I did a natural language analysis with Jieba + Word2Vec + NetworkX, using the novel "Heaven Leaning and Dragon Slaying" as the corpus. Many people have analyzed Jin Yong's wuxia novels before, so I hoped to bring something a little different. A few screenshots:
Similarity network linking all characters.
The same relationships as above, shown in a polycentric layout.
Network diagram centered on Zhang Wuji's different identities.
The main distinguishing points of this analysis are:
1. The similarity scores from Word2Vec are used later as the edge weights of the social network.
2. NetworkX is used for the network analysis and presentation.
Combining these two methods can significantly cut the time spent reading documents in everyday work: with machine learning, entity information can be extracted from articles semi-automatically from start to finish, saving a great deal of time and cost. This has use cases in many kinds of work; if you are interested, feel free to get in touch about collaborating.
Let’s start with what we can find with Word2Vec+NetworkX.
I. Analysis results
Different attributes of the same entity (Zhang Wuji's many aliases)
Zhang Wuji, Wuji, Master Zhang, Brother Wuji, Young Master Zhang: the same Zhang Wuji carries multiple identities, and each identity shows a different degree of similarity to different characters.
First, look at the picture:
"Brother Wuji" is too intimate a form of address, and hardly anyone uses it; the words most similar to it are mostly odd, unrelated characters.
"Wuji" is what equals or elders may call him once the relationship has grown close; nearby are Miss Zhou, Miss Yin, and so on.
"Zhang Wuji" is the everyday name: anyone can use it, and it sits closest to his other aliases.
"Young Master Zhang" is a polite, respectful form of address, used for example by the yellow-clad maiden, Ruyang Wang, and so on.
"Master Zhang" is a title: respectful but not familiar, and sometimes even hostile, as with Zhu Yuanzhang.
Note:
1. The graph is drawn by NetworkX based on the Word2Vec results; the descriptions above are my own manual analysis.
2. Zhao Min does not appear in the network diagram above: Word2Vec found the similarity between Zhang Wuji and Zhao Min to be rather low, which I did not expect. Thinking back to when I read the book, the two of them suddenly ending up together did feel abrupt. In the book they marry, but in reality a relationship like that would probably be rather precarious.
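As a rough illustration of how such comparisons can be queried (this is a minimal sketch, not the author's exact analysis code), a trained gensim Word2Vec model exposes similarity lookups directly. It assumes the model trained and saved later in this article (yttlj_model.txt); the character names are placeholders, and the actual tokens depend on the segmentation dictionary:

from gensim.models import word2vec

# load the model trained and saved in the training step below
model = word2vec.Word2Vec.load("yttlj_model.txt")

# words most similar to a given name
print(model.wv.most_similar("张无忌", topn=10))

# pairwise similarity between two names, e.g. Zhang Wuji and Zhao Min
print(model.wv.similarity("张无忌", "赵敏"))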
II. Implementation process
Main steps:
Prepare corpora
- The text file of the novel “Heaven Leaning and Dragon Slaying”
- Custom segmentation dictionary (the novel's character names; about 180 of them can be found online)
- Stop word list
Prepare the tools
- Python: Pandas, NumPy, SciPy
- Jieba (Chinese word segmentation)
- Word2Vec (word vectorization tool, used to calculate the similarity between words)
- NetworkX (network graph tool, used to display complex network relationships)
Data preprocessing
- Convert the text file to UTF-8 (pandas)
- Split the text file into sentences (Jieba)
- Sentence splitting, word segmentation, and part-of-speech tagging, mainly to find person names (Jieba; a short illustration follows the step list below)
- Update the custom dictionary and re-segment (this takes several passes until the result is satisfactory)
- A small amount of manual deletion (the misrecognition rate for names is not high, but mistakes do occur; for example, "Zhao Min smiled and said" can be recognized as a person named "Zhao Minxiao". This part has to be done by hand; unless a better, or trainable, segmentation tool comes along, the problem will remain)
Train the Word2Vec model. The model can calculate the similarity between any two characters.
- 300 dimensions
- Filter out words that occur fewer than 20 times
- Sliding window of 20
- Downsampling: 0.001
Generate entity relationship matrix.
- I couldn’t find a library online, so I wrote one myself.
- N × N dimensions, where N is the number of names.
- The Word2Vec model above is used to populate the entity relationship matrix
NetworkX generates the network diagram
- Nodes are character names
- Edges are the lines between two nodes, representing the relationship between two characters
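As a minimal sketch of the part-of-speech step mentioned in the preprocessing list (the author's full pipeline appears in Part III), jieba.posseg tags every token, and person names come back with the flag "nr". The sample sentence here is an arbitrary placeholder:

import jieba.posseg as posseg

sample = "张无忌回头看了赵敏一眼"  # any sentence from the novel would do
for word, flag in posseg.cut(sample):
    if flag == "nr":  # "nr" is jieba's part-of-speech tag for person names
        print(word, flag)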
III. Partial code implementation (space is limited; to get the complete code, follow the "Programming Dog" public account and reply 0321)
Initialization
import numpy as np
import pandas as pd
import jieba
import jieba.posseg as posseg
%matplotlib inline
Data segmentation, cleaning
renming_file = "yttlj_renming.csv"
jieba.load_userdict(renming_file)
stop_words_file = "stopwordshagongdakuozhan.txt"
stop_words = pd.read_csv(stop_words_file, header=None, quoting=3, sep="\t")[0].values
corpus = "yttlj.txt"
yttlj = pd.read_csv(corpus, encoding="gb18030", header=None, names=["sentence"])

def cut_join(s):
    # segment a sentence, drop stop words and single-character tokens, join with commas
    new_s = list(jieba.cut(s, cut_all=False))
    stop_words_extra = set([""])
    for seg in new_s:
        if len(seg) == 1:
            stop_words_extra.add(seg)
    new_s = set(new_s) - set(stop_words) - stop_words_extra
    result = ",".join(new_s)
    return result

def extract_name(s):
    # part-of-speech tagging; keep tokens longer than one character together with their flags
    new_s = posseg.cut(s)
    words = []
    flags = []
    for k, v in new_s:
        if len(k) > 1:
            words.append(k)
            flags.append(v)
    full_wf["word"].extend(words)
    full_wf["flag"].extend(flags)
    return len(words)

def check_nshow(x):
    # count how many times a name appears in the whole corpus
    nshow = yttlj["sentence"].str.count(x).sum()
    return nshow

# extract names & filter by frequency
full_wf = {"word": [], "flag": []}
possible_name = yttlj["sentence"].apply(extract_name)
df_wf = pd.DataFrame(full_wf)
df_wf_renming = df_wf[(df_wf.flag == "nr")].drop_duplicates()
df_wf_renming.to_csv("tmp_renming.csv", index=False)
df_wf_renming = pd.read_csv("tmp_renming.csv")
df_wf_renming.head()
df_wf_renming["nshow"] = df_wf_renming.word.apply(check_nshow)
df_wf_renming[df_wf_renming.nshow > 20].to_csv("tmp_filtered_renming.csv", index=False)
df_wf_renming[df_wf_renming.nshow > 20].shape

# merge extracted names with the external name list and reload as a user dictionary
df_wf_renming = pd.read_csv("tmp_filtered_renming.csv")
my_renming = df_wf_renming.word.tolist()
external_renming = pd.read_csv(renming_file, header=None)[0].tolist()
combined_renming = set(my_renming) | set(external_renming)
pd.DataFrame(list(combined_renming)).to_csv("combined_renming.csv", header=None, index=False)
combined_renming_file = "combined_renming.csv"
jieba.load_userdict(combined_renming_file)

# tokenizing
yttlj["token"] = yttlj["sentence"].apply(cut_join)
yttlj["token"].to_csv("tmp_yttlj.csv", header=False, index=False)
sentences = yttlj["token"].str.split(",").tolist()
Word2Vec vectorization training
# Set values for various parameters
num_features = 300 # Word vector dimensionality
min_word_count = 20 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 20 # Context window size
downsampling = 1e-3 # Downsample setting for frequent words
# Initialize and train the model (this will take some time)
from gensim.models import word2vec
model_file_name = 'yttlj_model.txt'
#sentences = w2v.LineSentence('cut_jttlj.csv')
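# Note: gensim >= 4.0 renamed `size` to `vector_size`; the call below uses the older (pre-4.0) API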
model = word2vec.Word2Vec(sentences, workers=num_workers, \
size=num_features, min_count = min_word_count, \
window = context, \
sample = downsampling
)
model.save(model_file_name)
Establish entity relationship matrix
entity = pd.read_csv(combined_renming_file, header=None, index_col=None)
entity = entity.rename(columns={0: "Name"})
entity = entity.set_index(["Name"], drop=False)

# N x N matrix of pairwise similarities, indexed and labelled by character name
ER = pd.DataFrame(np.zeros((entity.shape[0], entity.shape[0]), dtype=np.float32),
                  index=entity["Name"], columns=entity["Name"])
ER["tmp"] = entity.Name

def check_nshow(x):
    nshow = yttlj["sentence"].str.count(x).sum()
    return nshow

ER["nshow"] = ER["tmp"].apply(check_nshow)
ER = ER.drop(["tmp"], axis=1)

count = 0
for i in entity["Name"].tolist():
    count += 1
    if count % round(entity.shape[0] / 10) == 0:
        print("{0:.1f}% relationship has been checked".format(100 * count / entity.shape[0]))
    elif count == entity.shape[0]:
        print("{0:.1f}% relationship has been checked".format(100 * count / entity.shape[0]))
    for j in entity["Name"]:
        relation = 0
        try:
            relation = model.wv.similarity(i, j)
            ER.loc[i, j] = relation
            if i != j:
                ER.loc[j, i] = relation
        except:
            relation = 0

ER.to_hdf("ER.h5", "ER")
NetworkX displays the character relationship diagram
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pygraphviz
from networkx.drawing.nx_agraph import graphviz_layout
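The graph-drawing code itself is omitted here (see the note above for how to get the full source). As a rough sketch of the idea rather than the author's exact code, the entity relationship matrix ER built earlier can be turned into a weighted graph, keeping only pairs whose similarity exceeds a threshold; the threshold and figure size below are arbitrary choices:

# reuse the ER similarity matrix from the previous step (or reload it with pd.read_hdf("ER.h5", "ER"))
names = ER.index.tolist()

G = nx.Graph()
G.add_nodes_from(names)

threshold = 0.5  # arbitrary cut-off for drawing an edge
for i in names:
    for j in names:
        if i < j and ER.loc[i, j] > threshold:
            G.add_edge(i, j, weight=float(ER.loc[i, j]))

plt.figure(figsize=(12, 12))
pos = graphviz_layout(G)  # requires pygraphviz, imported above
# note: rendering Chinese labels may require a CJK-capable matplotlib font
nx.draw(G, pos, with_labels=True, node_size=50, font_size=8)
plt.show()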
About the author
Yong Wang, Python Chinese Community columnist, Snowball (Xueqiu) ID: happy dad. Currently interested in business analysis, Python, machine learning, and Kaggle. 17 years of project management: 11 years as a project manager delivering contracts in the communications industry, and 6 years in manufacturing project management (PMO, change management, production transfer, liquidation and asset disposal). MBA, PMI-PBA, PMP.
Space is limited; to get the complete code, follow the "Programming Dog" public account and reply 0321.