Author: Yishui Hancheng, CSDN blog expert. Research interests: machine learning, deep learning, NLP, CV
Blog: yishuihancheng.blog.csdn.net
In the era of exploding Internet data, vast amounts of data are generated, transmitted, and accumulated every day. Social platforms in particular became information hubs very early on, Weibo (microblogging) among the first, and a great deal of research based on Weibo data analysis has followed. This article is a simple, practical walkthrough of work from my study period: a character (personality) classification system built on Weibo data.
The schematic diagram of the overall project process is as follows:
It is mainly divided into four parts: a data acquisition module, a data preprocessing module, a text vectorization module, and a classification model. Each part is described and explained in detail below.
1. Data acquisition
In the early days Weibo data was fairly open, so collecting it was not difficult, and a certain amount of data was accumulated. The corresponding API interfaces were later closed, and the official restrictions are now strict enough that large-scale collection is hard to carry out effectively over a long period, so the dataset used here comes from historical data resources. Real-time, small-batch Weibo crawling is still relatively simple; if you are interested, you can find it on my blog. A minimal crawling sketch is also given below for reference.
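As a reference point only, here is a minimal sketch of what a small-batch crawl can look like. The endpoint, the containerid format, the field names, and the polling rate are assumptions based on the commonly used mobile-web interface and may well have changed, so treat this as a starting point rather than a working collector:

import time
import requests


def fetch_user_posts(uid, pages=3):
    '''Hypothetical small-batch crawl of one user's recent posts (endpoint and fields are assumptions).'''
    url = 'https://m.weibo.cn/api/container/getIndex'
    posts = []
    for page in range(1, pages + 1):
        params = {'containerid': '107603' + str(uid), 'page': page}
        resp = requests.get(url, params=params, timeout=10)
        for card in resp.json().get('data', {}).get('cards', []):
            mblog = card.get('mblog')
            if mblog:
                posts.append(mblog.get('text', ''))
        time.sleep(1)  # keep the request rate low: small batches only
    return posts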
2. Text data preprocessing
Because online language is random and irregular, the crawled Weibo dataset contains many invalid characters, dirty records, and other noise that would interfere with later use, and the raw data format must be parsed and converted into a structured form. The first step is therefore to preprocess the raw text data; the code is as follows:
def data_prepossing(one_line):
    '''
    Preprocess a single raw text record: strip invalid characters and cut off URLs.
    one_line: a single raw text record
    '''
    sigmod_list = ['，', '。', '（', '）', '-', '——', '\n', '“', '”', '*', '#', '、', '[', ']',
                   '(', ')', '.', '/', '】', '【', '……', '…', '!', '！', ':', '：', '@', '~@',
                   '~', '「', '」', '?', '？', '"', '_', '；', ';', '①', '②', '③', '④', '⑤',
                   '⑥', '⑦', '⑧', '⑨', '⑩', '⑾', '⑿', '⒀', '⒁', '⒂', '&quot;']
    for one_sigmod in sigmod_list:
        one_line = one_line.replace(one_sigmod, ' ')
    # Discard everything from the first URL onwards: links carry no useful text content
    if 'http' in one_line:
        one_line = one_line.split('http')[0].strip()
    if 'https' in one_line:
        one_line = one_line.split('https')[0].strip()
    return one_line
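A quick sanity check on a made-up post (the sample string is arbitrary):

raw = '今天跑步10公里！#打卡# 详情见 http://t.cn/xxxx'
print(data_prepossing(raw))  # punctuation stripped, text truncated before the URL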
After obtaining relatively clean data, we can segment the Weibo text, which is done with the jieba Chinese word segmentation tool. If further analysis of the text content is needed after segmentation, keyword mining and weight calculation can be implemented with the TF-IDF and TextRank algorithms (a small sketch follows the segmentation code below). The segmentation code is as follows:
import jieba


def seg(one_content, stopwords):
    '''
    Segment a single text record with jieba and remove stopwords.
    one_content: a single preprocessed text record
    stopwords: list of stopwords
    '''
    segs = jieba.cut(one_content, cut_all=False)
    seg_set = set(segs) - set(stopwords)
    return seg_set


def append_self_invalid_words(invalid_words='self_invalid_words.txt'):
    '''
    Load a custom file of words to be removed (one word per line).
    '''
    if invalid_words is None:
        word_list = []
    else:
        with open(invalid_words) as f:
            word_list = [one.strip() for one in f.readlines()]
    print(word_list)
    return word_list


def splitData2Words(data_list, invalidwords=None, stopwords=None):
    '''
    Load the data and segment the specified field (the corpus) record by record.
    '''
    if stopwords is None:
        stopwords_list = []
    else:
        with open(stopwords) as s:
            stopwords_list = [one.strip() for one in s.readlines()]
    self_invalid_words_list = append_self_invalid_words(invalid_words=invalidwords)
    print(len(stopwords_list))
    # Merge the stopword list with the custom invalid-word list
    if stopwords_list and self_invalid_words_list:
        stopwords_list += self_invalid_words_list
    elif stopwords_list:
        pass
    else:
        stopwords_list = self_invalid_words_list
    print(len(stopwords_list))
    content_list = []
    for i in range(len(data_list)):
        one_content = data_prepossing(data_list[i].strip())
        one_handle = seg(one_content, stopwords_list)
        one_line = '/'.join(one_handle)
        content_list.append(one_line)
    return content_list
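The keyword mining and weight calculation mentioned earlier is not part of the segmentation code; a minimal sketch of what it can look like with jieba's built-in TF-IDF and TextRank extractors (topK and the sample sentence are arbitrary):

import jieba.analyse

text = '今天天气很好，和朋友一起去公园跑步，心情非常愉快'   # arbitrary sample sentence

# TF-IDF based keywords with weights
print(jieba.analyse.extract_tags(text, topK=5, withWeight=True))

# TextRank based keywords with weights
print(jieba.analyse.textrank(text, topK=5, withWeight=True))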
The code above completes the word segmentation of the Weibo data. Once the segmentation results of the original text are available, the word-vector transformation can be computed.
3. Text vectorization
Vectorization of the Weibo text is based on Word2Vec word vectors: word vectors are first trained on the corpus with Word2Vec, and each text is then represented by a weighted combination of its word vectors (a simple average in the code below). The code is as follows:
import os
import json
import numpy as np
from gensim.models import word2vec


def word2vecModel(data='data/use_handle_split/1007939525.txt', num=10, model_path='my.model'):
    '''
    Train a Word2Vec model on one user's segmented corpus.
    Reference (gensim < 4.0): gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025,
    window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001,
    sg=0, hs=0, negative=5, cbow_mean=1, iter=5, null_word=0, trim_rule=None, sorted_vocab=1,
    batch_words=10000); in gensim >= 4.0 the `size` parameter is called `vector_size`.
    '''
    with open(data) as f:
        data_list = [one.strip().split('/') for one in f.readlines() if one]
    # Keep only documents that contain at least `num` words
    con_list = [one for one in data_list if len(one) >= num]
    model = word2vec.Word2Vec(con_list, sg=1, size=100, window=5, min_count=1,
                              negative=3, sample=0.001, hs=1, workers=4)
    model.save(model_path)
    return con_list, model


def getDocVec(model, word_list=['很', '好', '甲鱼']):
    '''
    Generate the vector of a single text (average of all of its word vectors).
    '''
    vec = np.array([0] * 100, dtype='float32')
    for word in word_list:
        vec += model[word]          # on gensim >= 4.0 use model.wv[word]
    return (vec / len(word_list)).tolist()


def vectorMainFunc(dataDir='data/use_handle_split/', save_path='feature.json'):
    '''
    Document-vectorization main function: one Word2Vec model per user file,
    one feature vector per document, with the user's index appended as the label.
    '''
    txt_list = os.listdir(dataDir)
    id_list = [one.strip().split('.')[0].strip() for one in txt_list]
    map_dict = {}
    res_list = []
    for i in range(len(id_list)):
        map_dict[id_list[i]] = i
    for one_txt in txt_list:
        one_path = dataDir + one_txt
        one_id = one_txt.split('.')[0].strip()
        con_list, model = word2vecModel(data=one_path, num=10, model_path='models/' + one_id + '.model')
        for one_doc in con_list:
            one_doc_vec = getDocVec(model, word_list=one_doc)
            one_doc_vec.append(map_dict[one_id])
            res_list.append(one_doc_vec)
    with open(save_path, 'w') as f:
        f.write(json.dumps(res_list))
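A short usage note: running the main function writes feature.json, in which each row is a 100-dimensional document vector with the user's integer label appended as the last element (the paths simply follow the defaults above):

# Assumes data/use_handle_split/ holds one segmented .txt file per Weibo user id
vectorMainFunc(dataDir='data/use_handle_split/', save_path='feature.json')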
After the transformation above, we obtain the feature-vector representation of the original text corpus, which can then be used by the classification models.
4. Text classification model
For the text classification model, the decision tree and random forest models are chosen. Both are fairly basic and are not explained in depth here; the code is as follows:
import json
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


def DTModel(data='feature.json', rationum=0.30, model_path='Results/DT.pkl'):
    '''
    Train and evaluate a decision tree model.
    '''
    with open(data) as f:
        data_list = json.load(f)
    print(data_list[0])
    x_list = [one[:-1] for one in data_list]   # 100-dimensional document vectors
    y_list = [one[-1] for one in data_list]    # last column: user label
    X_train, X_test, y_train, y_test = split_data(x_list, y_list, ratio=rationum)
    DT = DecisionTreeClassifier()
    DT.fit(X_train, y_train)
    y_predict = DT.predict(X_test)
    print('DT model accuracy: ', DT.score(X_test, y_test))
    saveModel(DT, save_path=model_path)


def RFModel(data='feature.json', rationum=0.30, model_path='Results/RF.pkl'):
    '''
    Train and evaluate a random forest model.
    '''
    with open(data) as f:
        data_list = json.load(f)
    print(data_list[0])
    x_list = [one[:-1] for one in data_list]
    y_list = [one[-1] for one in data_list]
    X_train, X_test, y_train, y_test = split_data(x_list, y_list, ratio=rationum)
    RF = RandomForestClassifier()
    RF.fit(X_train, y_train)
    y_predict = RF.predict(X_test)
    print('RF model accuracy: ', RF.score(X_test, y_test))
    saveModel(RF, save_path=model_path)
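The helper functions split_data and saveModel are called above but not shown in the article; a minimal sketch of plausible implementations, assuming scikit-learn's train_test_split and joblib (the names and signatures simply mirror the calls above, not the original code):

import os
import joblib
from sklearn.model_selection import train_test_split


def split_data(x_list, y_list, ratio=0.30):
    # Hypothetical helper: hold out `ratio` of the samples as the test set
    return train_test_split(x_list, y_list, test_size=ratio, random_state=42)


def saveModel(model, save_path='Results/model.pkl'):
    # Hypothetical helper: persist the trained model to disk with joblib
    os.makedirs(os.path.dirname(save_path), exist_ok=True)
    joblib.dump(model, save_path)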
The four parts above have been presented together with their concrete code, and interested readers can try them out. For the experiments I selected data from four Weibo IDs, corresponding to four different characters. To give a more intuitive view of how the data of the different personalities differ, a simple word-cloud visualization was produced for each (a sketch of how such clouds can be generated follows the four figures):
Character 1:
Character 2:
Character 3:
Character 4:
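A minimal sketch of how a word cloud like these can be generated from one user's segmented file, assuming the wordcloud package and a local Chinese font file (both the data path and the font path are assumptions):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

seg_file = 'data/use_handle_split/1007939525.txt'   # assumed path: one segmented file per user
with open(seg_file, encoding='utf-8') as f:
    text = ' '.join(f.read().split('/'))

wc = WordCloud(font_path='simhei.ttf', background_color='white',
               width=800, height=600).generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()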
The word clouds are visibly quite distinct from one another. A simple visual analysis of the decision tree classification results was also made, as shown below:
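A minimal sketch of one way to visualize the test-set predictions of the decision tree, using scikit-learn's ConfusionMatrixDisplay (an assumption for illustration, not necessarily the plot used in the article):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay


def plot_dt_results(y_test, y_predict):
    # y_test / y_predict: true and predicted labels from the decision tree's test split above
    ConfusionMatrixDisplay.from_predictions(y_test, y_predict)
    plt.title('Decision tree predictions on the test set')
    plt.show()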
That is the end of this article. I am very glad to review my own knowledge while writing something to share; if you find the content useful or enlightening, I hope to have your encouragement and support!