[Python Artificial Intelligence] Part 23: Word2Vec+CNN Chinese Text Classification and a Comparison with Machine Learning Algorithms, by Eastmount
I. Text classification
Text classification aims to automatically assign documents to categories according to a given classification scheme or standard. Its history goes back to the 1950s, when classification was done with rules defined by experts. The 1980s saw the emergence of expert systems based on knowledge engineering. From the 1990s onwards, text classification relied on machine learning with hand-crafted feature engineering and shallow classification models. Today, word vectors and deep neural networks are widely used for text classification.
Teacher Niu Yafeng summarized the traditional text classification process as shown in the figure below. In traditional text classification, almost all common machine learning methods have been applied, mainly including:
- Naive Bayes
- KNN
- SVM
- Random forest / decision tree
- Ensemble methods
- Maximum entropy
- Neural networks
The basic process of text classification with the Keras framework is as follows:
- Step 1: Preprocess the text: word segmentation -> stop word removal -> select the top N most frequent words as feature words
- Step 2: Generate an id for each feature word
- Step 3: Convert each text into a sequence of ids and pad it on the left
- Step 4: Shuffle the training set
- Step 5: Use an Embedding layer to convert words into word vectors
- Step 6: Add layers to the model and build the neural network structure
- Step 7: Train the model
- Step 8: Obtain the precision, recall and F1 score
Note that if TF-IDF is used instead of word vectors for document representation, the TF-IDF matrix is generated directly after word segmentation and fed into the model. In this article, both word vectors and TF-IDF are used in the experiments.
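As a rough illustration (not the article's exact code), the two representations differ only in how the feature matrix is built; the toy snippet below assumes a small list of already-segmented sentences:

from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

docs = ["瀑布 很 壮观", "景区 排队 太久"]   # hypothetical segmented sentences

# Representation 1: TF-IDF matrix, fed directly to a classical classifier
tfidf = TfidfVectorizer().fit_transform(docs)                   # shape (n_docs, n_terms)

# Representation 2: id sequences, left-padded, fed to an Embedding layer
tok = Tokenizer()
tok.fit_on_texts(docs)
seqs = pad_sequences(tok.texts_to_sequences(docs), maxlen=10)   # shape (n_docs, 10)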
A Zhihu column article (zhuanlan.zhihu.com/p/34212945) summarizes deep-learning-based text classification into five main categories:
- Word embedding vectorization: Word2vec, FastText, etc
- Convolutional neural network feature extraction: TextCNN (convolutional neural network), char-CNN, etc
- Context mechanism: TextRNN (recurrent neural network), BiRNN, BiLSTM, RCNN, TextRCNN(TextRNN+CNN), etc
- Memory storage mechanism: EntNet, DMN, etc
- Attention mechanism: HAN, TextRNN+Attention, etc
Article recommended by Teacher Niu Yafeng:
- Text classification based on Word2vec and CNN: A review & Practice
II. Text classification based on random forest
This part walks through a common text classification case. Since random forest performs well on it, that is the method shared in detail here. The specific steps are listed below, followed by a compact sketch of the pipeline:
- Read the CSV file of Chinese reviews
- Call the Jieba library for Chinese word segmentation and data cleaning
- Feature extraction with TF-IDF or Word2Vec word vectors
- Classification with machine learning algorithms
- Precision, recall and F1 calculation and evaluation
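As a compact, illustrative sketch of these steps (not the code used in the following subsections, which builds the TF-IDF matrix step by step), the same pipeline can be written with a scikit-learn Pipeline; the data.csv layout with label and content columns is assumed from the code below:

import csv
import jieba
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# read the reviews and segment them with jieba
with open("data.csv", "r", encoding="UTF-8") as f:
    rows = list(csv.DictReader(f))
texts = [" ".join(jieba.cut(r["content"])) for r in rows]      # space-joined tokens
labels = [0 if r["label"] == "好评" else 1 for r in rows]      # 0 = positive, 1 = negative

# TF-IDF features + random forest in one pipeline
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=1)
clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=20))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))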
1. Text classification
(1) Data set
The data in this article are recent tourism reviews of Huangguoshu Waterfall in Guizhou, collected from Dianping.com: 240 reviews in total, of which 114 are negative and 126 are positive, as shown in the figure below:
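The code in this article reads the columns label and content from data.csv, so the file is presumably laid out roughly like this (the rows here are made up for illustration, not taken from the actual data set):

label,content
好评,"瀑布非常壮观,风景很美"
差评,"排队时间太长,体验很差"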
(2) Random forest text classification
The implementation process is not described in detail again, since it has been covered in many previous articles; the source code below is thoroughly commented for reference.
# -*- coding:utf-8 -*-
import csv
import numpy as np
import jieba
import jieba.analyse
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

#----------------------------------Step 1: read the file--------------------------------
file = "data.csv"
with open(file, "r", encoding="UTF-8") as f:
    # read the file with csv.DictReader
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        # label: 0 for '好评' (positive review), 1 otherwise (negative review)
        if row['label'] == '好评':
            res = 0
        else:
            res = 1
        labels.append(res)
        # Chinese word segmentation
        content = row['content']
        seglist = jieba.cut(content, cut_all=False)  # accurate mode
        output = ' '.join(list(seglist))
        #print(output)
        contents.append(output)
print(labels[:5])
print(contents[:5])

#----------------------------------Step 2: data preprocessing--------------------------------
# Convert the words into a term-frequency matrix: element a[i][j] is the frequency of word j in document i
vectorizer = CountVectorizer()
# This class computes the TF-IDF weight of each word
transformer = TfidfTransformer()
# The first fit_transform computes TF-IDF, the second converts the texts into the term-frequency matrix
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
#tfidf = tfidf.astype(np.float32)
print(type(tfidf))

# Get all the words in the bag-of-words model
word = vectorizer.get_feature_names()
for n in word[:5]:
    print(n)
print("Number of words:", len(word))

# Convert TF-IDF to an array
X = tfidf.toarray()
print(X.shape)

# Use train_test_split to split the X and y lists
# X_train y_train (one-to-one correspondence) -->> X_test y_test used to test the accuracy of the model
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=1)

#----------------------------------Step 3: machine learning classification--------------------------------
# Random forest classifier; n_estimators: number of trees
clf = RandomForestClassifier(n_estimators=20)
clf.fit(X_train, y_train)
print('Model accuracy: {}'.format(clf.score(X_test, y_test)))
pre = clf.predict(X_test)
print('Prediction results:', pre[:10])
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
The output is shown in the figure below: the average precision, recall and F1 of the random forest are all 0.86.
2. Algorithm evaluation
Next, custom Precision, Recall and F-measure calculations are written, using the usual formulas:
- Precision = number of samples correctly assigned to a class / number of samples predicted as that class
- Recall = number of samples correctly assigned to a class / number of samples of that class in the test set
- F-measure = 2 × Precision × Recall / (Precision + Recall)
Since this is a binary classification problem, the evaluation distinguishes class 0 (positive reviews) and class 1 (negative reviews). Steps 1-3 are the same as in the previous listing; the added evaluation function and its call are as follows:
#----------------------------------Step 4: evaluation--------------------------------
# (Steps 1-3 -- reading data.csv, building the TF-IDF matrix and training the random forest --
#  are identical to the previous listing and are omitted here.)
def classification_pj(name, y_test, pre):
    print("Algorithm evaluation:", name)
    # Precision = correctly identified samples of a class / all samples predicted as that class
    # Recall    = correctly identified samples of a class / all samples of that class in the test set
    # F-measure = 2 * Precision * Recall / (Precision + Recall)
    YC_B, YC_G = 0, 0   # predicted as bad / good
    ZQ_B, ZQ_G = 0, 0   # correctly predicted bad / good
    CZ_B, CZ_G = 0, 0   # actually bad / good
    i = 0
    while i < len(pre):
        z = int(y_test[i])   # true label
        y = int(pre[i])      # predicted label
        if z == 0:
            CZ_G += 1
        else:
            CZ_B += 1
        if y == 0:
            YC_G += 1
        else:
            YC_B += 1
        if z == y and z == 0 and y == 0:
            ZQ_G += 1
        elif z == y and z == 1 and y == 1:
            ZQ_B += 1
        i = i + 1
    print(ZQ_B, ZQ_G, YC_B, YC_G, CZ_B, CZ_G)

    # precision, recall and F-measure for the two classes
    P_G = ZQ_G * 1.0 / YC_G
    P_B = ZQ_B * 1.0 / YC_B
    print("Precision Good 0:", P_G)
    print("Precision Bad 1:", P_B)
    R_G = ZQ_G * 1.0 / CZ_G
    R_B = ZQ_B * 1.0 / CZ_B
    print("Recall Good 0:", R_G)
    print("Recall Bad 1:", R_B)
    F_G = 2 * P_G * R_G / (P_G + R_G)
    F_B = 2 * P_B * R_B / (P_B + R_B)
    print("F-measure Good 0:", F_G)
    print("F-measure Bad 1:", F_B)

# evaluate the random forest predictions
classification_pj("RandomForest", y_test, pre)
The output is shown in the figure below: for positive reviews the precision, recall and F1 are 0.9268, 0.9268 and 0.9268 respectively, and for negative reviews they are 0.9032, 0.9032 and 0.9032.
3. Algorithm comparison
Finally, the text classification results of several machine learning algorithms are compared: RF, DTC, SVM, KNN, NB and LR. This kind of comparison is a common practice when writing papers.
# Additional classifiers compared in this listing
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

#----------------------------------Step 3: machine learning classification--------------------------------
# (Steps 1-2 -- reading data.csv and building the TF-IDF matrix -- and the evaluation function
#  classification_pj are the same as above; classification_pj here additionally prints the averages
#  Avg_precision, Avg_recall and Avg_fmeasure as (P_G+P_B)/2, (R_G+R_B)/2 and (F_G+F_B)/2.)

# Random forest classification
rf = RandomForestClassifier(n_estimators=20)
rf.fit(X_train, y_train)
pre = rf.predict(X_test)
print("Random forest classification")
print(classification_report(y_test, pre))
classification_pj("RandomForest", y_test, pre)
print("\n")

# Decision tree classification
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
pre = dtc.predict(X_test)
print("Decision tree classification")
print(classification_report(y_test, pre))
classification_pj("DecisionTree", y_test, pre)
print("\n")

# Support vector machine classification
SVM = svm.LinearSVC()
SVM.fit(X_train, y_train)
pre = SVM.predict(X_test)
print("Support vector machine classification")
print(classification_report(y_test, pre))
classification_pj("LinearSVC", y_test, pre)
print("\n")

# KNN classification
knn = neighbors.KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
pre = knn.predict(X_test)
print("Nearest neighbor classification")
print(classification_report(y_test, pre))
classification_pj("KNN", y_test, pre)
print("\n")

# Naive Bayes classification
nb = MultinomialNB()
nb.fit(X_train, y_train)
pre = nb.predict(X_test)
print("Naive Bayes classification")
print(classification_report(y_test, pre))
classification_pj("MultinomialNB", y_test, pre)
print("\n")

# Logistic regression classification
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
pre = LR.predict(X_test)
print("Logistic regression classification")
print(classification_report(y_test, pre))
classification_pj("LogisticRegression", y_test, pre)
print("\n")
The output is shown below. The naive Bayes algorithm performs very well on this text classification task, and random forest, logistic regression and SVM also give good results.
The full results are as follows:
Random forest classification
              precision    recall  f1-score   support
           0       0.92      0.88      0.90        41
           1       0.85      0.90      0.88        31
    accuracy                           0.89        72
   macro avg       0.89      0.89      0.89        72
weighted avg       0.89      0.89      0.89        72

Algorithm evaluation: RandomForest
28 36 33 39 31 41
Precision Good 0:0.9231
Precision Bad 1:0.8485
Avg_precision:0.8858
Recall Good 0:0.8780
Recall Bad 1:0.9032
Avg_recall:0.8906
F-measure Good 0:0.9000
F-measure Bad 1:0.8750
Avg_fmeasure:0.8875

Decision tree classification
              precision    recall  f1-score   support
           0       0.81      0.73      0.77        41
           1       0.69      0.77      0.73        31
    accuracy                           0.75        72
   macro avg       0.75      0.75      0.75        72
weighted avg       0.76      0.75      0.75        72

Algorithm evaluation: DecisionTree
24 30 35 37 31 41
Precision Good 0:0.8108
Precision Bad 1:0.6857
Avg_precision:0.7483
Recall Good 0:0.7317
Recall Bad 1:0.7742
Avg_recall:0.7530
F-measure Good 0:0.7692
F-measure Bad 1:0.7273
Avg_fmeasure:0.7483

Support vector machine classification
...
Nearest neighbor classification
...
Naive Bayes classification
...
Logistic regression classification
...
III. Text classification based on CNN
Next we implement text classification with a CNN. The approach can be applied to many fields as long as a data set is available. Only the most basic, usable method and source code are given here; I hope it helps.
1. Data preprocessing
Chinese word segmentation and other preprocessing operations were already introduced in the machine learning part above, so why cover preprocessing again? Because two new operations are added here:
- Stop word removal
- Part-of-speech (POS) tagging
Both operations are very important in text mining: on the one hand they can improve classification performance, and on the other hand they filter out irrelevant feature words. POS tagging can also support other analyses, such as sentiment analysis and public opinion mining.
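For reference, a minimal example of jieba's POS tagging. The tag meanings are my reading of the jieba/ICTCLAS tag set, not something stated in the original article: nr = person name, ns = place name, nt = organization, m = numeral, r = pronoun, t = time word, and so on; these are the tags filtered out in the code below.

import jieba.posseg as pseg

# print each token with its POS flag; tokens tagged nr/ns/nt/m/r/t etc. will be dropped later
for word, flag in pseg.cut("我们昨天在瀑布景区排队两小时"):
    print(word, flag)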
This part of the code is as follows:
# -*- coding:utf-8 -*-
import csv
import numpy as np
import jieba
import jieba.analyse
import jieba.posseg as pseg
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

#----------------------------------Step 1: data preprocessing--------------------------------
file = "data.csv"

# Get stop words
def stopwordslist():  # load the stop word list
    stopwords = [line.strip() for line in open('stop_words.txt', encoding="UTF-8").readlines()]
    return stopwords

# Remove stop words
def deleteStop(sentence):
    stopwords = stopwordslist()
    outstr = ""
    for i in sentence:
        # print(i)
        if i not in stopwords and i != "\n":
            outstr += i
    return outstr

# Chinese word segmentation
Mat = []
with open(file, "r", encoding="UTF-8") as f:
    # read the file with csv.DictReader
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        # label: 0 for '好评' (positive review), 1 otherwise (negative review)
        if row['label'] == '好评':
            res = 0
        else:
            res = 1
        labels.append(res)
        # Chinese word segmentation
        content = row['content']
        #print(content)
        seglist = jieba.cut(content, cut_all=False)  # accurate mode
        # remove stop words (note: the result has no spaces at this point)
        stc = deleteStop(seglist)
        # segment again and join with spaces
        seg_list = jieba.cut(stc, cut_all=False)
        output = ' '.join(list(seg_list))
        #print(output)
        contents.append(output)
        # POS tagging: filter out names, numerals, pronouns, time words, etc.
        res = pseg.cut(stc)
        seten = []
        for word, flag in res:
            if flag not in ['nr','ns','nt','mz','m','f','ul','l','r','t']:
                #print(word, flag)
                seten.append(word)
        Mat.append(seten)

print(labels[:5])
print(contents[:5])
print(Mat[:5])

# Write the segmented text to a file
fileDic = open('wordCut.txt', 'w', encoding="UTF-8")
for i in Mat:
    fileDic.write(" ".join(i))
    fileDic.write('\n')
fileDic.close()
words = [line.strip().split(" ") for line in open('wordCut.txt', encoding='UTF-8').readlines()]
print(words[:5])
The running result is shown in the figure below. The original text is segmented, stop words such as "still" and "often" are filtered out, and the result is kept in two forms; a made-up example of both follows the list below. The segmented text is also written to the wordCut.txt file.
- contents: the segmented sentences as space-joined strings, stored in a list
- Mat: the segmented word sequences, each stored as a list of tokens
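A made-up example of the difference between the two structures (the actual entries depend on the review text):

contents[0]   # -> '瀑布 壮观 排队 两小时'            one space-joined string per review
Mat[0]        # -> ['瀑布', '壮观', '排队', '两小时']  one token list per review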
2. Feature extraction and Word2Vec word vector conversion
(1) Feature word numbering
First, Tokenizer and its fit_on_texts method are called to assign a number to each word in the text; the more frequent a word, the smaller its number. As shown in the figure below, words such as "waterfall", "scenic spot", "queue" and "Water Curtain Cave" appear frequently. Note that the space character and tokens such as "comment" and "collapse" could be filtered out further by adding them to the stop word list.
# fit_on_texts numbers each word of the input texts by frequency (the more frequent, the smaller the number)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(Mat)
vocab = tokenizer.word_index  # stop words already filtered; get the number of each word
print(vocab)
The output (the vocab dictionary mapping each word to its number) is shown in the figure below.
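As a rough illustration of what this mapping looks like, on a tiny hypothetical corpus (not the actual data):

from keras.preprocessing.text import Tokenizer

tok = Tokenizer()
tok.fit_on_texts([['瀑布', '壮观'], ['瀑布', '排队']])
print(tok.word_index)   # roughly {'瀑布': 1, '壮观': 2, '排队': 3} -- the most frequent word gets the smallest id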
(2) Word2Vec word vector training
After the feature words are numbered, each line of text has to be converted into a sequence of word ids, from which the feature matrix for training and classification is built. Note that pad_sequences is used to give all sequences the same length so the CNN can train on them: with a length of 100, sequences longer than 100 are truncated, and shorter ones are padded with 0 at the front, as shown below. The labels are one-hot encoded: [1,0] corresponds to class 0 (positive) and [0,1] to class 1 (negative).
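A minimal illustration of the two conversions described above, with toy ids rather than the real data:

from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

print(pad_sequences([[7, 2, 9]], maxlen=5))            # [[0 0 7 2 9]]   zeros are added on the left
print(pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))   # [[2 3 4 5 6]]   by default truncation also happens at the front
print(to_categorical([0, 1], num_classes=2))           # [[1. 0.] [0. 1.]]  label 0 -> [1,0], label 1 -> [0,1]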
The complete code is as follows:
# Use train_test_split to split the X and y lists
X_train, X_test, y_train, y_test = train_test_split(Mat, labels, test_size=0.3, random_state=1)
print(X_train[:5])
print(y_train[:5])

#----------------------------------Step 3: word vector construction--------------------------------
# Word2Vec training
maxLen = 100           # maximum length of a word sequence
num_features = 100     # word vector dimensionality
min_word_count = 3     # minimum frequency for a word to be considered
num_workers = 4        # number of CPU cores used for parallel training
context = 4            # context window size

# Set up the model
model = word2vec.Word2Vec(Mat, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context)
# Force unit normalization
model.init_sims(replace=True)
# Save the trained model (the ./data/model directory must already exist)
model.save("CNNw2vModel")
model.wv.save_word2vec_format("CNNVector", binary=False)
print(model)

# Load the model (use this directly if word2vec has already been trained)
w2v_model = word2vec.Word2Vec.load("CNNw2vModel")

# Feature numbering (shorter sequences are padded with 0 at the front)
trainID = tokenizer.texts_to_sequences(X_train)
print(trainID)
testID = tokenizer.texts_to_sequences(X_test)
print(testID)

# Give all sequences the same length for CNN training
trainSeq = pad_sequences(trainID, maxlen=maxLen)
print(trainSeq)
testSeq = pad_sequences(testID, maxlen=maxLen)
print(testSeq)

# One-hot encode the labels (binary classification)
trainCate = to_categorical(y_train, num_classes=2)
print(trainCate)
testCate = to_categorical(y_test, num_classes=2)
print(testCate)
The following output is displayed:
[['view', 'area', 'too much', 'mature', 'from', 'big', 'waterfall', 'area', 'departure', 'area', 'sight-seeing cars', ...],
 ['off-season', 'waterfall', 'people', 'less', 'scenery', 'ticket', 'cheap', 'worth it', 'to'],
 ['waterfall', 'experience', 'bad', 'five-star', 'good', 'whole', 'road', 'narrow', 'jam', 'line', ...],
 ['dad', 'points', 'waterfall', 'waterfall', 'waterfall', 'tickets', ...],
 ['family', 'ticket', 'residents', 'exclusive', 'offer', 'ticket']]
[1, 0, 1, 1, 1]
Word2Vec(vocab=718, size=100, alpha=0.025)
[[   0    0    0 ... 2481    5    4]
 [   0    0    0 ...  570   52   90]
 [   0    0    0 ...  187    5    4]
 ...
 [   0    0    0 ...   93    5    4]
 [   0    0    0 ...   30    5    4]
 [   0    0    0 ...   81   18   78]]
[[0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 ...]
3. CNN construction
Next, the constructed feature matrix is used for training: similarities between texts (represented as one-dimensional vectors) are learned so that positive and negative review sentences can be separated into two classes. Word2Vec is again used here; the core code is as follows:
model = word2vec.Word2Vec(Mat, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context)
The trained model prints as "Word2Vec(vocab=718, size=100, alpha=0.025)". The minimum frequency is set to 3, so words appearing fewer than 3 times are filtered out, leaving 718 feature words. num_features is 100, i.e. 100-dimensional word vectors. sg defaults to the continuous bag-of-words (CBOW) model and can be set to 1 for the skip-gram model; the default optimization method is negative sampling. For more parameter explanations, please look up the gensim documentation.
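For reference, the same call with those implicit parameters written out explicitly; these are gensim's defaults for the 3.x versions this article appears to use (in gensim 4.x the size argument is renamed vector_size):

from gensim.models import word2vec

model = word2vec.Word2Vec(
    Mat,                 # tokenized sentences
    size=100,            # word vector dimensionality (vector_size in gensim 4.x)
    window=4,            # context window size
    min_count=3,         # ignore words appearing fewer than 3 times
    workers=4,           # parallel training threads
    sg=0,                # 0 = CBOW (the default), 1 = skip-gram
    hs=0, negative=5)    # negative sampling (the default) instead of hierarchical softmax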
A related article by the author:
- Word2Vec word vectors with Gensim: installation and similarity calculation on Chinese short texts from "Qing Yu Nian"
What happens if a word appears in the test set but not in the training vocabulary? When looking up the word vector of each word to build the training matrix, a try-except block is used: if a word is not found it is simply skipped, and its row in the matrix stays filled with 0.
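A focused sketch of that lookup. Note that indexing the model directly (w2v_model[word]) only works in older gensim versions; going through w2v_model.wv[word] works both in older versions and in gensim 4.x, so a slightly more defensive variant could look like this:

import numpy as np

embedding_matrix = np.zeros((len(vocab) + 1, 100))
for word, i in vocab.items():
    try:
        embedding_matrix[i] = w2v_model.wv[str(word)]   # KeyError if the word was filtered out by min_count
    except KeyError:
        continue   # leave the row as zeros for unknown words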
This part of the code is as follows:
#----------------------------------Step 4: CNN construction--------------------------------
# Build the Embedding training matrix from the trained Word2vec model; each row represents a word
embedding_matrix = np.zeros((len(vocab)+1, 100))  # counting starts at 0; +1 matches the feature numbering
for word, i in vocab.items():
    try:
        # fetch the word vector and place it in the training matrix
        embedding_vector = w2v_model[str(word)]
        embedding_matrix[i] = embedding_vector
    except KeyError:  # word not found, skip it
        continue

# Model input
main_input = Input(shape=(maxLen,), dtype='float64')
# Word embedding: pre-trained Word2Vec vectors as a custom weight matrix; 100 is the output vector dimension
embedder = Embedding(len(vocab)+1, 100, input_length=maxLen,
                     weights=[embedding_matrix], trainable=False)  # not trained any further

# Build the model
model = Sequential()
model.add(embedder)  # Embedding layer
model.add(Conv1D(256, 3, padding='same', activation='relu'))  # convolution layer, kernel size 3
model.add(MaxPool1D(maxLen-5, 3, padding='same'))  # pooling layer
model.add(Conv1D(32, 3, padding='same', activation='relu'))  # convolution layer
model.add(Flatten())  # flatten
model.add(Dropout(0.3))  # dropout against overfitting: 30% of units not trained
model.add(Dense(256, activation='relu'))  # fully connected layer
model.add(Dropout(0.2))  # dropout against overfitting
model.add(Dense(units=2, activation='softmax'))  # output layer

# Compile the network
model.compile(optimizer='adam',                   # optimizer
              loss='categorical_crossentropy',    # loss
              metrics=['accuracy'])               # metric

# Train (training data, training labels, batch size 256, 6 epochs, 20% validation split)
history = model.fit(trainSeq, trainCate, batch_size=256,
                    epochs=6, validation_split=0.2)
model.save("TextCNN")

#----------------------------------Step 5: model prediction--------------------------------
mainModel = load_model("TextCNN")
result = mainModel.predict(testSeq)  # test samples
print(result)
print(np.argmax(result, axis=1))
score = mainModel.evaluate(testSeq, testCate, batch_size=32)
print(score)
The constructed model is as follows:
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_2 (Embedding) (None, 100, 100) 290400 _________________________________________________________________ conv1d_1 (Conv1D) (None, 100, 256) 77056 _________________________________________________________________ max_pooling1d_1 (MaxPooling1 (None, 34, 256) 0 _________________________________________________________________ conv1d_2 (Conv1D) (None, 34, 32) 24608 _________________________________________________________________ flatten_1 (Flatten) (None, 1088) 0 _________________________________________________________________ dropout_1 (Dropout) (None, 1088) 0 _________________________________________________________________ dense_1 (Dense) (None, 256) 278784 _________________________________________________________________ dropout_2 (Dropout) (None, 256) 0 _________________________________________________________________ dense_2 (Dense) (None, 2) 514 = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = Total params: 671362 Trainable params: 380962 Non - trainable params: 290400Copy the code
The output is shown in the figure below. The prediction result of this model is not ideal: the accuracy is only 0.625. Why? The author is still studying how to optimize deep models; the main purpose of this article is to provide a usable method, so please bear with the modest results.
4. Test visualization
Finally, visualization code is added to draw the curves shown below. Again, the result is not ideal: the loss does not decrease steadily and the accuracy does not keep increasing. If readers find the cause or an optimization method, please let me know, thank you.
Finally, attach the complete code:
# -*- coding:utf-8 -*-
import csv
import numpy as np
import jieba
import jieba.analyse
import jieba.posseg as pseg
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from keras import models
from keras import layers
from keras import Input
from gensim.models import word2vec
from keras.preprocessing.text import Tokenizer
from keras.utils.np_utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.models import Sequential
from keras.models import load_model
from keras.layers import Flatten, Dense, Dropout, Conv1D, MaxPool1D, Embedding
#----------------------------------Step 1: data preprocessing--------------------------------
file = "data.csv"

# Get stop words
def stopwordslist():  # load the stop word list
    stopwords = [line.strip() for line in open('stop_words.txt', encoding="UTF-8").readlines()]
    return stopwords

# Remove stop words
def deleteStop(sentence):
    stopwords = stopwordslist()
    outstr = ""
    for i in sentence:
        # print(i)
        if i not in stopwords and i != "\n":
            outstr += i
    return outstr

# Chinese word segmentation
Mat = []
with open(file, "r", encoding="UTF-8") as f:
    # read the file with csv.DictReader
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        # get the data elements
        if row['label'] == '好评':
            res = 0
        else:
            res = 1
        labels.append(res)
        # Chinese word segmentation
        content = row['content']
        #print(content)
        seglist = jieba.cut(content, cut_all=False)  # accurate mode
        #print(seglist)
        # remove stop words
        stc = deleteStop(seglist)  # note: no spaces in the sentence at this point
        # join with spaces
        seg_list = jieba.cut(stc, cut_all=False)
        output = ' '.join(list(seg_list))
        #print(output)
        contents.append(output)
        # POS tagging
        res = pseg.cut(stc)
        seten = []
        for word, flag in res:
            if flag not in ['nr','ns','nt','mz','m','f','ul','l','r','t']:
                #print(word, flag)
                seten.append(word)
        Mat.append(seten)

print(labels[:5])
print(contents[:5])
print(Mat[:5])

#----------------------------------Step 2: feature numbering--------------------------------
# fit_on_texts numbers each word of the input texts by frequency (the more frequent, the smaller the number)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(Mat)
vocab = tokenizer.word_index  # stop words already filtered; get the number of each word
print(vocab)

# Use train_test_split to split the X and y lists
X_train, X_test, y_train, y_test = train_test_split(Mat, labels, test_size=0.3, random_state=1)
print(X_train[:5])
print(y_train[:5])

#----------------------------------Step 3: word vector construction--------------------------------
# Word2Vec training
maxLen = 100           # maximum length of a word sequence
num_features = 100     # word vector dimensionality
min_word_count = 3     # minimum frequency for a word to be considered
num_workers = 4        # number of CPU cores used for parallel training
context = 4            # context window size

# Set up the model
model = word2vec.Word2Vec(Mat, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context)
# Force unit normalization
model.init_sims(replace=True)
# Save the trained model to a path (the ./data/model directory must already exist)
model.save("CNNw2vModel")
model.wv.save_word2vec_format("CNNVector", binary=False)
print(model)

# Load the model (use this directly if word2vec has already been trained)
w2v_model = word2vec.Word2Vec.load("CNNw2vModel")

# Feature numbering (shorter sequences are padded with 0 at the front)
trainID = tokenizer.texts_to_sequences(X_train)
print(trainID)
testID = tokenizer.texts_to_sequences(X_test)
print(testID)

# Give all sequences the same length for CNN training
trainSeq = pad_sequences(trainID, maxlen=maxLen)
print(trainSeq)
testSeq = pad_sequences(testID, maxlen=maxLen)
print(testSeq)

# One-hot encode the labels (binary classification)
trainCate = to_categorical(y_train, num_classes=2)
print(trainCate)
testCate = to_categorical(y_test, num_classes=2)
print(testCate)

#----------------------------------Step 4: CNN construction--------------------------------
# Build the Embedding training matrix from the trained Word2vec model; each row represents a word
embedding_matrix = np.zeros((len(vocab)+1, 100))  # counting starts at 0; +1 matches the feature numbering
for word, i in vocab.items():
    try:
        # fetch the word vector and place it in the training matrix
        embedding_vector = w2v_model[str(word)]
        embedding_matrix[i] = embedding_vector
    except KeyError:  # word not found, skip it
        continue

# Model input
main_input = Input(shape=(maxLen,), dtype='float64')
# Word embedding: pre-trained Word2Vec vectors as a custom weight matrix; 100 is the output vector dimension
embedder = Embedding(len(vocab)+1, 100, input_length=maxLen,
                     weights=[embedding_matrix], trainable=False)  # not trained any further

# Build the model
model = Sequential()
model.add(embedder)  # Embedding layer
model.add(Conv1D(256, 3, padding='same', activation='relu'))  # convolution layer, kernel size 3
model.add(MaxPool1D(maxLen-5, 3, padding='same'))  # pooling layer
model.add(Conv1D(32, 3, padding='same', activation='relu'))  # convolution layer
model.add(Flatten())  # flatten
model.add(Dropout(0.3))  # dropout against overfitting: 30% of units not trained
model.add(Dense(256, activation='relu'))  # fully connected layer
model.add(Dropout(0.2))  # dropout against overfitting
model.add(Dense(units=2, activation='softmax'))  # output layer

# Model summary
model.summary()

# Compile the network
model.compile(optimizer = 'adam',                  # optimizer
              loss = 'categorical_crossentropy',   # loss
              metrics = ['accuracy']               # metric
              )

# Train (training data, training labels, batch size 256, 6 epochs, 20% validation split)
history = model.fit(trainSeq, trainCate, batch_size=256,
                    epochs=6, validation_split=0.2)
model.save("TextCNN")

#----------------------------------Step 5: model prediction--------------------------------
# Prediction and evaluation
mainModel = load_model("TextCNN")
result = mainModel.predict(testSeq)  # test samples
print(result)
print(np.argmax(result, axis=1))
score = mainModel.evaluate(testSeq,
                           testCate,
                           batch_size=32)
print(score)

#----------------------------------Step 6: visualization--------------------------------
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train','Valid'], loc='upper left')

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train','Valid'], loc='upper left')
plt.show()
IV. Summary
In short, this article implements a CNN text classification case with Keras, introduces the basic principles of text classification, and compares the result with traditional machine learning methods.