Requirements
Use supervised learning on historical data to train a model that predicts the category of a piece of text.
Sample cleaning
Remove duplicate records, correct or delete invalid ones, and check the data for consistency. For example, here any line shorter than 13 characters is treated as invalid and deleted.
```python
def writeFile(text):
    # Write the cleaned samples out to the result file
    file_object = open('result.txt', 'w')
    file_object.write(text)
    file_object.close()

def clear():
    text = ""
    file_obj = open("deal.txt")
    list_of_lines = file_obj.readlines()
    for line in list_of_lines:
        # Lines of 13 characters or fewer are treated as invalid and dropped
        if len(line) > 13:
            text += line
    writeFile(text)
    file_obj.close()
```
Define the set of categories
Classify the sample set manually, for example into the following categories:
No. | Category |
---|---|
1 | Environmental protection |
2 | Traffic |
3 | Mobile phone |
4 | Law |
5 | Car |
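For training, each category name needs to map to an integer label. A minimal sketch (the dictionary and function names are illustrative assumptions, not code from the post):

```python
# Hypothetical mapping from category name to the integer label used in training
CATEGORIES = {
    "Environmental protection": 0,
    "Traffic": 1,
    "Mobile phone": 2,
    "Law": 3,
    "Car": 4,
}

def label_of(category_name):
    # Look up the numeric label for a manually assigned category
    return CATEGORIES[category_name]
```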
Classification lexicon
Feature extraction requires segmenting the text into words. Sogou's lexicon site pinyin.sogou.com/dict/ lets you search a variety of…
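As an illustrative sketch, the open-source Python segmenter jieba can load a custom lexicon for segmentation (the lexicon file path here is an assumption):

```python
import jieba

# Load a domain lexicon, e.g. one exported from a Sogou dictionary and
# converted to jieba's plain-text format (the path is hypothetical)
jieba.load_userdict("sogou_dict.txt")

# Segment a sentence into words for feature extraction
words = jieba.lcut("这是一条待分类的文本")
print(words)
```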
Commonly used algorithms
- Naive Bayes
- Rocchio
- SVM
- KNN
- Decision tree
- Neural network
SVM is chosen here. In essence, an SVM can be seen as a special two-layer neural network with an efficient learning algorithm.
Feature set
When classifying with an SVM, one important task is to determine the feature set; only once the feature set is fixed can the computation proceed. So how do we determine it? A common approach is to take all the words that appear in the samples as the feature set. For example, given the two texts "elementary school" and "stock crash", the feature set is {"elementary", "school", "stock", "crash"}.
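A minimal sketch of collecting such a feature set from segmented samples (function and variable names are illustrative):

```python
def build_vocabulary(segmented_samples):
    # segmented_samples: one list of word tokens per document
    vocab = set()
    for tokens in segmented_samples:
        vocab.update(tokens)
    # Sort so each word gets a stable vector dimension
    return sorted(vocab)

vocab = build_vocabulary([["elementary", "school"], ["stock", "crash"]])
print(vocab)  # ['crash', 'elementary', 'school', 'stock']
```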
Feature weighting
Determining the feature set fixes the dimensions of the vector; for each sample, we then need a value for every dimension, which can be regarded as the feature's weight, and TF-IDF is commonly used as this value. What is TF-IDF? Simply put, TF is the number of times a term occurs in a document, while IDF is the inverse document frequency, which can be calculated by the following formula:
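$$\mathrm{idf}(t) = \log \frac{N}{n_t}$$

where $N$ is the total number of documents and $n_t$ is the number of documents containing term $t$ (this is the standard definition; some variants smooth the denominator to $1 + n_t$). The TF-IDF weight of term $t$ in document $d$ is then $\mathrm{tf}(t, d) \times \mathrm{idf}(t)$. A minimal Python sketch of this computation (the function name is illustrative):

```python
import math

def tf_idf(term, doc_tokens, all_docs_tokens):
    # TF: raw count of the term in this document
    tf = doc_tokens.count(term)
    # IDF: log(total documents / documents containing the term);
    # assumes the term appears in at least one document
    n_containing = sum(1 for tokens in all_docs_tokens if term in tokens)
    return tf * math.log(len(all_docs_tokens) / n_containing)
```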
Feature dimension reduction
As the number of samples grows and each sample gets larger, the feature dimension can become very large, so you may want to reduce the dimensionality of the feature set. Feature reduction removes dimensions that have little impact, avoiding the curse of dimensionality. There are several approaches: define a stopword list and directly remove meaningless words; select representative words based on word frequency; or use another algorithm, such as the classic chi-square test, to pick a number of representative words. All of these achieve the dimensionality-reduction effect.
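As an illustrative sketch (not code from the post), scikit-learn's chi-square selector keeps the k highest-scoring features; the toy corpus and the value of k are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["elementary school", "stock crash", "school traffic"]  # toy corpus
labels = [0, 1, 2]                                             # toy category labels

# TF-IDF weight matrix: rows are documents, columns are vocabulary terms
X = TfidfVectorizer().fit_transform(docs)

# Keep only the k features with the highest chi-square score (k=2 for the toy data)
X_reduced = SelectKBest(chi2, k=2).fit_transform(X, labels)
print(X_reduced.shape)
```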
Code
There are many machine learning libraries; pick a well-known one you are familiar with. The key code is as follows:
```java
double[][] samples = ...; // feature weight vectors of all samples (e.g. TF-IDF values)
int[] labels = ...;       // integer category label of each sample

// Linear kernel, penalty factor C = 1.0, 12 classes, one-vs-all strategy
SVM<double[]> svm = new SVM<double[]>(new LinearKernel(), 1.0, 12, SVM.Multiclass.ONE_VS_ALL);
svm.learn(samples, labels);
svm.finish();

// Predict the category of a new sample x
int category = svm.predict(x);
```
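Reading the constructor arguments: the kernel function, the penalty factor C, the number of categories, and the multiclass strategy, where one-vs-all trains one binary classifier per category; predict then returns the integer label of the predicted category (this reading assumes the Smile machine-learning library's SVM API).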
Parameters
There are mainly two SVM parameters to choose: the kernel function and the penalty factor. Common kernels include the RBF kernel, linear kernel, polynomial kernel, and sigmoid kernel; for text classification, the linear kernel is generally chosen. The penalty factor penalizes misclassified samples: the larger it is, the more weight is placed on the loss. Making it large enough can eventually force all training samples to be classified correctly, but this risks over-fitting, which hurts generalization.
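As an illustrative sketch (not code from the post), the penalty factor can be tuned by cross-validation; here scikit-learn's LinearSVC stands in for the linear-kernel SVM, and the toy corpus and candidate C values are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Toy corpus and labels, just to make the sketch runnable
docs = ["cars and traffic", "phone screens", "traffic law", "new phone"] * 5
labels = [0, 1, 0, 1] * 5

X = TfidfVectorizer().fit_transform(docs)

# Try several penalty factors C with 5-fold cross-validation
search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, labels)
print(search.best_params_)  # the C with the best cross-validated accuracy
```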