A pseudo-literary boy; Python, R, and Java enthusiast; loves novelty; a simple, pure IT guy.
Blog: blog.csdn.net/striver6
The author of this article has joined the Python Chinese Community Columnist Program
Chapter one: Mining the hot song list of NetEase Cloud Music with Python
User Recommendation System Based on NetEase Cloud Music Reviews (Text Processing)
I. Recommendation of Similar Users (Python sklearn Version)
1.1 sklearn
Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
1.2 Text preprocessing
Import previously processed data:
import os
os.chdir('G:\\project\\netease cloud music review\\text mining')
import pandas as pd
data = pd.read_csv(r"DealtedData.csv", encoding='gbk', sep=',', index_col=0, header=0)
Word segmentation:
import jieba

def chinese_word_cut(mytext):
    return " ".join(jieba.cut(mytext))

d = pd.DataFrame(data['context'].astype(str))
d["title_cutted"] = d['context'].apply(chinese_word_cut)
d.title_cutted.head()
The analysis results are as follows:
1724358802    Listen to your beloved Danny every day
451250610     Yeah, you've never liked me.
...
Name: title_cutted, dtype: object
Then we vectorize these texts. Text vectorization here means building a 0-1 matrix of 28197 (the number of documents) × N (the number of distinct words in the text): if a particular word occurs in a document, the entry is 1, otherwise 0. Using every word would produce a very large matrix, so 1000 keywords can be selected from the words obtained in the previous step (see the sketch after the next code block).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
vectorizer = CountVectorizer()
count = vectorizer.fit_transform(d.title_cutted)
print(vectorizer.get_feature_names())
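The code above keeps the full vocabulary. To keep only the 1000 most frequent keywords mentioned earlier, CountVectorizer's max_features parameter can be used; a minimal sketch (vectorizer_1000 and count_1000 are illustrative names, not from the original):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_1000 = CountVectorizer(max_features=1000)        # keep only the 1000 most frequent terms
count_1000 = vectorizer_1000.fit_transform(d.title_cutted)  # matrix of shape (number of comments, 1000)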
View all text keywords
[... a long, truncated list of the segmented Chinese words in the vocabulary, e.g. 'Danny', 'World Cup', 'Mid-Autumn festival', 'lead singer', ...]
View all text keywords and their indices in the vocabulary
print(vectorizer.vocabulary_)
{... 'every day': 7759, 'Danny': 252, 'never': 1885, 'like': 3777, 'movie': 8799, 'happiness': 5221, ...}   (truncated word-to-index mapping)
View the results of the word frequency matrix:
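The call that printed the matrix is not shown above; a reasonable guess is:

print(count.toarray())   # dense view of the sparse count matrix (large: 20523 x 12111)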
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
The visible entries are all zeros; the matrix is very sparse. Look at the dimensions:
count.shape
(20523, 12111)
20,523 rows and 12,111 columns: that is, 20,523 documents (user comments) and 12,111 words in the corpus. Check its data type:
type(count)
scipy.sparse.csr.csr_matrix
So it is a sparse matrix (SciPy CSR format).
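To see which words actually occur in a given comment, the sparse row can be inspected directly; a minimal sketch (the variable names come from the code above, the inspection itself is an added example, not from the original):

import numpy as np

words = vectorizer.get_feature_names()           # vocabulary in column order
row = count[0].toarray().ravel()                 # counts for the first comment
print([words[i] for i in np.flatnonzero(row)])   # the words that appear in it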
1.3 Calculating the TF-IDF values
Compute a TF-IDF weight for each term in the CountVectorizer vocabulary. Here TfidfVectorizer is used, which is equivalent to CountVectorizer followed by TfidfTransformer (the two-step route is sketched after the output below):
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(d.title_cutted)
print(tfidf_matrix.toarray())
The TF-IDF array:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
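The equivalent two-step route mentioned above, re-weighting the count matrix from CountVectorizer with TfidfTransformer (a sketch, not the author's code):

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf_from_counts = transformer.fit_transform(count)   # same shape as count: (20523, 12111)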
The remaining steps are similar: use the corresponding functions to compute the similarity between the TF-IDF vectors of the comments and obtain similar users.
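One common way to do that (a sketch, not necessarily the author's exact approach) is cosine similarity on the TF-IDF matrix:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# similarity of every comment to every other comment; for 20,523 comments this
# full matrix is large, so in practice it can be computed in chunks
sim = cosine_similarity(tfidf_matrix)

# the five users/comments most similar to comment 0, excluding comment 0 itself
most_similar = np.argsort(-sim[0])[1:6]
print(most_similar, sim[0][most_similar])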
II. Recommendation of Similar Users (R Language Version)
2.1 Reading the data
Set the working directory and import the function package:
setwd("G:\\year one course\\regression analysis")
library(xml2, lib="G:\\R language\\R language learning\\install package")
library(Rcpp, lib="G:\\R language\\R language learning\\install package")
library(slam, lib="G:\\R language\\R language learning\\install package")
library(NLP, lib="G:\\R language\\R language learning\\install package")
library(tm, lib="G:\\R language\\R language learning\\install package")
install.packages("rJava", lib="G:\\R language\\R language learning\\install package")
library(rJava, lib="G:\\R language\\R language learning\\install package")
library(Rwordseg, lib="G:\\R language\\R language learning\\install package")
Import the data. Here "clipboard" means that huxiou.txt is opened with Notepad and its contents copied to the clipboard, which prevents garbled Chinese characters:
csv <- read.table("clipboard", header=T, stringsAsFactors=F, quote="", encoding="utf-8")
mystopwords <- unlist(read.table("StopWords.txt", stringsAsFactors=F, quote=""))
head(csv)
dim(csv)
colnames(csv) <- c("text")
2.2 Defining data preprocessing functions
Function to remove digits:
removeNumbers = function(x) { ret = gsub("[0-9０１２３４５６７８９]", "", x) }   # strip both ASCII and full-width digits
Word-segmentation function using segmentCN (from Rwordseg); rmmseg4j or rsmartcn can also be used for Chinese word segmentation:
wordsegment <- function(x) {
  library(Rwordseg)
  segmentCN(x)
}
Function to remove stop words:
removeStopWords = function(x, words) {
  ret = character(0)
  index <- 1
  it_max <- length(x)
  while (index <= it_max) {
    # keep the token only if it does not appear in the stop-word list
    if (length(words[words == x[index]]) < 1) ret <- c(ret, x[index])
    index <- index + 1
  }
  ret
}
2.3 Removing digits
sample.words <- lapply(csv[,1], removeNumbers)   # csv holds the raw comment text read above
dim(as.matrix(sample.words))
head(sample.words)
2.4 Chinese word segmentation
sample.words <- lapply(sample.words, wordsegment)
dim(as.matrix(sample.words))
sample.words[1:6]
2.5 Removing stop words
Do the Chinese word segmentation first and remove stop words afterwards; removing stop words from the raw strings would perform a global replacement and lose information.
sample.words <- lapply(sample.words, removeStopWords, mystopwords)
head(sample.words)
sample.words <- as.matrix(sample.words)   # convert the list to a one-column matrix
text <- sample.words[, 1]
colnames(sample.words) <- c("text")
write.csv(sample.words, "delateddata.csv")
2.6 Corpus construction
corpus = Corpus(VectorSource(sample.words))
# the next two lines assume csv also has a "type" column (a class/cluster label)
meta(corpus, "cluster") <- csv$type
unique_type <- unique(csv$type)
corpus
2.7 Establishing the document-term matrix
(sample.dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(2, Inf))))
The next step is to compute the TF-IDF values and pairwise similarity, find similar users, and complete the user recommendation.
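A minimal sketch of that last step, assuming the sample.dtm built above (the weightTfIdf re-weighting and the cosine-similarity algebra here are one possible way to finish, not the author's exact code):

tfidf.dtm <- weightTfIdf(sample.dtm)       # TF-IDF re-weighting of the document-term matrix (tm)
m <- as.matrix(tfidf.dtm)
norms <- sqrt(rowSums(m^2))
m <- m / pmax(norms, 1e-12)                # L2-normalise each comment vector (guard against all-zero rows)
sim <- m %*% t(m)                          # cosine similarity between every pair of user comments
# (for many comments, compute one row of sim at a time instead of the full matrix)
# the five comments most similar to comment 1, excluding itself
head(order(sim[1, ], decreasing = TRUE)[-1], 5)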