
A pseudo-literary boy; Python, R, and Java enthusiast; loves novelty; a simple, pure IT guy.

Blog: blog.csdn.net/striver6

The author of this article has joined the Python Chinese Community Columnist Program

Chapter One: Mining the hot song list of NetEase Cloud Music with Python

User Recommendation System Based on NetEase Cloud Music Reviews (Text Processing)

I. The Python sklearn Version

1.1 sklearn

Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

1.2 Text preprocessing

Import previously processed data:

import os
os.chdir('G:\\item\\netease cloud music review\\text mining')
import pandas as pd
data = pd.read_csv(r"DealtedData.csv", encoding='gbk', sep=',', index_col=0, header=0)

Word segmentation:

import jieba

def chinese_word_cut(mytext):
    return " ".join(jieba.cut(mytext))   # segment Chinese text and join the tokens with spaces

d = pd.DataFrame(data['context'].astype(str))
d["title_cutted"] = d['context'].apply(chinese_word_cut)
d.title_cutted.head()

The analysis results are as follows:

1724358802    Listen to your beloved Danny every day
451250610     Yeah, you've never liked me.
...
Name: title_cutted, dtype: object

Then we vectorize the text. Text vectorization here means building a document-term matrix of 28,197 (the number of documents) × N (the number of distinct words in the text), where each cell records how many times a particular word occurs in that document (0 if it does not appear). Keeping every word would produce a very large matrix, so the vocabulary can be limited to, say, the 1,000 most frequent keywords from the words produced in the previous step; a sketch of this follows the code below.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

vectorizer = CountVectorizer()
count = vectorizer.fit_transform(d.title_cutted)
print(vectorizer.get_feature_names())   # get_feature_names_out() in newer scikit-learn versions
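The call above keeps the entire vocabulary. To cap it at the 1,000 most frequent keywords as described earlier, CountVectorizer takes a max_features argument; here is a minimal sketch (the variable names are illustrative, not from the original code):

# Hypothetical variant: keep only the 1,000 most frequent terms.
vectorizer_1000 = CountVectorizer(max_features=1000)
count_1000 = vectorizer_1000.fit_transform(d.title_cutted)
print(count_1000.shape)  # (number of documents, 1000)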

View all text keywords:

[..., 'not careful', 'unintentionally', 'unhappy', 'not to be outdone', 'before you know it', 'shameless', 'unreasonable', ..., 'the ugly duckling', 'professional', 'album', 'the world', 'World Cup', 'world view', 'Tokyo', 'Southeast Asia', 'Oriental', ..., 'two or three times', 'two people', 'two years', 'two months', ..., 'Chinese medicine', 'Mid-Autumn Festival', 'central air conditioning', ..., 'Danny', 'why', 'for you', 'win honor for our country', 'for Luo Qiqi', ..., 'director', 'lead singer', 'host', 'mainstream', ...]

View all text keywords and their positions (column indices) in the vocabulary:

print(vectorizer.vocabulary_)
{..., 'story': 6694, 'every day': 7759, 'love': 10103, 'Danny': 252, 'never': 1885, 'like': 3777, 'kerosene': 8436, 'your': 7880, 'glove': 5636, 'tender': 8233, 'recently': 7194, 'song': 11920, 'today': 1863, 'classics': 8383, 'why': 2226, 'we': 6052, 'miss': 6630, 'movie': 8799, 'romantic': 8115, 'radio': 6650, 'happy': 4462, 'Ferrari': 7987, 'goodbye': 2488, ...}

View the results of the word frequency matrix:
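The original does not show the command that produced the dense view below; presumably it was something along these lines:

print(count.toarray())  # convert the sparse count matrix to a dense array for display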

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

The matrix is dominated by zeros. Look at its dimensions:

count.shape
(20523, 12111)

20,523 rows and 12,111 columns: 20,523 documents (user comments) and 12,111 words in the corpus. Check its data type:

type(count)
scipy.sparse.csr.csr_matrix

So it is stored as a sparse matrix, which keeps only the non-zero entries; a short sketch of how to check its density follows.
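To quantify just how sparse it is, compare the number of stored non-zero entries with the full size of the matrix. A minimal sketch (not part of the original post), using the standard nnz attribute of a SciPy sparse matrix:

nonzero = count.nnz                                     # stored non-zero entries
density = nonzero / (count.shape[0] * count.shape[1])   # fraction of non-zero cells
print(f"non-zeros: {nonzero}, density: {density:.4%}")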

1.3 Calculating the TF-IDF values

TF-IDF weights can be computed for each term either by feeding the CountVectorizer output to TfidfTransformer, or in one step with TfidfVectorizer, which is what the code below does (a two-step sketch with TfidfTransformer follows the output):

tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(d.title_cutted)
print(tfidf_matrix.toarray())

The TF-IDF array:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
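For reference, the two-step route mentioned above applies TfidfTransformer to the count matrix produced earlier by CountVectorizer. A minimal sketch (not in the original post); with default settings it should yield the same weights as TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf_from_counts = transformer.fit_transform(count)  # reuse the CountVectorizer output
print(tfidf_from_counts.shape)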

The remaining steps are similar to the above: use the corresponding functions to compute similarity between users and make recommendations, as sketched below.
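A minimal sketch of that final step (not from the original post), assuming each row of tfidf_matrix corresponds to one user's comments and using scikit-learn's cosine_similarity:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

sim = cosine_similarity(tfidf_matrix)  # (n_users, n_users) pairwise similarity
np.fill_diagonal(sim, 0)               # ignore each user's similarity with itself
most_similar = sim.argmax(axis=1)      # index of the most similar other user per row
print(most_similar[:10])

Users whose comment vectors are most similar to each other can then be recommended to one another.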

II. Recommendation of Similar Users (R Language Version)

2.1 Reading the data

Set the working directory and load the required packages:

1. Setwd (" G: \ \ \ \ \ \ year one course regression analysis ") 2. The library (xml2, lib = "G: \ \ \ \ \ R \ R language language learning the installation package") 3. The library (Rcpp, lib = "G: \ \ \ \ \ R \ R language language learning the installation package") 4. The library (slam, lib = "G: \ \ \ \ \ R \ R language language learning the installation package") 5. The library (NLP, lib = "G: \ \ \ \ \ R \ R language language learning the installation package") Library (tm,lib="G:\\R language \\R language learning \\ install package ") 7. Package ("rJava",lib="G:\\R \\R \ learning \\ install package ") 9. Library (rJava,lib="G:\\R \ learning \R \ install package ") 11. Library (Rwordseg,lib="G:\\R language \\R language learning \\ install package ")Copy the code

Import the data. Here "clipboard" means that huxiou.txt is opened with Notepad and its contents copied to the clipboard, which prevents garbled Chinese characters.

csv <- read.table("clipboard", header=T, stringsAsFactors=F, quote="", encoding="utf-8")
mystopwords <- unlist(read.table("StopWords.txt", stringsAsFactors=F, quote=""))
head(csv)
dim(csv)
colnames(csv) <- c("text")

2.2 Defining data preprocessing functions

A function to remove digits:

removeNumbers = function(x) { ret = gsub("[0-9０１２３４５６７８９]", "", x) }   # strip both ASCII and full-width digits

A word segmentation function based on segmentCN (from the Rwordseg package); rmmseg4j and RSmartCN can also be used for Chinese word segmentation:

wordsegment <- function(x) {
  library(Rwordseg)
  segmentCN(x)
}

A function to remove stop words:

removeStopWords = function(x, words) {
  ret = character(0)
  index <- 1
  it_max <- length(x)
  while (index <= it_max) {
    # keep the token only if it does not appear in the stop word list
    if (length(words[words == x[index]]) < 1) ret <- c(ret, x[index])
    index <- index + 1
  }
  ret
}

2.3 Removing digits

sample.words <- lapply(csv[,1], removeNumbers)   # csv is the data frame read in above
dim(as.matrix(sample.words))
head(sample.words)

2.4 Chinese word segmentation

sample.words <- lapply(sample.words, wordsegment)
dim(as.matrix(sample.words))
sample.words[1:6]

2.5 Removing stop words

Do the Chinese word segmentation first and remove the stop words afterwards, to prevent a global replacement from losing information.

sample.words <- lapply(sample.words, removeStopWords, mystopwords)
head(sample.words)
# collapse each comment's tokens back into a single space-separated string
sample.words <- as.matrix(sapply(sample.words, paste, collapse = " "))
colnames(sample.words) <- c("text")
write.csv(sample.words, "delateddata.csv")

2.6 Corpus construction

corpus = Corpus(VectorSource(sample.words))
meta(corpus, "cluster") <- csv$type   # assumes the original data has a type/label column
unique_type <- unique(csv$type)
corpus

2.7 Building the document-term matrix

(sample.dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(2, Inf))))

The next step is to compute the TF-IDF values and similarities, obtain similar users, and complete the user recommendation, mirroring the Python steps above.
