Last year, a well-known article made the rounds on the Internet: "I analyzed 420,000 words of lyrics to find out what folk singers are singing about." Its author was my college roommate. Soon afterwards, all sorts of "I analyzed XXX" articles appeared online. That got me thinking: could I do something with crawlers too? Songs seemed like a natural entry point, and the choice was easy. Yes, Jay Chou: the life and growth of our generation has always been accompanied by Jay's voice. So I decided to crawl Jay's songs, lyrics, comments, and every other useful piece of information, and then visualize it all.

This article is suitable for Python beginners. I am new to Python myself, so many of the statements may be longer than necessary, and there may even be undiscovered bugs; we will eliminate them gradually as we keep learning. Here is a summary of what I will use. This article uses two interpreted languages, Python and bash (shell); I will explain later why shell is needed. The modules used are requests, re, os, jieba, glob, json, lxml, pyecharts, heapq, and collections. Seeing so many modules may give you a headache, and I did not expect to use this many at first either, but they appeared naturally as the program progressed. We do not need to know everything about each module, but we do need to master how each is used. So without further ado, let's get to the point.

I. Find the content to crawl, analyze the web page, and capture packets to inspect the interaction

First, let's go to the page we need to scrape: music.163.com, the front page of NetEase Cloud Music. Our goal is to capture all of Jay Chou's songs, lyrics, and comments. Searching for "Jay Chou" returns only 50 songs (many people analyze only the TOP 50 when studying NetEase Cloud Music), but we want all of them, so this URL does not meet our needs and we have to keep looking. I spent a lot of time here and finally settled on an indirect method: first grab all of Jay's album information, then use the albums to find all the songs (I have not found a way to get every song name on NetEase Cloud directly). With the plan set, the first step is to grab all the albums. Open music.163.com/#/artist/album?id=6452, which lists all of Jay Chou's albums. Clicking to the next page, we see the URL become http://music.163.com/#/artist/album?id=6452&limit=12&offset=12, so anyone with a little HTML background can guess that limit=12 is the number of albums per page and offset is the starting position. Now let's get the albums. Enter http://music.163.com/#/artist/album?id=6452&limit=100&offset=0 in the address bar (raising the limit to 100 lets us finish in a single request instead of paging), open Chrome's developer tools (F12), and inspect the network traffic. We find the following:
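Before writing the real crawler, it may help to see the paging mechanics in isolation. Below is a minimal, hypothetical sketch of walking the album list with those limit/offset parameters; note that the server answers without the "#/" fragment, which is also what the final code relies on.

    import requests

    def album_pages(artist_id=6452, page_size=12, pages=3):
        # Hypothetical helper: yield the raw HTML of each album-list page
        base = "http://music.163.com/artist/album"
        for page in range(pages):
            params = {"id": artist_id, "limit": page_size, "offset": page * page_size}
            resp = requests.get(base, params=params,
                                headers={"User-Agent": "Mozilla/5.0"})
            yield resp.text

    for page_html in album_pages():
        print(len(page_html))  # just proving each page came back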

Yes, you read that right: this is exactly the information we want. That makes things simple, because we do not need a heavyweight tool like Selenium to render the whole page (although if we had not figured out how to grab the songs, I suspect we would have needed it). Now let's look at the headers. We do not have to worry about the query string, since it is already part of our URL; what matters is the Request Headers, which are what we send to the server, and in return the server sends the page data back. Next, we forge a browser and send the request. The specific code is as follows:

import requests
import re
import os
from lxml import etree

def GetAlbum(self):
    # limit=100 grabs every album in one request instead of paging
    urls = "http://music.163.com/artist/album?id=6452&limit=100&offset=0"
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Cookie': '...',  # session cookie copied from the browser's Request Headers (omitted here)
        'Host': 'music.163.com',
        'Referer': 'https://music.163.com/',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    html = requests.get(urls, headers=headers)
    html1 = etree.HTML(html.text)
    html_data = html1.xpath('//div[@class="u-cover u-cover-alb3"]')[0]  # sanity check: the album covers are there
    # The album title sits in the div's title attribute
    pattern = re.compile(r'<div class="u-cover u-cover-alb3" title=(.*?)>')
    items = re.findall(pattern, html.text)
    cal = 0
    if os.path.exists("album_info.txt"):
        os.remove("album_info.txt")  # start fresh on every run
    for i in items:
        cal += 1
        p = i.replace('"', '')  # the captured title comes back wrapped in quotes
        # Find the album id in the link that carries this exact title
        pattern1 = re.compile(r'<a href="/album\?id=(.*?)" class="tit s-fc0">%s</a>' % (p))
        id1 = re.findall(pattern1, html.text)
        # print("Album name: %s!! Album id: %s" % (i, id1))
        with open("album_info.txt", 'a') as f:
            f.write("Album name: %s!! Album id: %s\n" % (i, id1))
        self.GetLyric1(i, id1)  # hand each album over to the song grabber (Part II)
    print("%d albums in total" % cal)
    print("Got the albums and album ids successfully!!")


XPath is used to locate the data in the corresponding tags. The code itself is not what matters; the idea is (although the code above can be run on its own).
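If lxml is new to you, here is a minimal, self-contained sketch of the XPath pattern the code relies on; the HTML fragment (and the id inside it) is made up for illustration:

    from lxml import etree

    # A made-up fragment shaped like NetEase's album markup
    html = etree.HTML('<div class="u-cover u-cover-alb3" title="Fantasy">'
                      '<a href="/album?id=12345" class="tit s-fc0">Fantasy</a></div>')

    # Select every div whose class attribute matches exactly
    for div in html.xpath('//div[@class="u-cover u-cover-alb3"]'):
        print(div.get('title'))          # -> Fantasy
        print(div.xpath('.//a/@href'))   # -> ['/album?id=12345']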

The result is as follows


II. Grab the song information

With the album information captured above, we can now use the albums to get the song information.

Look at this picture and I think you will get the idea: the page URL has the form http://music.163.com/#/album?id= followed by the album id. So we plug in each album id, find the response that contains all the songs in the Network panel, and then examine its headers. In the same way as before, we forge the request headers to get the song information!! Straight to the code:

def GetLyric1(self, album, id1):
    # Build the album page URL from the id captured in GetAlbum
    urls1 = "http://music.163.com/#/album?id="
    urls2 = str(id1)
    urls3 = urls1 + urls2
    # id1 arrives as a findall() list, so strip the brackets/quotes,
    # and drop the "#/" fragment because the server-side page lives without it
    urls = urls3.replace("[", "").replace("]", "").replace("'", "").replace("#/", "")
    headers = {
        'Cookie': '...',  # session cookie copied from the browser (omitted here)
        'Host': 'music.163.com',
        'Referer': 'http://music.163.com/',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
    }
    html = requests.get(urls, headers=headers)
    html1 = etree.HTML(html.text)
    # The album page keeps its full track list in a hidden <ul class="f-hide">
    html_data = html1.xpath('//ul[@class="f-hide"]//a')
    for i in html_data:
        html_data1 = i.xpath('string(.)')  # the song title
        html_data2 = str(html_data1)
        # Grab the numeric id from the link that carries this exact title,
        # e.g. http://music.163.com/#/song?id=185617
        pattern1 = re.compile(r'<li><a href="/song\?id=(\d+?)">%s</a></li>' % (html_data2))
        items = re.findall(pattern1, html.text)
        print(len(items))
        if len(items) > 0:
            with open("song_info.txt", 'a') as f:
                f.write("Song name: %s!! Song id: %s\n" % (html_data2, items))
        print("Song name: %s!! Song id: %s" % (html_data2, items))

def GetLyric2(self):
    import json
    import glob
    # Remove comment files left over from a previous run
    for i in glob.glob("*HotComment*"):
        os.remove(i)
    list_of_line = []
    for i in glob.glob("*song_info*"):
        with open(i, 'r') as file_object:
            list_of_line = file_object.readlines()
    aaa = 1          # how many lyrics we fetched
    namelist = ""    # songs whose lyrics could not be fetched
    for i in list_of_line:
        # Each line looks like: Song name: xxx!! Song id: ['186020']
        pattern1 = re.compile(r'Song name: (.*?)!! Song id')
        pattern2 = re.compile(r'Song id: \[(.*?)\]')
        items1 = str(re.findall(pattern1, i)).replace("[", "").replace("]", "").replace("'", "")
        items2 = str(re.findall(pattern2, i)).replace("[", "").replace("]", "").replace('"', "").replace("'", "")
        # The public lyric API only needs a plausible User-Agent
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
        }
        # e.g. http://music.163.com/api/song/lyric?id=186017&lv=1&kv=1&tv=-1
        urls = "http://music.163.com/api/song/lyric?" + "id=" + str(items2) + '&lv=1&kv=1&tv=-1'
        html = requests.get(urls, headers=headers)
        j = json.loads(html.text)
        try:
            lrc = j['lrc']['lyric']
            pat = re.compile(r'\[.*\]')
            lrc = re.sub(pat, "", lrc)  # strip the [mm:ss.xx] timestamps
            lrc = lrc.strip()
            print(lrc)
            with open("SongName-" + items1 + ".txt", 'w', encoding='utf-8') as f:
                f.write(str(lrc))
            aaa += 1
            self.GetCmmons(items1, items2)  # grab this song's hot comments while we are here
        except Exception:
            print("Error on %s!!" % (items1))
            namelist = namelist + items1 + " "
    print("%s songs in total" % (aaa))
    print(namelist)
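One detail worth calling out: the lyric API returns text in LRC format, where every line starts with a [mm:ss.xx] timestamp, and the re.sub(r'\[.*\]', '', lrc) step above removes them. A tiny illustration with made-up lines:

    import re

    lrc = "[00:01.00]first line of the lyric\n[00:05.25]second line of the lyric"
    print(re.sub(r'\[.*\]', '', lrc).strip())
    # first line of the lyric
    # second line of the lyric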


Note on the code above: XPath pulls out the information we need, and a regular expression extracts the id (there are many other ways to do this).
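As a minimal illustration of that regex step (the song title below is made up; the id is the one from the example URL in the code comment):

    import re

    # A made-up fragment shaped like the album page's hidden song list
    snippet = '<li><a href="/song?id=185617">SomeSong</a></li>'

    # Capture the numeric id that sits next to a known song title
    pattern = re.compile(r'<li><a href="/song\?id=(\d+?)">%s</a></li>' % 'SomeSong')
    print(re.findall(pattern, snippet))  # -> ['185617']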

The results are as follows.


Same method!! We open a song page and analyze the Network panel in exactly the same way to find the information we need: the lyrics and the comments!! Straight to the code:

def GetCmmons(self, name, id):
    import json
    self.name = name
    self.id = id
    # The public comment API addresses a song as R_SO_4_<song id>
    urls = "http://music.163.com/api/v1/resource/comments/R_SO_4_" + str(id)
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Cookie': '...',  # session cookie copied from the browser (omitted here)
        'Host': 'music.163.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
    }
    html = requests.get(urls, headers=headers)
    html.encoding = 'utf8'
    j = json.loads(html.text)
    # hotComments is the list of highly-liked comments shown at the top of a song page
    for uu in j['hotComments']:
        username = uu["user"]['nickname']
        likedCount1 = str(uu['likedCount'])
        comments = uu['content']
        with open(name + "HotComment" + ".txt", 'a+', encoding='utf8') as f:
            f.write("Username: " + username + "\n")
            f.write("Comment: " + comments + "\n")
            f.write("Likes: " + likedCount1 + "\n")
            f.write("--------------------" + "\n")


The comment API returns JSON, so we use the json module to pick out exactly the fields we need (at the very least, it is faster than regex).
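For reference, here is a minimal sketch of that parsing step; the JSON is a made-up stand-in with the same shape the code expects (hotComments / user / nickname / likedCount / content):

    import json

    raw = '''{"hotComments": [
      {"user": {"nickname": "fan_01"},
       "likedCount": 12345,
       "content": "Listening to this song again after ten years."}
    ]}'''

    j = json.loads(raw)
    for c in j['hotComments']:
        print(c['user']['nickname'], c['likedCount'], c['content'])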

The results are as follows:


At this point, we have finished all of the data crawling!!

III. Data analysis and visualization

Data that goes unused is like a blank sheet of paper, so next we give it a thorough analysis.

The first step is to merge the data into a single file.

def MergedFile(self):
    aaa = 0
    # Credit lines are noise for lyric analysis, so skip any line containing them
    credits = ["Lyricist", "Composer", "Mixing assistant", "Mixing engineer",
               "Recording engineer", "Executive producer", "Recording engineering",
               "Recording studio", "Programmer", "Harmony", "Guitar",
               "Recording assistant"]
    for i in glob.glob("*SongName*"):
        with open(i, 'r', encoding='utf-8') as file_object:
            list_of_line = file_object.readlines()
        for p in list_of_line:
            if p == "\n" or any(k in p for k in credits):
                print(p)  # the dropped line
            else:
                with open("allLyric" + ".txt", "a", encoding='utf-8') as f:
                    f.write(p)
                    f.write("\n")
        aaa += 1
    print(aaa)
    # Second pass: squeeze out the blank lines left behind
    file1 = open('allLyric.txt', 'r', encoding='utf-8')
    file2 = open('allLyric1.txt', 'w', encoding='utf-8')
    try:
        for line in file1.readlines():
            if line == '\n':
                line = line.strip("\n")
            file2.write(line)
    finally:
        file1.close()
        file2.close()
    print("Merging files done")


Note on the code above: while merging the data, we can selectively drop useless lines, such as the lyricist, composer, and mixing credits.

The results are as follows

Next up is the sentiment analysis of Jay Chou's songs.

def EmotionAnalysis(self):
    from snownlp import SnowNLP
    from pyecharts import Bar
    import heapq
    xzhou = []  # song names (x axis)
    yzhou = []  # average sentiment per song (y axis)
    # Skip the same credit lines as when merging
    credits = ["Lyricist", "Composer", "Drums", "Mixing engineer",
               "Recording engineer", "Executive producer", "Arranger",
               "Producer", "Recording engineering", "Programmer"]
    for i in glob.glob("*SongName*"):
        count = 0
        allsen = 0
        with open(i, 'r', encoding='utf-8') as fileHandel:
            fileList = fileHandel.readlines()
            for p in fileList:
                if p == "\n" or any(k in p for k in credits):
                    pass
                else:
                    s = SnowNLP(p)
                    s1 = SnowNLP(s.sentences[0])
                    count += 1
                    allsen += s1.sentiments  # a float in [0, 1]; closer to 1 = more positive
        i = str(i)
        xzhou1 = i.split("-", 1)[1].split(".", 1)[0]  # recover the title from "SongName-xxx.txt"
        xzhou.append(xzhou1)
        avg = allsen / count  # average sentiment of this song
        yzhou.append(avg)
        # print("Average sentiment of %s is %s" % (i, avg))
    # Chart 1: sentiment of every song
    bar = Bar("Sentiment of Jay Chou's songs")
    bar.add("sentiment", xzhou, yzhou, is_stack=True, xaxis_interval=0)
    bar.render(r"D:\learn\untitled4\allpicture\AllSongsSentiment.html")
    # Chart 2: the 10 most positive songs
    yzhou1 = heapq.nlargest(10, yzhou)
    temp = list(map(yzhou.index, heapq.nlargest(10, yzhou)))
    xzhou1 = [xzhou[i] for i in temp]
    bar = Bar("Top 10 most positive Jay Chou songs")
    bar.add("sentiment", xzhou1, yzhou1, is_stack=True, xaxis_interval=0)
    bar.render(r"D:\learn\untitled4\allpicture\Top10Positive.html")
    # Chart 3: the 10 most negative songs
    yzhou1 = heapq.nsmallest(10, yzhou)
    temp = list(map(yzhou.index, heapq.nsmallest(10, yzhou)))
    xzhou1 = [xzhou[i] for i in temp]
    bar = Bar("Top 10 most negative Jay Chou songs")
    bar.add("sentiment", xzhou1, yzhou1, xaxis_interval=0)
    bar.render(r"D:\learn\untitled4\allpicture\Top10Negative.html")
    print(xzhou1)
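A note on the scale: SnowNLP's sentiments property returns a float between 0 (negative) and 1 (positive), which is why averaging it per song yields a mood score. A minimal sketch with an example sentence:

    from snownlp import SnowNLP

    s = SnowNLP(u'这首歌真好听')  # "This song sounds great"
    print(s.sentiments)           # a float in [0, 1]; closer to 1 means more positive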


Next, we complete the word-frequency analysis of the lyrics.

def splitSentence(self, inputFile, outputFile):
    # Run jieba keyword extraction over every line and write the tokens out
    fin = open(inputFile, 'r', encoding='utf-8')
    fout = open(outputFile, 'w', encoding='utf-8')
    for line in fin:
        line = line.strip()
        line = jieba.analyse.extract_tags(line)  # the TF-IDF keywords of this line
        outstr = " ".join(line)
        fout.write(outstr + '\n')
    fin.close()
    fout.close()
    # Count word frequencies over the segmented file
    f = open(outputFile, 'r', encoding='utf-8')
    a = f.read().split()
    b = sorted([(x, a.count(x)) for x in set(a)], key=lambda x: x[1], reverse=True)
    print(b)

def LyricAnalysis(self):
    import jieba
    import jieba.analyse
    file = 'allLyric1.txt'
    # Read every lyric line into one string
    alllyric = str([line.strip() for line in open(file, encoding='utf-8').readlines()])
    # Strip punctuation so it does not pollute the token counts
    alllyric1 = alllyric.replace("'", "").replace(" ", "").replace("?", "") \
        .replace(",", "").replace('"', '').replace("？", "") \
        .replace(".", "").replace("!", "").replace(":", "")
    # Load a stop-word list so filler words are ignored by jieba
    jieba.analyse.set_stop_words("ting.txt")
    self.splitSentence('allLyric1.txt', 'fenci.txt')
    # Single-character frequency count
    import collections
    f = open("fenci.txt", 'r', encoding='utf8')
    txt1 = f.read()
    txt1 = txt1.replace('\n', '')   # drop newlines
    txt1 = txt1.replace(' ', '')    # drop spaces
    txt1 = txt1.replace('，', '')   # drop full-width commas
    txt1 = txt1.replace('。', '')   # drop full-width periods
    mylist = list(txt1)
    mycount = collections.Counter(mylist)
    for key, val in mycount.most_common(10):  # the 10 most frequent characters
        print(key, val)
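If jieba is unfamiliar, here is a minimal sketch of the keyword-extraction call used above; extract_tags returns the highest-weighted words of a sentence by TF-IDF (the printed result is indicative, since it depends on jieba's dictionary):

    import jieba.analyse

    line = u'窗外的麻雀在电线杆上多嘴'  # a famous Jay Chou lyric line
    print(jieba.analyse.extract_tags(line, topK=5))
    # e.g. ['麻雀', '电线杆', '多嘴', '窗外']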


All right!! The exact syntax here is really not that important; there are plenty of ways to experiment!!

Let’s take a look at the results