Let's first review yesterday's article: 1. …; 2. download the web page; 3. set the loading speed and locate the target file. Now, open the page of a song on NetEase Cloud Music:
https://music.163.com/#/song?id=25723157
Each song's ID is the number after id= in the URL. So in principle, as long as we collect the play-page URL of each song, we can extract its ID number. That makes it easy to loop over multiple songs and crawl all of their comments.
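As a minimal sketch (the song URLs below are only placeholders for whatever list you collect), extracting the ID and looping over several songs could look like this; the same split is used inside get_comments further down:

```python
# A minimal sketch: pull the numeric ID out of each play-page URL.
song_urls = [
    "https://music.163.com/#/song?id=25723157",
    # "https://music.163.com/#/song?id=...",  # more songs collected the same way
]

for url in song_urls:
    song_id = url.split('=')[1]  # the ID is everything after "id="
    print(song_id)               # e.g. 25723157
```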
```python
import requests


def get_comments(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36',
        'referer': 'http://music.163.com/'
    }
    # Encrypted form data captured from the browser's POST request to the comment API.
    params = "EuIF/+GM1OWmp2iaIwbVdYDaqODiubPSBToe5EdNp6LHTLf+aID/dWGU6bHWXS0jD9pPa/oY67TOwiicLygJ+BhMkOX/J1tZMhq45dcUIr6fLuoHOECYrOU6ySwH4CjxxdbW3lpVmksGEdlxbZevVPkTPkwvjNLDZHK238OuNCy0Csma04SXfoVM3iLhaFBT"
    encSecKey = "db26c32e0cd08a11930639deadefda2783c81034be6445ca8f4fbedd346e1f9567375083aeb1a85e6ad6d9ae4532a49752c2169db8bcc04d38a79f9bed7facea42ee23f1b33538c34f82741318d9b4b846663b53b0b808dd0499dccfbc6c61fbf180c6fb24b1c2dd3c2c450ce09917d74be9424dab836fd2e671988ffbc6ae1b"
    data = {
        "params": params,
        "encSecKey": encSecKey
    }
    name_id = url.split('=')[1]  # the song ID is everything after "id="
    target_url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token=".format(name_id)
    res = requests.post(target_url, headers=headers, data=data)
    return res
```
params and encSecKey are the form data that the server expects, and both are encrypted. They are produced by the site's core.js file; since that file is very large, we can search it for the keyword encSecKey to locate the relevant code segment. Conveniently, the same two parameters captured from one POST request can also be reused for other songs.
Once the comments are fetched, we need to extract the key data. As you can see, the response is in JSON format, a lightweight data-interchange format that is widely used in network transmission.
The json.loads() method converts a JSON string back into a Python data structure:
```python
import json

comments_json = json.loads(res.text)
```
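To see what came back, you can inspect the top-level keys of the parsed dictionary. A quick check might look like the sketch below; it assumes res is the response returned by get_comments above, and the exact key set depends on the API's response:

```python
# Inspect the parsed response (assumes `res` comes from get_comments above).
comments_json = json.loads(res.text)
print(comments_json.keys())         # the hot comments live under the 'hotComments' key
print(comments_json.get('total'))   # total comment count, if the API reports one
```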
The value under the "hotComments" key of this dictionary is therefore the list of all the hot comments!
```python
def get_hot_comments(res):
    comments_json = json.loads(res.text)
    hot_comments = comments_json['hotComments']
    # Write each hot comment as "nickname:" followed by the comment text and a separator line.
    with open('hot_comments.txt', 'w', encoding='utf-8') as file:
        for each in hot_comments:
            file.write(each['user']['nickname'] + ':\n\n')
            file.write(each['content'] + '\n')
            file.write("---------------------------------------\n")
```
At this point, we can combine the previous parts and get the desired effect!
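As a small driver, combining the two functions defined above could look like this (the entry URL is the song page from earlier):

```python
def main():
    url = "https://music.163.com/#/song?id=25723157"
    res = get_comments(url)    # POST the captured params/encSecKey and fetch the comment JSON
    get_hot_comments(res)      # extract hotComments and write them to hot_comments.txt


if __name__ == "__main__":
    main()
```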
NetEase Cloud Music also exposes other API endpoints of the form: music.163.com/api/v1/reso…
Note that if the crawler sends requests too quickly or in too large a volume, the server will block your IP. A more complete crawler project therefore needs to set up a proxy/IP pool, as sketched below.
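Here is a minimal sketch of rotating proxies and throttling requests (the proxy addresses are placeholders, not working proxies; in practice they would come from your own pool or a provider):

```python
import random
import time

import requests

# Placeholder proxy pool -- replace with real proxies from your own pool or provider.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


def polite_post(url, **kwargs):
    """POST through a randomly chosen proxy, with a random delay to limit request rate."""
    proxy = random.choice(PROXY_POOL)
    time.sleep(random.uniform(1, 3))  # wait 1-3 seconds between requests
    return requests.post(url, proxies={"http": proxy, "https": proxy}, timeout=10, **kwargs)
```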
In addition, if you don't want to get stuck cracking the POST form parameters, you can try Python + Selenium + PhantomJS to simulate user actions: click through the page and then parse the rendered page elements directly. With this approach, whatever is visible can be crawled, but it is somewhat less efficient.
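A minimal sketch of that approach follows. Recent Selenium releases no longer support PhantomJS, so headless Chrome stands in for it here; the iframe name and the CSS selector are assumptions you would need to confirm in the page's developer tools:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Headless Chrome stands in for PhantomJS (dropped by recent Selenium versions).
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(10)  # give dynamically loaded content time to appear

driver.get("https://music.163.com/#/song?id=25723157")
# The song page loads its content inside an iframe; the frame name "contentFrame" and the
# ".itm .cnt" selector for comment text are assumptions to verify in the browser's dev tools.
driver.switch_to.frame("contentFrame")
for item in driver.find_elements(By.CSS_SELECTOR, ".itm .cnt"):
    print(item.text)

driver.quit()
```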