This is the 9th day of my participation in the August More Text Challenge.
Background: Why crawl MP3s? Because downloads are paywalled and you don't want to buy a membership? Or because streaming them eats too much data? From a technical point of view, if those problems can be solved with code, doing it any other way is just for show!
First, crawlers have become commonplace in the Internet era, and I won't get into the ethics of it up front! For example, some middleman companies make their profit from scraped data. Let me state first: as long as no damage is done, crawling something should not be illegal, but the data also must not be put to other uses that harm the "owner's" interests. And wherever there are crawlers, there are also anti-crawlers. As we know, a crawler pretends to be a normal client and fires off a pile of requests to obtain the various data the server returns. If the server does not want to allow crawlers, it has to consider blocking those unwelcome requests. Here's how that can be done:
- By IP: if I am proxying through Nginx and I see the same IP frequently hitting a certain interface, I will assume it is malicious and may redirect that IP to another page; of course, these rules can also be implemented entirely at the code level;
- Rate limiting: nginx's `limit_req` / `limit_req_zone` directives;
- By request header: `user-agent` identifies the client, and some tools used for crawling carry their own special identifiers, such as JMeter, python-requests, or selenium; so inside an nginx `location` block you can match them with something like `if ($http_user_agent ~* (jmeter|python-requests|selenium))` and redirect them to another page;
- If the crawler works by reading the DOM elements of the HTML page, the corresponding element structure of the HTML can be changed to break the crawler, or force it to be redeveloped, raising its development cost;
- Of course, there are other request-header checks that can be used to screen out crawler traffic.
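The nginx-side measures above might be sketched roughly as follows; the zone name, rate, burst values, and redirect target are illustrative assumptions, not taken from the original post:

```nginx
# Throttle each client IP: at most 10 requests per second per IP,
# with a small burst allowance before further requests are rejected.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

server {
    listen 80;

    location /api/ {
        limit_req zone=per_ip burst=20 nodelay;

        # Redirect requests whose user-agent reveals a known crawler tool.
        if ($http_user_agent ~* (jmeter|python-requests|selenium)) {
            return 302 /blocked.html;
        }
    }
}
```

The same checks can also live in application code, but doing them at the proxy keeps unwanted traffic off the backend entirely.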
Second, since we now know how anti-crawlers work, for those who like to dig into the technology, turning that knowledge around to defeat the defenses follows naturally.
Construct the request header:
- Get an IP pool, or build your own IP library for the crawler to use: www.xicidaili.com/nn/ (though this is by now mostly a pool of dead IP addresses);
- You need to build your own IP address pool and monitor whether its IP addresses are still usable;
- `user-agent`: Python's `fake_useragent` library can supply the user-agent:
```python
from fake_useragent import UserAgent
ua = UserAgent()  # ua.random returns a random browser user-agent string
```
- `Referer`, if necessary: this header says where the request came from, so a server can insist that requests for its resources are only allowed from that place;
- And of course there are browser cookies, which are also fairly important to a crawler.
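The IP-pool monitoring mentioned in the list above can be sketched like this. This is a minimal sketch, not the original author's code; the probe URL and timeout are assumptions:

```python
import requests

def alive_proxies(proxies, test_url="https://music.163.com", timeout=3):
    """Return the subset of proxy addresses that can still complete a request."""
    good = []
    for proxy in proxies:
        try:
            # Route a probe request through the proxy; any network error
            # (timeout, refused connection, dead proxy) disqualifies it.
            requests.get(test_url, proxies={"http": proxy, "https": proxy},
                         timeout=timeout)
            good.append(proxy)
        except requests.RequestException:
            pass
    return good
```

Run periodically, this keeps the pool pruned down to proxies that are actually usable.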
Third, now that we understand crawlers and anti-crawlers, let's combine them with the example in the title and download MP3s from the 163 (NetEase Cloud) music site, where under normal circumstances the resources either cost money or require payment.
Step 1: Don't write any code until you know where to download from. NetEase Cloud MP3 download address: Music.163.com/#
- Capture the traffic with Wireshark, Charles, or the browser's F12 developer tools;
- After layer-upon-layer analysis with those tools, I'm a little sorry to say I could not find the right URL to download the MP3 directly.
- Method one: click the song title to play the music;
- On the new page, click "Generate external-link player";
- Visit the plug-in address;
- Request the URL Music.163.com/weapi/song/… and take the value of the URL field under data in the response: M801.music.126.net/20210302121…;
- That address is directly accessible, so from the command line, `curl -o` will download the MP3 at that address.

That gives us one MP3 download address. It was not actually hard to get: simply clicking a title opens the playback page, and only the ID needs to change: music.163.com/#/song?id=1…
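As noted above, only the ID in the playback-page address changes; a trivial sketch (the ID below is made up for illustration):

```python
def play_page_url(song_id):
    # Build the playback-page address; only the id query parameter varies.
    return "https://music.163.com/#/song?id={}".format(song_id)

print(play_page_url(123456))  # hypothetical song ID
# → https://music.163.com/#/song?id=123456
```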
- Method 2: when in doubt, ask Baidu. In no time at all, the MP3 conversion address turns out to be: music.163.com/song/media/… If you know the song ID, replace the part shown in red (the song ID) to download;
- Try it in the browser: music.163.com/song/media/…
Step 2: Now that the steps above have produced the MP3 download address, it's time to start writing code.
```python
import requests
from urllib import request  # urllib.request, used for urlretrieve

def download_by_songID(song_id, song_name):
    url = "https://music.163.com/song/media/outer/url?id={}.mp3".format(song_id)
    # The outer url replies with a 302 redirect; the real file address is in
    # the Location response header, so don't follow redirects automatically.
    # headers is the request-header dict constructed in Step 3.
    req = requests.get(url, headers=headers, allow_redirects=False)
    req_url = req.headers['Location']
    # path_config.music_path is the author's own download-directory config
    request.urlretrieve(req_url, path_config.music_path + "{}.mp3".format(song_name))

song_id = input("Please enter song_id:")
song_name = input("Please enter song_name:")
try:
    download_by_songID("{}".format(song_id), "{}".format(song_name))
except Exception as e:
    print(e)
```
Step 3: The code above completes the MP3 download function. The next thing to do is keep the anti-crawler from recognizing you, which means constructing the headers.
Remember the fake_useragent library from above?
```python
from fake_useragent import UserAgent

ua = UserAgent()
headers = {
    "Connection": "keep-alive",
    "user-agent": ua.random,
    "Host": "music.163.com",
    "sec-fetch-mode": "nested-navigate",
}
```
At this point, the NetEase Cloud MP3 download script is complete. Feels pretty good, doesn't it? Of course, the site still has some MP3s that you have to pay for.