This is the 23rd day of my participation in the Gwen Challenge.
01. Preface
I want to "mine" Douban's user and movie data to analyze the relationships between users and movies, as well as among them. The data volume needs to be at least in the tens of thousands.
But Douban has an anti-crawling mechanism, so here I'll share how to get around it (using the Douban website as the example).
02. Problem analysis
The initial code:
```python
import requests

headers = {
    'Host': 'movie.douban.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'cookie': 'bid=uVCOdCZRTrM; douban-fav-remind=1; __utmz=30149280.1603808051.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __gads=ID=7ca757265e2366c5-22ded2176ac40059:T=1603808052:RT=1603808052:S=ALNI_MYZsGZJ8XXb1oU4zxzpMzGdK61LFA; _pk_ses.100001.4cf6=*; __utma=30149280.1867171825.1603588354.1603808051.1612839506.3; __utmc=30149280; __utmb=223695111.0.10.1612839506; __utma=223695111.788421403.1612839506.1612839506.1612839506.1; __utmz=223695111.1612839506.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=223695111; ap_v=0,6.0; __utmt=1; dbcl2="165593539:LvLaPIrgug0"; ck=ZbYm; push_noty_num=0; push_doumail_num=0; __utmv=30149280.16559; __utmb=30149280.6.10.1612839506; _pk_id.100001.4cf6=e2e8bde436a03ad7.1612839506.1.1612842801.1612839506.',
    'accept': 'image/avif,image/webp,image/apng,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'upgrade-insecure-requests': '1',
    # 'referer': '',
}
url = "https://movie.douban.com/subject/24733428/reviews?start=0"
r = requests.get(url, headers=headers)
```
This is basic crawler code: it sets the headers (including the cookie) for Requests, and it crawls data normally as long as there is no anti-crawling mechanism.
But the **Douban** website does have an anti-crawling mechanism!
After crawling only a few pages, a verification page appears!
Worse still: after passing the verification and continuing to crawl, it shows up again a few seconds later. Even crawling only once every few seconds does not solve it!
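For reference, one way to check whether a response has hit the verification wall is a small heuristic like the one below. The exact criteria (a non-200 status or a redirect away from movie.douban.com) are my own assumptions about how the block manifests, not documented Douban behaviour:

```python
def looks_blocked(response):
    """Heuristic check (assumption): a non-200 status or a redirect away from
    movie.douban.com usually means the verification page has kicked in."""
    return response.status_code != 200 or "movie.douban.com" not in response.url

# Usage with the request from the snippet above:
# r = requests.get(url, headers=headers)
# if looks_blocked(r):
#     print("Verification triggered - requests are being blocked")
```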
03. Solutions
Solution idea
Based on years of crawling experience, the first thing that comes to mind is setting up IP proxies, so that requests appear to come from different users. So let's try to get around **Douban**'s anti-crawling mechanism with IP proxies.
Obtain a large number of IP proxies
Setting up a single IP proxy is no different from crawling directly from our own machine, so we need a large pool of IP proxies and pick one at random for each request. That way the same IP never hits **Douban** often enough to be banned by its anti-crawling mechanism.
IP proxies are normally quite expensive, so as a freeloader I use free IP proxies here (personally tested, they work).
How to get them for free
https://h.shenlongip.com/index/index.html
The free IP proxy platform is Shenlong HTTP (this is not an advertisement; I just found that it can be used for free and wanted to share it).
After registering, you can obtain 1000 IP proxies for free.
We can then extract the IP proxies and save them in a text file.
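If the platform gives you an extraction API link after registration, saving the proxies to a text file can be a one-off script like the sketch below. The URL here is a placeholder, not the real Shenlong HTTP endpoint, and I'm assuming the API returns one `ip:port` per line:

```python
import requests

# Placeholder: replace with the extraction API link generated in your account
EXTRACT_API = "https://example.com/your-extract-api"

resp = requests.get(EXTRACT_API, timeout=10)
with open("IP proxy.txt", "w") as f:
    f.write(resp.text)  # assumed format: one "ip:port" per line
```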
Setting the IP Proxy
Read the IP proxies
```python
iplist = []
with open("IP proxy.txt") as f:
    iplist = f.readlines()
```
We just saved all the IPs to a text file; now read them back and store them in iplist.
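If you want to be a little more defensive when loading the file, an optional variant strips whitespace and drops blank lines up front:

```python
with open("IP proxy.txt") as f:
    # Strip newlines and skip empty lines while reading
    iplist = [line.strip() for line in f if line.strip()]
```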
Randomly extract IP proxies
Obtain an IP proxy:
```python
import random

def getip():
    # Pick a random proxy from iplist and strip the trailing newline
    proxy = iplist[random.randint(0, len(iplist) - 1)]
    proxy = proxy.replace("\n", "")
    proxies = {
        'http': 'http://' + str(proxy),
        # 'https': 'https://' + str(proxy),
    }
    return proxies
```
random.randint() picks a random IP proxy out of iplist, which is then wrapped into the proxies dictionary format that Requests expects.
**Note:** the HTTPS entry is commented out here because my IP proxies only support HTTP, so keeping it would cause errors on HTTPS requests!
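Free proxies die quickly, so it can also help to weed out the dead ones before the real crawl. A minimal sketch; the test URL and timeout are arbitrary choices of mine, not part of the original setup:

```python
import requests

def filter_alive(proxy_lines, test_url="http://www.baidu.com", timeout=5):
    # Keep only proxies that can complete a simple HTTP request in time
    alive = []
    for proxy in proxy_lines:
        proxy = proxy.strip()
        try:
            requests.get(test_url, proxies={'http': 'http://' + proxy},
                         timeout=timeout)
            alive.append(proxy)
        except requests.RequestException:
            pass  # dead or too slow, skip it
    return alive

iplist = filter_alive(iplist)
```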
IP proxy code
```python
headers = {
    'Host': 'movie.douban.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'cookie': 'bid=uVCOdCZRTrM; douban-fav-remind=1; __utmz=30149280.1603808051.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __gads=ID=7ca757265e2366c5-22ded2176ac40059:T=1603808052:RT=1603808052:S=ALNI_MYZsGZJ8XXb1oU4zxzpMzGdK61LFA; _pk_ses.100001.4cf6=*; __utma=30149280.1867171825.1603588354.1603808051.1612839506.3; __utmc=30149280; __utmb=223695111.0.10.1612839506; __utma=223695111.788421403.1612839506.1612839506.1612839506.1; __utmz=223695111.1612839506.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=223695111; ap_v=0,6.0; __utmt=1; dbcl2="165593539:LvLaPIrgug0"; ck=ZbYm; push_noty_num=0; push_doumail_num=0; __utmv=30149280.16559; __utmb=30149280.6.10.1612839506; _pk_id.100001.4cf6=e2e8bde436a03ad7.1612839506.1.1612842801.1612839506.',
    'accept': 'image/avif,image/webp,image/apng,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'upgrade-insecure-requests': '1',
    # 'referer': '',
}
url = "https://movie.douban.com/subject/24733428/reviews?start=0"
# verify=False skips SSL verification when going through the HTTP proxy
r = requests.get(url, proxies=getip(), headers=headers, verify=False)
```
After adding the IP proxy, hundreds of pages were crawled without hitting the verification again. Easily crawling tens of thousands of records is no longer a problem.
8677 records have been crawled so far, no verification has appeared, and the program keeps running ~~~
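For context, the paging loop looks roughly like the sketch below. The page step of 20 matches how Douban's review list paginates, but the retry-on-failure logic and the 10-second timeout are my own additions for illustration:

```python
import requests

results = []
for start in range(0, 10000, 20):  # review pages step by 20
    url = f"https://movie.douban.com/subject/24733428/reviews?start={start}"
    try:
        # getip() and headers are the ones defined above
        r = requests.get(url, proxies=getip(), headers=headers,
                         verify=False, timeout=10)
    except requests.RequestException:
        continue  # dead proxy: the next iteration picks a fresh one at random
    if r.status_code != 200:
        continue  # blocked or missing page, skip it
    results.append(r.text)  # parse the HTML here with your parser of choice
```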
The time interval
If you still run into verification, you can add a time interval so the program pauses for a few (customizable) seconds before crawling each page.
```python
import time

time.sleep(random.randint(3, 5))
```
random.randint(3, 5) generates a random integer between 3 and 5, so the program pauses for 3 to 5 seconds after each page. This is also an effective way to avoid triggering the anti-crawling mechanism.
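In the paging loop above, the pause simply goes at the end of each iteration, for example:

```python
import time
import random

for start in range(0, 10000, 20):
    # ... fetch and parse the page as in the loop above ...
    time.sleep(random.randint(3, 5))  # random 3-5 second pause between pages
```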
04. Summary
- Explained how IP proxies and time intervals are used to solve the anti-crawling verification problem
- Showed how to get usable IP proxies for free
- The crawled data will be analyzed and mined in a later article; this one only explains how to solve the anti-crawling problem (after all, our time is precious and reading is fragmented, so it is hard to digest too much content at once).