This article is participating in Python Theme Month.
Application scenarios
Anyone who has written a crawler runs into the same problem sooner or later: captchas. If you cannot get past them, your first steps in learning web scraping will be met with a bucket of cold water, because almost any website with even basic anti-crawler measures is likely to use a captcha. Today we are going to look at this problem in detail.
Solution one: cookies
One thing you have to know about when writing a crawler is the cookie (often used in the plural, cookies). A cookie is a small piece of text data (sometimes encrypted) that a website stores on the user's machine to identify the user and track the session. The client keeps it either temporarily or persistently.
Have you noticed that after logging in once, you often do not need to log in again for a long time? That is because of cookies. So all we have to do is send our requests with a valid cookie, which skips the login step and, with it, the captcha. However, some sites' cookies are troublesome: they expire or change often, and it is hard to know how often. Whether this method is usable depends on how well it works in practice for the site in question.
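To make the idea concrete, here is a minimal sketch of reusing a login cookie. It assumes you have copied the Cookie header from your browser after logging in by hand; the helper names `parse_cookie_header` and `fetch_logged_in_page` are made up for this illustration and are not part of any library.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' (as copied from the browser) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def fetch_logged_in_page(url, raw_cookie):
    """Request a page while presenting the cookies of an existing login."""
    # Third-party library; imported here so the parser above has no dependencies.
    import requests

    response = requests.get(url, cookies=parse_cookie_header(raw_cookie), timeout=10)
    response.raise_for_status()  # fail loudly on an HTTP error status
    return response.text
```

If the site has rotated or expired the cookie, the request will typically land back on the login page, which is the sign that this method has stopped working for that site.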
Solution two: semi-automated crawler
Selenium is a browser automation tool. It can simulate a person clicking through and operating a website. A slightly low-tech approach works fine here: let the script drive the browser and type the verification code in by hand, that is, semi-automatic operation. You get into the site in the end either way. Beginners do not have to worry that the captcha is entered manually, as long as the crawling itself is done in code.
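A semi-automated login could look roughly like the sketch below. It assumes Selenium 4 with Chrome installed; the function names and the `ask` parameter are hypothetical and exist only for this example. The script opens the login page, waits while you type the captcha in the browser window, and then collects the session cookies.

```python
def manual_login(driver, login_url, ask=input):
    """Open the login page, pause for a human to solve the captcha, return cookies."""
    driver.get(login_url)
    # Block until the person at the keyboard has finished logging in.
    ask("Log in and enter the captcha in the browser, then press Enter here... ")
    # get_cookies() returns a list of dicts; keep only name -> value.
    return {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}


def run_in_chrome(login_url):
    """Drive a real Chrome window through manual_login (needs selenium installed)."""
    from selenium import webdriver  # third-party

    driver = webdriver.Chrome()
    try:
        return manual_login(driver, login_url)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `run_in_chrome` with the login URL of your target site returns a cookie dictionary that later requests can reuse, so only the captcha step stays manual.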
Solution three: fully automated crawler
I took a close look at the request for this captcha.
It turns out that the response comes from PassCode.aspx, and the request URL can be found as well. This may seem strange: captchas are clearly pictures, so why is the URL not a PNG or JPG file? The reason is that the captcha is generated dynamically. When we click on it, it changes to a different captcha, and every click sends a new request to this URL.
Next, we need to save the captcha. It is served from an .aspx URL, and we save the response body as a GIF file. The code is as follows:
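Since the URL extension says nothing about the image type, one way to check what the server really sent is to look at the first bytes of the response. The sketch below is only an illustration (the name `sniff_image_type` is made up here); it recognizes the standard GIF, PNG and JPEG signatures.

```python
def sniff_image_type(data):
    """Guess the image format from the leading "magic" bytes, or return None."""
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return None  # unknown, e.g. an HTML error page instead of an image
```

Passing the raw response bytes to this function tells you which file extension fits, and a `None` result is a hint that the server answered with something other than a picture.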
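import requests

# Captcha endpoint; every GET returns a freshly generated image
url = 'http://appsso.pc139.zgyey.com/PassCode.aspx'
r = requests.get(url, timeout=10)
r.raise_for_status()  # stop here if the server returned an error status
# Write the raw bytes in binary mode; "with" closes the file automatically
with open(r'C:\Users\Administrator\Desktop\aaa.gif', 'wb') as f:
    f.write(r.content)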
Finally, the verification code can be successfully saved.
Next, we need to crop the picture and then use image recognition to read the verification code. Here the recognized result is an arithmetic expression, 3+6. Other captchas are easier to recognize, for example four plain digits.
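Once the text has been recognized, an arithmetic captcha like this one still has to be turned into an answer. The sketch below assumes the recognition step has already produced a string such as "3+6", and that the site expects the numeric result; both are assumptions, and `solve_arithmetic_captcha` is a name invented for this example.

```python
import operator
import re

# Operators this sketch understands; "x" and "×" are included because
# recognized text often uses them for multiplication.
_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "x": operator.mul,
    "×": operator.mul,
}

# Two integers around one operator, optionally followed by "=" and "?".
_PATTERN = re.compile(r"^\s*(\d+)\s*([+\-*x×])\s*(\d+)\s*=?\s*\??\s*$")


def solve_arithmetic_captcha(text):
    """Evaluate a recognized captcha such as '3+6' or '3 + 6 = ?'."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a simple arithmetic captcha: {text!r}")
    left, op, right = match.groups()
    return _OPS[op](int(left), int(right))
```

Parsing with a strict pattern, instead of handing the string to `eval`, means that misrecognized text raises an error rather than running as code.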
I will write a detailed article on how to recognize captcha images in the future. That is the overall idea behind the crawler.
If you think it’s good, give it a thumbs up before you leave.