Simulated login for Python crawlers

As anyone who writes crawlers regularly knows, some pages cannot be fetched without logging in first. Zhihu's topic pages, for example, are only accessible to logged-in users, and “login” is inseparable from HTTP cookies.

Login principle

The principle behind cookies is simple. HTTP is a stateless protocol, so cookies were introduced to maintain session state on top of it and let the server know which client it is dealing with. A cookie is an identifier that the server assigns to the client.

The browser's first HTTP request carries no cookie information.
The server returns the HTTP response together with a cookie for the browser.
The browser's second request sends that cookie back to the server.
When the server receives a request and finds a Cookie field in the request headers, it knows it has dealt with this user before.
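The four steps above can be sketched as a toy model. This is only an illustration under stated assumptions: `server_handle`, `ToyBrowser`, and the cookie name `session_id` are hypothetical names invented for the example, and no real network traffic is involved. Only the parsing of the `Set-Cookie` and `Cookie` header strings uses a real standard-library class (`http.cookies.SimpleCookie`).

```python
import uuid
from http.cookies import SimpleCookie

known_sessions = set()  # identifiers the toy server has handed out so far


def server_handle(request_headers):
    """Toy server: return (response_headers, message) for one request."""
    cookie = SimpleCookie(request_headers.get("Cookie", ""))
    if "session_id" in cookie and cookie["session_id"].value in known_sessions:
        # Step 4: a known identifier came back, so this is a returning client.
        return {}, "welcome back"
    # Step 2: unknown client, so assign an identifier via Set-Cookie.
    session_id = uuid.uuid4().hex
    known_sessions.add(session_id)
    return {"Set-Cookie": "session_id=%s; Path=/" % session_id}, "first visit"


class ToyBrowser:
    """Toy client that stores cookies and replays them on later requests."""

    def __init__(self):
        self.cookies = {}

    def get(self):
        headers = {}
        if self.cookies:
            # Step 3: send every stored cookie back in the Cookie header.
            headers["Cookie"] = "; ".join(
                "%s=%s" % pair for pair in self.cookies.items()
            )
        response_headers, message = server_handle(headers)
        if "Set-Cookie" in response_headers:
            parsed = SimpleCookie(response_headers["Set-Cookie"])
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value
        return message


browser = ToyBrowser()
first = browser.get()   # step 1: no cookie is sent
second = browser.get()  # the stored cookie is sent this time
print(first, "/", second)
```

A real browser or HTTP library does the same bookkeeping, plus domain, path, and expiry checks that this sketch leaves out.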

Practical application

As anyone who has used Zhihu knows, you log in by providing a user name, password, and verification code (captcha). That is only what we see on the surface; the technical details behind it have to be dug out with the browser. Let’s use Chrome to see what happens when we submit the form.

First open the Zhihu login page www.zhihu.com/#signin, open Chrome’s developer tools (press F12), and deliberately enter a wrong verification code to observe how the browser sends the request.

Several key pieces of information can be read from the browser request:

The login URL is www.zhihu.com/login/email
Login requires the user name (email), password (password), verification code (captcha), and _xsrf.
The URL for fetching the captcha image is www.zhihu.com/captcha.gif…

What is _xsrf? If you are familiar with CSRF (cross-site request forgery) attacks, you already know what it is for: _xsrf is a pseudo-random string used to prevent cross-site request forgery. It usually lives inside a form on the web page. To verify this, search for “xsrf” in the page source, and sure enough, _xsrf sits in a hidden input tag.
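To make the “hidden input tag” concrete, here is a minimal sketch that pulls such a token out of a form using only the standard library, so it runs without installing anything. The HTML snippet and the token value are made up for illustration and are not copied from the Zhihu page; `HiddenInputFinder` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Remember the value of the first <input> whose name attribute matches."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (
            tag == "input"
            and attrs.get("name") == self.field_name
            and self.value is None
        ):
            self.value = attrs.get("value")


# Illustrative markup only; the real page layout may differ.
sample_html = """
<form method="post" action="/login/email">
  <input type="hidden" name="_xsrf" value="0123456789abcdef"/>
  <input type="text" name="email"/>
</form>
"""

finder = HiddenInputFinder("_xsrf")
finder.feed(sample_html)
token = finder.value
print(token)
```

The server compares the token submitted with the form against the one it issued, so a forged request from another site, which cannot read the page, fails the check.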

Now that we know what data the browser sends when logging in, we can write Python code that simulates the browser login. It relies on two third-party libraries, Requests and BeautifulSoup, so install them first:

pip install beautifulsoup4==4.5.3
pip install requests==2.13.0

The http.cookiejar module handles HTTP cookies automatically. Its LWPCookieJar class wraps a set of cookies and supports saving them to, and loading them from, a file.

The session object provides cookie persistence and connection pooling; requests are sent through it.

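First load the cookies from the cookies.txt file. On the first run the file does not exist yet, so loading fails with an exception, which we catch.

import requests
from http import cookiejar

# Request headers shared by every call below. The original listing does not show
# this dict, so the User-Agent value here is only a placeholder assumption.
headers = {"User-Agent": "Mozilla/5.0"}

session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    session.cookies.load(ignore_discard=True)
except (FileNotFoundError, cookiejar.LoadError):
    # No cookie file yet (first run), or the file is not in a readable format
    print("load cookies failed")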

Get _xsrf

We have already found the tag that holds _xsrf, so its value is easy to get with BeautifulSoup’s find method.

from bs4 import BeautifulSoup

def get_xsrf():
    """Fetch the home page and return the value of the hidden _xsrf input."""
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf

Obtaining verification code

The captcha is returned by the /captcha.gif endpoint. Here we download the captcha image and save it to the current directory so it can be read manually. You could also recognize it automatically with a third-party library such as Pytesser.

import time

def get_captcha():
    """Save the captcha image to the current directory and read it in manually."""
    t = str(int(time.time() * 1000))  # current time in milliseconds
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("Verification code:")
    return captcha

Login

With all the parameters in place, you can request the login interface.

def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    # Form fields observed in the browser request
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        "captcha": get_captcha(),
        'remember_me': 'true'}
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    # Show the cookies the server returned, then write them to cookies.txt
    for i in session.cookies:
        print(i)
    session.cookies.save()

After a successful request, the session automatically fills session.cookies with the cookies returned by the server, and subsequent requests to pages that require login carry those cookies automatically.
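The save-then-reload cycle behind cookies.txt can be tried in isolation. The sketch below is only an illustration: it builds a cookie by hand instead of receiving one from a server, writes to a temporary directory, and uses a made-up cookie name, value, and domain. `make_cookie` is a hypothetical helper; `LWPCookieJar` and `Cookie` come from the standard `http.cookiejar` module.

```python
import os
import tempfile
import time
from http import cookiejar


def make_cookie(name, value, domain):
    """Build a Cookie by hand; in a real run the server response supplies it."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False,
        expires=int(time.time()) + 3600,  # valid for one hour
        discard=False,
        comment=None, comment_url=None,
        rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# "First run": nothing on disk yet, so store a cookie and save the jar.
first_jar = cookiejar.LWPCookieJar(filename=path)
first_jar.set_cookie(make_cookie("session_id", "abc123", "example.com"))
first_jar.save(ignore_discard=True)

# "Second run": a fresh jar picks the cookie up from the file.
second_jar = cookiejar.LWPCookieJar(filename=path)
second_jar.load(ignore_discard=True)
restored = {c.name: c.value for c in second_jar}
print(restored)
```

Note that `save()` skips session cookies (those marked to be discarded when the browser closes) unless `ignore_discard=True` is passed, which is why the sketch passes it on both save and load.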

Source: github.com/lzjun567/cr…

References:

For a brief introduction to the HTTP protocol, I recommend the article “A complete HTTP request process” published on the public account “Zen of Python”.
Docs.python.org/3/library/h…
Docs.python-requests.org/en/master/u…
