Simulating Zhihu login with Python (2019 version): captcha support and saved Cookies

Zhihu's login page has been revised several times with stronger authentication, so most of the simulated-login code found online no longer works. I therefore rewrote a complete version that also handles submitting the captcha (including the Chinese one). In this article I break down the analysis process and the code step by step. The complete code is referenced at the end of the article, but I still recommend reading the text through, because code will break sooner or later while the approach to the analysis stays useful.

Analyzing POST requests

First, open the browser console and log in normally. You can quickly find the login API request; this is the URL that the simulated login must POST to.

Our ultimate goal is to build the Headers and form-data objects required for the POST request.

Build the Headers

Looking at the Request Headers and comparing them with the GET request for the login page, we find three additional authentication fields in the POST headers; testing showed that x-xsrftoken is required. x-xsrftoken is a token that protects against cross-site request forgery (XSRF). Its value can be obtained from the Set-Cookie field of the Response Headers when visiting the home page.
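As a rough illustration of reading the token out of a Set-Cookie value, here is a minimal sketch using only the standard library. The cookie name `_xsrf`, the helper name `extract_xsrf` and the sample header string are assumptions made for the example; check the actual response in your browser console.

```python
from http.cookies import SimpleCookie


def extract_xsrf(set_cookie_header, name='_xsrf'):
    """Return the value of the named cookie from a Set-Cookie string, or None."""
    cookie = SimpleCookie()
    cookie.load(set_cookie_header)
    morsel = cookie.get(name)
    return morsel.value if morsel is not None else None


# Hypothetical Set-Cookie value, for illustration only
sample_header = '_xsrf=abc123; Path=/; Domain=example.com'
token = extract_xsrf(sample_header)

# The token is then sent back as a request header on the login POST
headers = {'x-xsrftoken': token}
print(headers)
```

When a session object keeps cookies for you, the same value can usually be read from its cookie jar instead of parsing the header by hand.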

Build the Form Data

The form data is now encrypted, so it cannot be read directly. You can recover the original fields by setting breakpoints in the JS (I won't go into detail here; if you are not familiar with breakpoints, please look it up).

We then build each of the form parameters one by one. timestamp is the easiest: it is a 13-digit integer (milliseconds), whereas the integer part of Python's time.time() has only 10 digits (seconds), so it needs to be multiplied by 1000.

timestamp = str(int(time.time()*1000))

signature is generated in a JS file: it uses the HMAC algorithm to hash a few fixed values together with the timestamp, so you only need to reproduce the same computation in Python.

def _get_signature(self, timestamp):
    ha = hmac.new(b'd1b964811afb40118a12068ff74a12f4', digestmod=hashlib.sha1)
    grant_type = self.login_data['grant_type']
    client_id = self.login_data['client_id']
    source = self.login_data['source']
    ha.update(bytes((grant_type + client_id + source + timestamp), 'utf-8'))
    return ha.hexdigest()
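To try the signature step outside the class, the following standalone sketch may help. The HMAC key is the one from the snippet above; the `grant_type`, `client_id` and `source` values are placeholders, not the real ones, and `make_signature` is a hypothetical helper name. Note that the timestamp is passed as a string, because it is concatenated with the other fields.

```python
import hashlib
import hmac
import time

# Key taken from the article's snippet
HMAC_KEY = b'd1b964811afb40118a12068ff74a12f4'


def make_signature(grant_type, client_id, source, timestamp):
    """HMAC-SHA1 over the concatenated fixed fields and the timestamp string."""
    message = (grant_type + client_id + source + timestamp).encode('utf-8')
    return hmac.new(HMAC_KEY, message, hashlib.sha1).hexdigest()


# 13-digit millisecond timestamp, kept as a string
timestamp = str(int(time.time() * 1000))

# Placeholder field values, for illustration only
signature = make_signature('example-grant', 'example-client-id', 'example-source', timestamp)
print(timestamp, signature)
```

The result is a 40-character hexadecimal string, which is the length of a SHA-1 digest.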

captcha is obtained through a separate API endpoint. A GET request returns whether a captcha is required (this request must be made once regardless of whether a captcha turns out to be needed). If the answer is true, a PUT request to the same endpoint returns the Base64-encoded image.

resp = self.session.get(api, headers=headers)
show_captcha = re.search(r'true', resp.text)
if show_captcha:
    put_resp = self.session.put(api, headers=headers)
    json_data = json.loads(put_resp.text)
    img_base64 = json_data['img_base64'].replace(r'\n', '')
    with open('./captcha.jpg', 'wb') as f:
        f.write(base64.b64decode(img_base64))
    # Open the image only after the file has been written and closed
    img = Image.open('./captcha.jpg')
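The decoding step can be checked offline with a fake response. This is only a sketch: the `img_base64` key comes from the snippet above, while `decode_captcha_image` and the fake payload are made up for the example.

```python
import base64
import json


def decode_captcha_image(response_text):
    """Parse the JSON body and return the raw image bytes."""
    payload = json.loads(response_text)
    # Drop newline characters before decoding
    img_base64 = payload['img_base64'].replace('\n', '')
    return base64.b64decode(img_base64)


# Fake response body standing in for the PUT result (not real captcha data)
fake_body = json.dumps({
    'img_base64': base64.encodebytes(b'fake image bytes').decode('ascii')
})
image_bytes = decode_captcha_image(fake_body)
print(len(image_bytes))
```

The returned bytes can then be written to a file in binary mode and opened with an image library, as in the snippet above.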

There are actually two APIs: one shows inverted Chinese characters to be identified, and the other is an ordinary English captcha; you can use either. The code implements both. For the Chinese captcha, the coordinates clicked in a matplotlib (plt) window are converted to JSON. Finally, if there is a captcha, you must first POST its parameters to the captcha API, and then send them to the login API along with the other parameters.

if lang == 'cn':
    import matplotlib.pyplot as plt
    plt.imshow(img)
    print('Click on all inverted Chinese characters and press Enter to submit')
    points = plt.ginput(7)
    capt = json.dumps({'img_size': [200, 44], 'input_points': [[i[0] / 2, i[1] / 2] for i in points]})
else:
    img.show()
    capt = input('Please enter the verification code in the picture:')
# The captcha parameters must be POSTed to the captcha API first
self.session.post(api, data={'input_text': capt}, headers=headers)
return capt
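The conversion from clicked points to the JSON string can be isolated into a small function, which makes it easy to test without opening a plot window. This is a sketch: `points_to_captcha` is a hypothetical helper, and the reading that the division by 2 maps a displayed image back to the 200 x 44 size is my assumption based on the snippet above.

```python
import json


def points_to_captcha(points, scale=2, img_size=(200, 44)):
    """Convert (x, y) click coordinates into the captcha JSON string."""
    return json.dumps({
        'img_size': list(img_size),
        # Scale the clicked coordinates down to the declared image size
        'input_points': [[x / scale, y / scale] for x, y in points],
    })


# Two made-up click positions, as plt.ginput would return them
clicks = [(40.0, 30.0), (100.0, 20.0)]
capt = points_to_captcha(clicks)
print(capt)
```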

Add username and password and keep the other fields fixed.

self.login_data.update({
    'username': self.username,
    'password': self.password,
    'lang': captcha_lang
})

# Keep the timestamp as a string: _get_signature concatenates it with other strings
timestamp = str(int(time.time()*1000))
self.login_data.update({
    'captcha': self._get_captcha(self.login_data['lang']),
    'timestamp': timestamp,
    'signature': self._get_signature(timestamp)
})

Encrypt the Form Data

However, Zhihu now requires the form data to be encrypted before the POST is sent, so we also have to solve the encryption. Because the JS we see is obfuscated, working out how the encryption is implemented is very time-consuming. So here I adopt the approach of sergioJune: call the JS through PyExecJS to do the encryption. All that is needed is to copy the obfuscated code in full and make a few modifications. For details, see his original article: zhuanlan.zhihu.com/p/57375111

with open('./encrypt.js') as f:
    js = execjs.compile(f.read())
    return js.call('Q', urlencode(form_data))
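Before the form reaches the JS function it is flattened into a query string with urlencode. The sketch below shows only that step, with made-up field values, so you can see what the encryption function receives as input; `encode_form` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def encode_form(form_data):
    """Flatten the form dict into an application/x-www-form-urlencoded string."""
    return urlencode(form_data)


# Made-up values, for illustration only
form_data = {
    'username': 'user@example.com',
    'password': 'secret',
    'timestamp': '1546300800000',
}
encoded = encode_form(form_data)
print(encoded)
```

Special characters such as `@` are percent-encoded, and the fields keep the insertion order of the dict.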

Thanks also to him for sharing some of the pitfalls; otherwise this would have been really hard to solve.

Save Cookies

Finally, implement a method that checks the login state. If a request to the login page is redirected, the login has succeeded, and the cookies are saved at that point (session.cookies is initialized as an LWPCookieJar object, so it has a save method). The next login can then read the cookies file directly.

def check_login(self):
    resp = self.session.get(self.login_url, allow_redirects=False)
    if resp.status_code == 302:
        self.session.cookies.save()
        return True
    return False
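The save and reload cycle of an LWPCookieJar can be tried without any network access. The sketch below uses only the standard library; the file name, the cookie name `session_token` and the helpers `load_cookie_jar` and `make_cookie` are hypothetical and exist only for the example.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def load_cookie_jar(path):
    """Create a jar bound to `path` and load saved cookies if the file exists."""
    jar = LWPCookieJar(filename=path)
    if os.path.exists(path):
        jar.load(ignore_discard=True)
    return jar


def make_cookie(name, value, domain):
    """Build a simple session cookie for demonstration purposes."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=False,
        comment=None, comment_url=None, rest={},
    )


# First run: no file yet, so the jar starts empty and is saved after "login"
cookie_path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
jar = load_cookie_jar(cookie_path)
jar.set_cookie(make_cookie('session_token', 'abc123', '.example.com'))
jar.save(ignore_discard=True)

# Later run: the cookies are read back from the file
restored = load_cookie_jar(cookie_path)
print([c.name for c in restored])
```

Passing `ignore_discard=True` to both calls keeps session cookies that have no expiry date, which would otherwise be skipped when saving or loading.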

The complete code

To get the code, follow the WeChat official account "life-oriented programming" and reply with the keyword "Zhihu".