“This is the 16th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

Many platforms have a like feature, and the idea in this post applies to most of them. Hopefully you can master this technique and implement your own auto-liking bot. The goal of this case is to auto-like posts on the Huxiu 24-hour channel.

Analysis before crawling

Analyze hotspot data sources

This example spends more time in the analysis phase and is slightly more difficult, but once you understand it, you will have learned a very common crawler-writing technique.

Drag the browser scroll bar to the bottom of the page to capture the request https://moment-api.huxiu.com/web-v2/moment/feed. The request and response data are shown below.

The request mode is POST.

The request carries a last_dateline parameter, and each response returns a new last_dateline value, which is used as the parameter of the next request.

The returned data has many nested levels and looks complex, but the data format is JSON, which can be parsed normally.

The response also contains the value of last_dateline, so you can keep requesting in a loop until no valid value is returned.

So far, data acquisition and analysis are completed, and the conclusions are as follows:

  • The request mode is POST.
  • The requested address is https://moment-api.huxiu.com/web-v2/moment/feed;
  • Request parameters are last_dateline, platform, and is_ai;
  • For the first request, the value of last_dateline can be fixed;
  • Get all hotspot information through a loop.

Analyze the likes interface

Clicking the like button sends a request to https://moment-api.huxiu.com/web/moment/agree; this is the address of the like interface.

The request mode and address are as follows:

The request data is shown in the figure below:

The parameter moment_id should be the moment_id of the hotspot item obtained above. Its specific position is shown below:

The data format returned after a successful request is shown below.

At this point, the “like” interface analysis is completed, and the conclusions are as follows:

  • The request mode is POST.
  • The requested address is https://moment-api.huxiu.com/web/moment/agree;
  • Request parameters are moment_id and platform;
  • To test the code, you can make a request with a fixed moment_id value.

Crawler writing time

POST requests in the Requests library

Let’s start by looking at how the Requests library initiates POST requests.

POST requests can be used in Requests in several ways, as follows.

The most common implementation of post is simply passing a dictionary to the data parameter.

import requests

# This site is intended for testing use
url = 'http://httpbin.org/post'
data = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=data)
print(r)
print(r.text)
print(r.content)

Note: if the submitted key-value pairs appear under "form" in the returned data, the request succeeded.
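As a quick illustration, the check boils down to parsing the JSON body and looking for the submitted fields under "form". The response text below is a trimmed, hypothetical sample of what httpbin echoes back, not a live capture:

```python
import json

# A trimmed example of what http://httpbin.org/post echoes back
# (hypothetical sample; the real response contains more fields).
response_text = '{"form": {"key1": "value1", "key2": "value2"}, "url": "http://httpbin.org/post"}'

body = json.loads(response_text)
print(body["form"])  # the submitted fields echoed back under "form"
</imports>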

Passing a list of tuples to the data argument

This is especially useful if multiple elements in the form to be submitted use the same key.

import requests

# This site is intended for testing use
url = 'http://httpbin.org/post'
payload = (('key1', 'value1'), ('key1', 'value2'))
r = requests.post(url, data=payload)

print(r.text)

Pass a JSON string

If you pass a string instead of a dictionary, the data is sent as-is rather than being form-encoded.
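You can see this difference without touching the network by inspecting a PreparedRequest. This is a sketch using Requests' `Request`/`prepare()` API; the httpbin URL is only a placeholder and nothing is actually sent:

```python
import json
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Form-encoded: the dictionary becomes key1=value1&key2=value2
form_req = requests.Request('POST', 'http://httpbin.org/post',
                            data=payload).prepare()
# Raw string: the json.dumps output is sent exactly as-is
raw_req = requests.Request('POST', 'http://httpbin.org/post',
                           data=json.dumps(payload)).prepare()

print(form_req.body)
print(raw_req.body)
```
</imports>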

import requests
import json

# This site is intended for testing use
url_json = 'http://httpbin.org/post'
# dumps: encodes a Python object into a JSON string
data_json = json.dumps({'key1': 'value1', 'key2': 'value2'})
r_json = requests.post(url_json, data=data_json)

print(r_json.text)

Transfer the file

This is a little bit beyond the syllabus of the crawler class, so ignore it for now.

Note that the practical difference between POST and GET is where the data travels. Now we can actually start coding.
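A minimal way to see where the data ends up, again by inspecting prepared requests rather than sending anything (the httpbin URLs are placeholders):

```python
import requests

# GET: the data rides in the query string of the URL
get_req = requests.Request('GET', 'http://httpbin.org/get',
                           params={'key1': 'value1'}).prepare()
# POST: the same data rides in the request body
post_req = requests.Request('POST', 'http://httpbin.org/post',
                            data={'key1': 'value1'}).prepare()

print(get_req.url)    # the parameter appears in the URL
print(post_req.url)   # the URL stays clean
print(post_req.body)  # the parameter appears in the body instead
```
</imports>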

Get the data to be liked

First get a page of data to be liked, the code is as follows:

import requests

def get_feed():
    # Build the data dictionary
    data = {
        "last_dateline": "1605591900", "platform": "www", "is_ai": 0
    }
    r = requests.post(
        "https://moment-api.huxiu.com/web-v2/moment/feed", data=data)
    data = r.json()
    # data["success"]

    # Get the data; note that there are many nested layers, map them out carefully
    datalist = data["data"]["moment_list"]["datalist"][0]["datalist"]
    print(datalist)

Extract last_dateline from the response and loop until it comes back empty. I'll leave the rest of the loop to you; now that the crawler lesson is over, you'll need to extend some basic Python syntax yourself.
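One possible shape for that loop is sketched below. The page-fetching step is factored out into a `fetch` callable (a stand-in for a wrapper around the POST request above) so the pagination logic can be shown on its own; where exactly last_dateline sits in the real response has to be taken from the JSON you captured:

```python
def crawl_all(fetch, first_dateline="1605591900"):
    """Keep requesting pages, feeding each returned last_dateline
    into the next request, until it comes back empty."""
    last_dateline = first_dateline
    items = []
    while last_dateline:
        page = fetch(last_dateline)  # e.g. a wrapper around the feed POST
        items.extend(page["datalist"])
        last_dateline = page.get("last_dateline")
    return items

# Usage with a stub standing in for the real feed request:
pages = {
    "1605591900": {"datalist": [1, 2], "last_dateline": "1605591000"},
    "1605591000": {"datalist": [3], "last_dateline": None},
}
print(crawl_all(lambda d: pages[d]))  # [1, 2, 3]
```
</imports>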

Like code writing

I ran into a bit of a snag in the process of writing the like code, which involved the age-old problem of cookies.

import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

def agree(moment_id):

    data = {
        "moment_id": moment_id,
        "platform": "www",
    }

    r = requests.post(
        "https://moment-api.huxiu.com/web/moment/agree", data=data, headers=headers)
    res_data = r.json()
    print(res_data)
    # print(res_data["success"] == "true")

if __name__ == "__main__":
    agree(130725)

Running the above code, the request returns:

{'success': False, 'message': 'Please open cookies'}

A second check of the request captured by the developer tool did find a unique cookie as follows:

The huxiu_analyzer_wcy_id parameter seems to be the problem. Obtaining this value is a bit more cumbersome, and you need to find a way to capture it.

Eraser solved this by following the steps below; hopefully you can learn this common way of tracking down where a cookie comes from.

Open Chrome in incognito mode and go directly to https://www.huxiu.com/moment/. Incognito mode is used because you need to make sure no cookies are cached.

Next, retrieve the huxiu_analyzer_wcy_id parameter directly from the developer tools.

Get the following:

This is where the cookie comes from: it is set by the response of the checkLogin interface.

The next solution is pretty crude: look at how checkLogin is called, request it, save the cookie value from its response, and then request the like interface.

Finally modify the like code to complete the task:

def agree(moment_id):
    s = requests.Session()
    r = s.get("https://www-api.huxiu.com/v1/user/checklogin", headers=headers)

    jar = r.cookies
    print(jar.items())
    data = {
        "moment_id": moment_id,
        "platform": "www",
    }

    r = s.post(
        "https://moment-api.huxiu.com/web/moment/agree", data=data, headers=headers, cookies=jar)
    res_data = r.json()
    print(res_data)
    # print(res_data["success"] == "true")

Run the above code and the like goes through correctly:

{'success': True, 'message': 'Liked successfully'}

A few closing words

This blog post introduced POST requests with the Requests library, along with a general approach to tracking down cookies. Note that the developer tools are used constantly in crawler programming, especially when dealing with JavaScript encryption issues.