Analysis and comparison of acquisition schemes

At present, the comparison of the different collection schemes for crawling official account articles and their dynamic data (reads, likes, comments) is as follows:

From the figure above: if you need to monitor official accounts' new articles in real time over a long period, I recommend the reverse-engineering approach. If you want interfaces such as read/like/comment counts, or converting a Sogou WeChat temporary link into a permanent link, I recommend the universal-key approach. As for the man-in-the-middle approach, its technical threshold is low and its development cycle is short; if you do not have many official accounts to monitor and your timeliness requirements are not strict, I recommend this approach.

The principle of man-in-the-middle collection is described in detail below.

Detailed description of collection scheme

Man-in-the-middle approach

Acquisition principle

The man in the middle is essentially a packet-capture proxy. The schematic diagram is as follows:

The WeChat client can display an article because it requests it from the WeChat server; after receiving the request, the server returns the corresponding article to the client. Here we use a packet-capture tool (the man in the middle) to intercept that traffic, parse the intercepted article data, and write it to the database, completing a simple data capture.

So how do we capture many articles automatically, and turn the pages of the list page automatically? We can't do it all by hand. The first thing that comes to mind is an automation tool, such as the well-known Button Wizard (按键精灵). But how such an automation tool would interact with the packet-capture tool is a problem. We would need to ensure that, after one piece of data is intercepted and stored, the tool clicks the next target to fetch, and that when the network misbehaves the tool detects it and swipes the current page to re-issue the request. Even if this could be done, it would be cumbersome, so I didn't do it. I also don't like automation tools; I always feel they are unstable…

Since the WeChat article pages are HTML, we can embed JS in them and let them jump automatically. So how do we embed our own JS into the article's source code? This is where the man in the middle comes in: since it can intercept the data, it can of course also modify it before returning it to the client. So this works.

Code parsing

Now that we know how the man in the middle works, here is how the code works. The language is Python 3 and the packet-capture tool is mitmproxy. The repository is: github.com/striver-ing…

You can download the code first and follow along with it while reading this article.

The directory structure of this project is:

├── config.yaml            # configuration file
├── core
│   ├── capture_packet.py  # packet capture
│   ├── data_pipeline.py   # data warehousing
│   ├── deal_data.py       # data processing
│   └── task_manager.py    # task scheduling
├── create_tables.py       # creates the tables
├── db                     # database encapsulation
│   ├── mysqldb.py         # mysql database
│   └── redisdb.py         # redis database
├── run.py                 # start entry
└── utils                  # toolkit
    ├── log.py             # log
    └── tools.py           # some function encapsulation

capture_packet.py

This module intercepts the data returned from the WeChat server to the client, hands the intercepted data to deal_data.py for processing, and then injects JS into the response before it goes back to the WeChat client.

The rules in the red boxes determine which packets are intercepted: for example, if a response's URL contains /s?__biz=, that packet's data will be intercepted. Which packet does each rule correspond to? The comments in the code indicate the packet for each rule.

The code that injects the JS is a response hook in the mitmproxy addon.
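Here is a minimal sketch of that hook, assuming mitmproxy's standard addon API (the deal_article call and rule string follow the description above; this is illustrative, not the project's exact code):

from mitmproxy import http

from core import deal_data  # project module, per the directory tree above

class CapturePacket:
    def response(self, flow: http.HTTPFlow) -> None:
        # interception rule: article responses have /s?__biz= in the URL
        if "/s?__biz=" in flow.request.url:
            html = flow.response.get_text()
            # deal_data parses and stores the article, then returns the
            # JS for the next jump (next_page, built by task_manager.py)
            next_page = deal_data.deal_article(html)
            # inject the JS so the client automatically loads the next target
            flow.response.set_text(html + next_page)

addons = [CapturePacket()]

A script like this is typically run through mitmproxy itself, e.g. mitmdump -s core/capture_packet.py, with the phone's proxy pointed at the machine running it.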

next_page is the injected JS; its value is returned by task_manager.py, which is covered below.

The core JS is:

<script>setTimeout(function(){window.location.href='url';}, sleep_time);</script>

That is, a timer jumps to a specified URL after a certain interval. The URL is the next target we want to grab: it can be an article address, the address of the next page of the history page, and so on.

Pitfalls

Pitfall 1: the first page of the list is HTML, so JS can be injected, but subsequent pages come back as JSON, where the injected JS does not take effect. So you need to change the response headers.

Pitfall 2: the article page has a security mechanism, so externally injected JS does not take effect; the response headers also need to be changed. As follows:
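A hedged sketch of both header fixes, assuming they live in the same mitmproxy response hook (treating the security mechanism as a Content-Security-Policy header is my reading of the description; the exact headers may differ):

def fix_response_headers(flow: http.HTTPFlow) -> None:
    # Pitfall 1: paged list data comes back as JSON; declare it as HTML
    # so the client renders it and the injected <script> can execute
    flow.response.headers["Content-Type"] = "text/html; charset=utf-8"
    # Pitfall 2: drop the header that blocks externally injected JS
    # on article pages
    if "Content-Security-Policy" in flow.response.headers:
        del flow.response.headers["Content-Security-Policy"]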

Optimization

To make the WeChat client load pages faster and reduce unnecessary network requests, we can remove images and videos from the page, as follows:

This code replaces the img tags in the data returned to the WeChat client with an empty string, so the client naturally does not load the images; the same principle keeps it from loading videos.
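A minimal sketch of that replacement (the regexes are illustrative; the project may match the tags differently):

import re

def strip_media(html: str) -> str:
    # replace <img ...> tags with nothing so the client never requests
    # the images; the same idea keeps <video> elements from loading
    html = re.sub(r"<img[^>]*>", "", html, flags=re.I)
    html = re.sub(r"<video[\s\S]*?</video>", "", html, flags=re.I)
    return html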

This module's code is the core of the core; it is also the entirety of the man-in-the-middle logic. If you don't understand it on first reading, go over it a few more times before reading on.

deal_data.py

This module cleans the data and stores it. Its functions are as follows:

__parse_account_info: parses official account information
__parse_article_list and deal_article_list: parse the article list
deal_article: parses the article
deal_article_dynamic_info: parses the article's dynamic data (reads, likes)
deal_comment: parses comment information
get_task: obtains the next task

One detail here: after processing the data, these functions return the JS to be injected (i.e., the next page to fetch) to capture_packet.py, which enables the subsequent automatic fetching of other articles or the history page. However, the read/like-count and comment interfaces receive their requests at the same time the article address is visited, so those parsing functions do not need to return injected JS.
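A hedged sketch of that detail (the function bodies are illustrative; only the names come from the list above):

def deal_article(html):
    # ... parse the article fields and hand them to data_pipeline ...
    # return the JS for the next target so capture_packet.py can inject it
    return task_manager.next_page()

def deal_article_dynamic_info(json_text):
    # ... parse and store the read/like counts ...
    # this interface is requested alongside the article page itself,
    # so there is no JS to return
    return None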

The figure above shows the specific execution logic. I recommend capturing the packets yourself and analyzing them alongside the code; that makes it much easier to understand.

task_manager.py

This module manages tasks: it first fetches a task from Redis; if there is none in Redis, it fetches from MySQL and then adds the tasks to Redis.
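A minimal sketch of that lookup order, assuming the redisdb/mysqldb wrappers from the directory tree (the method, key, and table names are illustrative):

def get_task():
    # try Redis first
    url = redis_db.lpop("wechat:tasks")
    if url:
        return url
    # fall back to MySQL, cache the batch in Redis, then retry
    rows = mysql_db.query("SELECT url FROM tasks WHERE state = 0 LIMIT 100")
    if rows:
        redis_db.rpush("wechat:tasks", [row[0] for row in rows])
        return redis_db.lpop("wechat:tasks")
    return None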

The key jump code in next_page is below. (The doubled braces appear because these snippets are Python format-string templates; {url} and {sleep_time_msec} are filled in at runtime.)

Jump to the next URL:

<script>setTimeout(function(){{window.location.href='{url}';}},{sleep_time_msec});</script>

When there is no task, the current page refreshes after a certain interval:

<script>setTimeout(function(){{window.location.reload();}},{sleep_time_msec});</script>

The reason we refresh after an interval when there is no task is to trigger another request from the WeChat client to the server; the man in the middle can then capture that packet, which triggers this module's logic again and re-attempts to obtain a task.
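Putting the two templates together, next_page plausibly looks like this sketch (the sleep values are illustrative):

JUMP_JS = "<script>setTimeout(function(){{window.location.href='{url}';}},{sleep_time_msec});</script>"
RELOAD_JS = "<script>setTimeout(function(){{window.location.reload();}},{sleep_time_msec});</script>"

def next_page():
    url = get_task()
    if url:
        return JUMP_JS.format(url=url, sleep_time_msec=5000)
    # no task yet: keep refreshing so requests keep flowing through the proxy
    return RELOAD_JS.format(sleep_time_msec=10000)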

data_pipeline.py

There is not much to say about this module: it just writes the data to the database, so it is omitted here.

Conclusion

The above is a comparative analysis of the mainstream WeChat official-account crawling schemes at this stage, plus a code walkthrough of the WeChat official-account crawler at github.com/striver-ing… I recommend capturing the WeChat data packets and analyzing them to work out the flow of WeChat official-account data requests; that helps a lot in understanding this code. At present, mobile WeChat above version 7.0 seems to block packet capture; you can capture on the PC or Mac client instead, since the protocol is the same.

I hope this share has been of some help to you. Thank you~

Preview of the next share

Zhilian-Rui anti-crawler cracking

To learn more

You are welcome to join my Knowledge Planet: t.zsxq.com/eEmAeae

This planet focuses on sharing crawler technology and explains crawler problems and their solutions in detail through case studies. Topics include, but are not limited to: crawler framework analysis, JS reverse engineering, man-in-the-middle techniques, Selenium, Pyppeteer, and Android reverse engineering! We look forward to having you join us to discuss crawler technology and expand your crawler thinking!