Hi, I’m Latiao, and this is the 25th installment in my crawler series.
Today's target is a treasure trove of a site for self-media ("we media") creators.
The editing area
The editing area covers everything you need to edit an article. The key point is that all the features and materials are free. If you write a blog or run a WeChat official account you have probably used a similar editor, but few of them are free.
Writing robot
This feature is a real boon for newcomers to self-media. Want to write but lack the skills? This tool can help you draft the article. Highly recommended.
What’s worth writing about
This feature is also very powerful. If you want to write about technology, for example, "What's worth writing about" will suggest the latest hot articles in that category.
I'll recommend only these three features here rather than spend more space on the others; you can explore them yourself. On to today's topic: crawling the hot-list articles recommended by "What's worth writing about" to learn how others run their "we media" blogs.
Crawl target
Website: Newrank editor (edit.newrank.cn)
Tools used
Development tool: PyCharm
Development environment: Python 3.7, Windows 10
Libraries: requests, execjs
Approach and analysis
Today's page is quite practical, and you can pick suitable article content to write about straight from it. The problem is that the page data is encrypted, which cost Latiao some sleep.
First, the usual crawler routine: find the interface that serves the target data. Open the packet capture tool (the browser's developer tools) to check how the data is loaded. The moment the tool opens, the data disappears.
The page apparently detects developer tools, so open the developer tools in a separate window.
Opened separately, it works without trouble. Find the interface that returns the data and confirm the request URL.
Data interface: https://edit.newrank.cn/xdnphb/editor/articleMaterial/searchArticleMaterial
The interface is requested with POST, which means form data has to be sent with it.
The item parameter clearly shows the format, date and type of the data being loaded, and paging is done by date. nonce and xyz are encrypted values; at a glance they look like MD5 hashes. To locate the encryption you can use a global search or an XHR breakpoint; either works as long as it finds where the data is encrypted. Here Latiao uses an XHR breakpoint to find where the data is sent.
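Since paging is done by date, the item value can be generated one day at a time. The following is only a rough sketch: the helper names build_item and items_for_range are made up for illustration, and it assumes start_time is set to the same day as period, as in the sample request captured on 2021-09-18.

```python
import json
from datetime import date, timedelta
from typing import List


def build_item(day: date) -> str:
    """Return the JSON text sent as the `item` form field for one day."""
    item = {
        "type": "lakh",
        "period": "1#" + day.isoformat(),
        "order": "2",
        "extra": "all",
        "ranklist_id": "",
        "weixin_id": "",
        # Assumption: start_time mirrors the period date, as in the captured sample.
        "start_time": day.isoformat(),
    }
    # Compact separators avoid stray spaces, so the same text can be signed and sent.
    return json.dumps(item, separators=(",", ":"), ensure_ascii=False)


def items_for_range(start: date, days: int) -> List[str]:
    """Build one `item` string per day, starting at `start`."""
    return [build_item(start + timedelta(days=i)) for i in range(days)]


items = items_for_range(date(2021, 9, 18), 3)
for item in items:
    print(item)
```

Each string would be sent as the item field of one POST request, together with a fresh nonce and xyz.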
Use the Call Stack panel on the right of the developer tools to trace the execution path. Click through the frames one by one to find where the sent data was generated. The data passed in is h.
Set a breakpoint at that location to see how the data is built. nonce is a random 9-character value generated by the u method.
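A 9-character random value like this is easy to reproduce without running the site's JS. The sketch below is an assumption, not the site's code: it guesses that u picks from the hexadecimal digits, because the sample nonce 6a65cad87 contains only those characters. The name make_nonce is made up for illustration.

```python
import random

# Assumption: the alphabet is the 16 lowercase hex digits, inferred from the sample nonce.
HEX_DIGITS = "0123456789abcdef"


def make_nonce(length: int = 9) -> str:
    """Return a random string of `length` hex digits."""
    return "".join(random.choice(HEX_DIGITS) for _ in range(length))


nonce = make_nonce()
print(nonce)
```

If the real u method uses a different alphabet, only HEX_DIGITS needs to change.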
The xyz value is computed from o followed by &nonce= and the nonce value, where o is the URL path plus the AppKey plus the item value. The string that gets encrypted looks like this:
/xdnphb/editor/articleMaterial/searchArticleMaterial?AppKey=joker&item={"type":"lakh","period":"1#2021-09-18","order":"2","extra":"all","ranklist_id":"","weixin_id":"","start_time":"2021-09-18"}&nonce=6a65cad87
The xyz encryption code is rather long.
Rather than painstakingly extracting the JS piece by piece, take the simple route: copy the JS code to a local file, bring over the encryption function (the whole t function), and try running it yourself.
That settles the encryption and its rules. Now tie it together in Python: send a request to the target URL, fetch the data and save it. If you are comfortable with JS you can extract just the code you need from the source; the t function alone is enough.
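If the t function turns out to be a plain MD5 hex digest, which the analysis above only guesses at, the signing step could be done without a JS runtime. Treat the following as a sketch under that assumption; build_sign_string and md5_hex are illustrative names, and the result should be compared against the value the JS produces before relying on it.

```python
import hashlib

PATH = "/xdnphb/editor/articleMaterial/searchArticleMaterial"
APP_KEY = "joker"


def build_sign_string(item: str, nonce: str) -> str:
    """Concatenate path, AppKey, item and nonce in the order seen at the breakpoint."""
    return f"{PATH}?AppKey={APP_KEY}&item={item}&nonce={nonce}"


def md5_hex(text: str) -> str:
    """Assumption: xyz is the MD5 hex digest of the UTF-8 encoded string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


item = (
    '{"type":"lakh","period":"1#2021-09-18","order":"2","extra":"all",'
    '"ranklist_id":"","weixin_id":"","start_time":"2021-09-18"}'
)
sign_string = build_sign_string(item, "6a65cad87")
xyz = md5_hex(sign_string)
print(sign_string)
print(xyz)
```

The item text passed here has to match the item form field byte for byte, otherwise the server-side check would presumably fail.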
Results
Source code
```python
import csv

import execjs
import requests

# nonce.js is the site's JS copied locally: u() generates the nonce, t() computes xyz
with open('nonce.js', encoding='utf-8') as js_file:
    js = execjs.compile(js_file.read())

nonce = js.call('u')
date = input('Please enter the date you need (2021-07-19): ')

# The item text must be identical in the signed string and in the POST body
item = ('{"type":"lakh","period":"1#%s","order":"2","extra":"all",'
        '"ranklist_id":"","weixin_id":"","start_time":"2021-09-04"}' % date)

xyz_code = ('/xdnphb/editor/articleMaterial/searchArticleMaterial'
            '?AppKey=joker&item=%s&nonce=%s' % (item, nonce))
print(xyz_code)
xyz = js.call('t', xyz_code)
print(xyz)

url = "https://edit.newrank.cn/xdnphb/editor/articleMaterial/searchArticleMaterial"
data = {'item': item, 'nonce': nonce, 'xyz': xyz}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
    'referer': 'https://edit.newrank.cn/?module=article',
}

# Send the signed form data and decode the JSON response
response = requests.post(url, headers=headers, data=data).json()
datas = response['value']['datas']
# print(response)

fieldnames = ['summary', 'publicTime', 'originalFlag', 'author', 'orderNum',
              'likeCount', 'clicksCount', 'downloadStatus', 'title', 'type', 'url']

# The output file name is arbitrary; rows are appended on every run
with open('newrank.csv', 'a', newline='', encoding='utf-8') as f:
    csv_data = csv.DictWriter(f, fieldnames=fieldnames)
    for row in datas:
        # print(row)
        csv_data.writerow(row)
```
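One thing to watch when saving: csv.DictWriter raises ValueError if a row contains a key that is not in fieldnames, so the script would stop if the interface ever returned an extra field. A slightly more defensive variant is sketched below. The sample row is invented purely to show the behavior, and write_rows is an illustrative name.

```python
import csv
import io

FIELDS = ['summary', 'publicTime', 'originalFlag', 'author', 'orderNum',
          'likeCount', 'clicksCount', 'downloadStatus', 'title', 'type', 'url']


def write_rows(f, rows):
    """Write a header plus rows, silently dropping keys not listed in FIELDS."""
    writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data: one known field pair plus one unexpected key.
sample_rows = [{'title': 'Example title', 'author': 'Example author', 'unexpected': 1}]

buffer = io.StringIO()
write_rows(buffer, sample_rows)
print(buffer.getvalue())
```

Missing fields are written as empty cells, and the header row makes the file easier to open in a spreadsheet. When appending to an existing file, the header should only be written once.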