50 lines of code to crawl all articles from a WeChat official account
December 28, 2023
by Abigail Baker
# Today’s goal
**Crawl all articles from a WeChat official account in 50 lines of code**
Today we are going to crawl a WeChat official account. There are two common ways to do this. One is to go through Sogou search; its drawback is that it only returns the ten most recently pushed articles. Today we introduce the other way: capturing the requests made by the PC WeChat client, which is more convenient than other methods.

Analysis: we found that every time the article list is pulled down to refresh, a request is sent to the mp.weixin.qq.com/mp/xxx interface (the platform does not allow the real link to be posted; xxx stands for profile_ext). After many tests and analyses, the following parameters are used:

- `__biz`: the unique ID between the user and the official account
- `uin`: the user’s private ID
- `key`: the secret key of the request, which becomes invalid after a period of time
- `offset`: the offset
- `count`: the number of items per request
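Since these values come from a captured request, it can help to pull them out of the captured URL programmatically instead of copying them by hand. The following is a minimal sketch, not part of the original code: the helper name `extract_params` and the sample URL with its placeholder values are assumptions made for illustration, and a real captured URL may carry additional parameters.

```python
from urllib.parse import urlsplit, parse_qs


def extract_params(captured_url):
    """Pull the parameters listed above out of a captured request URL."""
    query = parse_qs(urlsplit(captured_url).query)
    wanted = ('__biz', 'uin', 'key', 'offset', 'count')
    # parse_qs returns a list per parameter; keep the first value of each
    return {name: query[name][0] for name in wanted if name in query}


# Hypothetical captured URL; every value here is a placeholder
sample = (
    'http://mp.weixin.qq.com/mp/xxx?action=getmsg'
    '&__biz=BIZ_PLACEHOLDER&uin=UIN_PLACEHOLDER&key=KEY_PLACEHOLDER'
    '&offset=10&count=10&f=json'
)
params = extract_params(sample)
print(params)
```

The returned values are strings, so `offset` and `count` would need `int()` before being used in arithmetic.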
*Code implementation*
` ` `
import requests
import json
import time
from pymongo import MongoClient

# The platform does not allow the real link to be posted; xxx stands for profile_ext
url = 'http://mp.weixin.qq.com/mp/xxx'

# Mongo configuration
conn = MongoClient('127.0.0.1', 27017)
db = conn.wx  # Connect to the wx database; it is created automatically if missing
mongo_wx = db.article  # Use the article collection; it is created automatically if missing


def get_wx_article(biz, uin, key, index=0, count=10):
    # Offset of the page to request; with index=0 this starts at offset == count
    offset = (index + 1) * count
    params = {
        '__biz': biz,
        'uin': uin,
        'key': key,
        'offset': offset,
        'count': count,
        'action': 'getmsg',
        'f': 'json'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
    }
    response = requests.get(url=url, params=params, headers=headers)
    resp_json = response.json()
    if resp_json.get('errmsg') == 'ok':
        resp_json = response.json()
        # Whether there is paging data, to determine the return value