Crawl all articles of a WeChat public account in 50 lines of code


Today’s goal

**Crawl all articles of a WeChat public account in 50 lines of code**

Today we are going to crawl a WeChat public account. There are two common ways to do this. One is to go through Sogou search, whose drawback is that it only returns the ten most recently pushed articles. Today we introduce the other way: capturing the traffic of the PC WeChat client to get the account's articles, which is more convenient than other methods.

Analysis: by capturing packets, we found that every time the article list is pulled down to refresh, a request is sent to the mp.weixin.qq.com/mp/xxx interface (public accounts do not allow homepage links to be added, so xxx stands for profile_ext). After many tests and analyses, the following parameters are used:

__biz: the unique ID that ties the user to the public account

uin: the user's private ID

key: the secret key of the request, which becomes invalid after a period of time

offset: the offset into the article list

count: the number of articles requested per call
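As a rough sketch of how these parameters can be pulled out of a captured request, the snippet below splits a URL's query string using only the standard library. It assumes you have copied the full request URL from your packet-capture tool. The URL shown is a made-up placeholder with fake values, and `extract_params` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlsplit, parse_qs

# Parameter names described above.
WANTED = ("__biz", "uin", "key", "offset", "count")


def extract_params(captured_url):
    """Return the crawl parameters found in a captured request URL."""
    query = parse_qs(urlsplit(captured_url).query)
    # parse_qs maps each name to a list of values; keep the first one.
    return {name: query[name][0] for name in WANTED if name in query}


# Placeholder URL for illustration only; none of these values are real.
sample = (
    "http://mp.weixin.qq.com/mp/xxx?action=getmsg"
    "&__biz=EXAMPLEBIZ%3D%3D&uin=EXAMPLEUIN&key=examplekey"
    "&offset=10&count=10&f=json"
)
params = extract_params(sample)
print(params)
```

Percent-encoded characters such as `%3D` are decoded by `parse_qs`, so a `__biz` value ending in `==` comes back in the form the request expects.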

**Code implementation**

```python
import requests
import json
import time
from pymongo import MongoClient

# Public accounts do not allow homepage links to be added; xxx stands for profile_ext
url = 'http://mp.weixin.qq.com/mp/xxx'

# Mongo configuration
conn = MongoClient('127.0.0.1', 27017)
db = conn.wx  # Connect to the wx database; it is created automatically if missing
mongo_wx = db.article  # Use the article collection; it is created automatically if missing


def get_wx_article(biz, uin, key, index=0, count=10):
    # Note: with index=0 this starts at offset 10; use index * count
    # if the first page should be included.
    offset = (index + 1) * count
    params = {
        '__biz': biz,
        'uin': uin,
        'key': key,
        'offset': offset,
        'count': count,
        'action': 'getmsg',
        'f': 'json'  # request a JSON response
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
    }
    response = requests.get(url=url, params=params, headers=headers)
    resp_json = response.json()
    if resp_json.get('errmsg') == 'ok':  # 'ok' marks a successful response
        # Whether there is more paging data; this determines the return value
        can_msg_continue = resp_json['can_msg_continue']
        # Number of articles on the current page
        msg_count = resp_json['msg_count']
        # general_msg_list is a JSON string nested inside the JSON response
        general_msg_list = json.loads(resp_json['general_msg_list'])
        msg_list = general_msg_list.get('list')
        print(msg_list, "**************")
        for msg in msg_list:
            app_msg_ext_info = msg.get('app_msg_ext_info')
            if not app_msg_ext_info:
                # Skip entries that carry no article data
                continue
            # Title
            title = app_msg_ext_info['title']
            # Article URL
            content_url = app_msg_ext_info['content_url']
            # Cover image
            cover = app_msg_ext_info['cover']
            # Release time
            publish_time = msg['comm_msg_info']['datetime']
            publish_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(publish_time))
            # insert_one replaces Collection.insert, which PyMongo 4 removed
            mongo_wx.insert_one({
                'title': title,
                'content_url': content_url,
                'cover': cover,
                'datetime': publish_time
            })
        if can_msg_continue == 1:
            return True
        return False
    else:
        print('Failed to get articles...')
        return False


if __name__ == '__main__':
    biz = 'Mzg4MTA2Nzg0NA=='
    uin = 'NDIyMTI5NDM1'
    key = '20a680e825f03f1e7f38f326772e54e7dc0fd02ffba17e92730ba3f0a0329c5ed310b0bd55b3c0b1f122e5896c6261df2eaea4036ab5a5d32dbdbcb0a638f5f3605cf1821decf486bb6eb4d92d36c620'
    index = 0
    while True:
        print(f'Start crawling page {index + 1} of the public account articles.')
        flag = get_wx_article(biz, uin, key, index=index)
        # Pause for 8 seconds to avoid being blocked
        time.sleep(8)
        index += 1
        if not flag:
            print('All public account articles have been crawled, exiting the program.')
            break
        print(f'Ready to crawl page {index + 1} of the public account articles.')
```
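The parsing step can be tried on its own, without network access or MongoDB. The following is a minimal sketch, assuming the response has the shape read by the code above, where `general_msg_list` is a JSON string nested inside the JSON response. The sample payload is invented for illustration, and `parse_articles` is a hypothetical helper, not part of any library.

```python
import json
import time


def parse_articles(resp_json):
    """Return (articles, has_more) for one page of the response."""
    # general_msg_list is itself a JSON-encoded string, so decode it again.
    msg_list = json.loads(resp_json["general_msg_list"]).get("list", [])
    articles = []
    for msg in msg_list:
        info = msg.get("app_msg_ext_info")
        if not info:
            # Assumption: some entries carry no article data; skip them.
            continue
        timestamp = msg["comm_msg_info"]["datetime"]
        articles.append({
            "title": info["title"],
            "content_url": info["content_url"],
            "cover": info["cover"],
            "datetime": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp)),
        })
    return articles, resp_json.get("can_msg_continue") == 1


# Invented sample payload, shaped like the fields read above.
sample = {
    "errmsg": "ok",
    "can_msg_continue": 1,
    "msg_count": 2,
    "general_msg_list": json.dumps({"list": [
        {"comm_msg_info": {"datetime": 1560000000},
         "app_msg_ext_info": {"title": "Hello",
                              "content_url": "http://example.com/a",
                              "cover": "http://example.com/a.jpg"}},
        {"comm_msg_info": {"datetime": 1560000100}},
    ]}),
}
articles, has_more = parse_articles(sample)
print(articles, has_more)
```

Keeping the parsing in a pure function like this makes it possible to check the field handling against a saved response before pointing the crawler at the live interface.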


More technical information can be obtained from itheimaGZ
