Many WeChat official accounts publish high-quality articles, so I wanted to write a crawler that fetches every article from the official accounts I like. Crawling all of an account's articles requires two important parameters: the unique ID of the official account (__biz) and wap_sid2, the article permission value for a single official account. Here is the approach.
Crawling approach:
To crawl a WeChat official account, the first step is to identify it uniquely, which means obtaining its ID value (__biz). Many of the articles I read get __biz mechanically, by copying it by hand. The Sogou search engine is now integrated with WeChat official accounts, which gives us a better way: the page source of an official account on Sogou contains the account's __biz value. Sogou only shows an account's 10 most recent articles, though, so we use Sogou only to get the __biz value, searching for a list of official accounts by any keyword.
The following is the URL of Sogou's WeChat official account search, where query=python is the search keyword; the other parameters can stay unchanged.
http://weixin.sogou.com/weixin?type=1&s_from=input&query=python&ie=utf8&_sug_=n&_sug_type_=
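As a minimal sketch, the search URL above can be assembled from its query parameters. The function name `build_search_url` is made up for illustration; the parameter names and values are copied from the URL shown above.

```python
from urllib.parse import urlencode

SOGOU_SEARCH = "http://weixin.sogou.com/weixin"


def build_search_url(keyword):
    """Return the Sogou official-account search URL for the given keyword."""
    params = {
        "type": 1,            # copied from the URL above
        "s_from": "input",
        "query": keyword,     # the only parameter that changes
        "ie": "utf8",
        "_sug_": "n",
        "_sug_type_": "",
    }
    return SOGOU_SEARCH + "?" + urlencode(params)


print(build_search_url("python"))
```

Only `query` varies between searches, so keeping the remaining parameters in one place makes it easy to loop over several keywords.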
Search results page:
View source code
In the page source, the link to each official account is the href attribute of the a tag under the element whose id is sogou_vr_11002301_box_n (n is an integer such as 1, 2, 3). It can be selected with XPath syntax, with n taken in sequence:
//*[@id="sogou_vr_11002301_box_n"]/div/div[2]/p[1]/a
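Since n simply counts up, the XPath expressions for one results page can be generated in a loop. This is a sketch only: `account_link_xpaths` is a hypothetical helper, and whether the numbering starts at 0 or 1 is an assumption to check against the actual page, which is why `start` is a parameter.

```python
XPATH_TEMPLATE = '//*[@id="sogou_vr_11002301_box_{n}"]/div/div[2]/p[1]/a'


def account_link_xpaths(count, start=0):
    """Build one XPath per search result; `start` is an assumption to verify."""
    return [XPATH_TEMPLATE.format(n=n) for n in range(start, start + count)]


for xpath in account_link_xpaths(3):
    print(xpath)
```

In a Scrapy callback, each expression (with `/@href` appended) could be passed to `response.xpath(...)` to pull out the account link.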
The address of a single public account was obtained as follows:
http://mp.weixin.qq.com/profile?src=3&timestamp=1508003829&ver=1&signature=Eu9LOYSA47p6WE0mojhMtFR-gSr7zsQOYo6*w5VxrUgy7RbCsdkuzfFQ1RiSgM3i9buMZPrYzmOne6mJxCtW*g==
Open the link of a single official account, fetch the page source, and extract the ID value of the official account:
// The biz value is the unique ID of the WeChat official account. Most of the surrounding code is omitted.
// This code sits inside a script tag, which also contains the data for the last 10 articles.
// If you only want the last 10 articles, you can extract them directly with a regular expression.
var biz = "MzIwNDA1OTM4NQ==" || "";
var src = "3";
var ver = "1";
var timestamp = "1508003829";
var signature = "Eu9LOYSA47p6WE0mojhMtFR-gSr7zsQOYo6*w5VxrUgy7RbCsdkuzfFQ1RiSgM3i9buMZPrYzmOne6mJxCtW*g==";
var name = "python6359" || "python";
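One way to pull the biz value out of that script block is a regular expression. The sketch below assumes the page source contains a `var biz = "..."` assignment like the one shown above; `extract_biz` is a hypothetical helper name.

```python
import re

# Matches: var biz = "<value>" (whitespace around "=" is optional)
BIZ_PATTERN = re.compile(r'var\s+biz\s*=\s*"([^"]*)"')


def extract_biz(page_source):
    """Return the __biz value found in the page source, or None if absent."""
    match = BIZ_PATTERN.search(page_source)
    return match.group(1) if match else None


sample = 'var biz = "MzIwNDA1OTM4NQ==" || "";\nvar src = "3";'
print(extract_biz(sample))
```

Returning None instead of raising lets the caller skip accounts whose page did not load as expected, for example when Sogou serves a verification page.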
After obtaining the ID of the official account, we need the wap_sid2 value (the article permission value of a single official account). It is issued to the WeChat client, so we capture it with the Fiddler packet capture tool. If you have not set up a packet capture environment, you can refer to "Fiddler to capture mobike data packets".
URL for obtaining the article permission value of an official account:
GET /mp/profile_ext?action=home&__biz=MjM5MDI1ODUyMA==&scene=124&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&a8scene=3&fontScale=100&pass_ticket=ji%2B3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo%2Bgun%2By2V3lxc34GQy3W5u8mE&wx_header=1 HTTP/1.1
The corresponding request headers are shown below. X-WECHAT-KEY changes at intervals, so it must be refreshed periodically; X-WECHAT-UIN can remain unchanged, and pass_ticket also stays valid for a period of time:
'Host': 'mp.weixin.qq.com',
# 'X-WECHAT-KEY': 'a83687cde3ca46be517cdbcba60732159f229a03507e9afa1e0dfee00e3cf00562aee022e84b9011924fdbb0c7af8c647c33b1338b11ebdc8893d5df41dd34a536e1af5b48d15c87b4aef629ad8685f3',
'X-WECHAT-KEY': '33c1fdebcfc1d1ecd9df5003dc9d9ccb6a1f5458eb704e58a05e80c73e8793dede6b52115a74a515d4d12c9a6f2d8f00238afe17cca3635d80d661a612a4a0bf48a2547516b12030efd8a224548636d2',
'X-WECHAT-UIN': 'MTU2MzIxNjQwMQ%3D%3D',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
'Accept-Language': 'zh-cn',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
'Cookie': 'wxuin=1563216401; pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU; ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=; pgv_pvid=7103943278; sd_cookie_crttime=1501115135519; sd_userid=8661501115135519; 3g_guest_id=-8872936809911279616; tvfe_boss_uuid=8ed9ed1b3a838836; mobileUV=1_15c8d374ca8_da9c8; pgv_pvi=8005854208',
'Referer': "https://mp.weixin.qq.com/mp/getmasssendmsg?__biz=MjM5MzI5MTQ1Mg==&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&ascene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"
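Because X-WECHAT-KEY has to be replaced regularly while the other values last longer, it can help to build the headers from a small function so the rotating value lives in one place. This is only a sketch: `build_headers` and its parameter names are made up for illustration, and the dictionary is trimmed to the headers discussed above.

```python
def build_headers(wechat_key, wechat_uin, cookie):
    """Return request headers; pass a freshly captured X-WECHAT-KEY each time."""
    return {
        "Host": "mp.weixin.qq.com",
        "X-WECHAT-KEY": wechat_key,   # changes at intervals, re-capture with Fiddler
        "X-WECHAT-UIN": wechat_uin,   # can stay unchanged
        "Cookie": cookie,             # includes pass_ticket, valid for a while
    }


headers = build_headers("fresh-key-from-fiddler", "MTU2MzIxNjQwMQ%3D%3D", "wxuin=1563216401")
print(headers["X-WECHAT-KEY"])
```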
The response headers return wap_sid2 in a Set-Cookie header:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-Control: no-cache, must-revalidate
Strict-Transport-Security: max-age=15552000
Set-Cookie: wxuin=1563216401; Path=/; HttpOnly
Set-Cookie: pass_ticket=ji+3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo+gun+y2V3lxc34GQy3W5u8mE; Path=/; HttpOnly
Set-Cookie: wap_sid2=CJGUs+kFElxER01KN1ZkVElJMUdhTktDUUk2LUZHNkFwT1Rzc1EwUWpWaW5ZMHlFQi15cUo1VWFjamNLM3pjdzNCbDc2ZFZpOW0xeDdPb0czWXNuQUdmbVdyOFZiNTREQUFBfjC+7YvPBTgMQJRO; Path=/; HttpOnly
Connection: keep-alive
Content-Length: 37211
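The response carries several Set-Cookie headers, so the wap_sid2 value has to be picked out by name rather than by position. A minimal sketch, assuming the Set-Cookie values are already available as a list of strings; `cookie_from_set_cookie` is a hypothetical helper and the sample values are placeholders.

```python
def cookie_from_set_cookie(set_cookie_values, name):
    """Return the value of cookie `name` from a list of Set-Cookie header values."""
    prefix = name + "="
    for header in set_cookie_values:
        # The cookie pair comes first; attributes such as Path follow after ";"
        first_pair = header.split(";", 1)[0].strip()
        if first_pair.startswith(prefix):
            return first_pair[len(prefix):]
    return None


sample = [
    "wxuin=1563216401; Path=/; HttpOnly",
    "wap_sid2=EXAMPLE_VALUE; Path=/; HttpOnly",
]
print(cookie_from_set_cookie(sample, "wap_sid2"))
```

Slicing off the prefix instead of splitting on every "=" keeps values intact when they contain "=" themselves.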
Get the official account list data
Obtaining the wap_sid2 permission value
With the official account ID (__biz) and the permission value wap_sid2, we can construct the request for the article list. The spider below reads the official account IDs from MongoDB, requests the wap_sid2 value for each ID, and then stores the ID and wap_sid2 together in the database.
# -*- coding: utf-8 -*-
from scrapy import Spider, Request
from .mongo import MongoOperate
import re
from wechatSpider.items import GetsessionspiderItem
from .settings import *


class GetsessionSpider(Spider):
    name = "getSession"
    allowed_domains = ["mp.weixin.qq.com"]
    start_urls = ['https://mp.weixin.qq.com/']
    headers = {
        'Host': 'mp.weixin.qq.com',
        # 'X-WECHAT-KEY': 'a83687cde3ca46be517cdbcba60732159f229a03507e9afa1e0dfee00e3cf00562aee022e84b9011924fdbb0c7af8c647c33b1338b11ebdc8893d5df41dd34a536e1af5b48d15c87b4aef629ad8685f3',
        'X-WECHAT-KEY': '33c1fdebcfc1d1ecd9df5003dc9d9ccb6a1f5458eb704e58a05e80c73e8793dede6b52115a74a515d4d12c9a6f2d8f00238afe17cca3635d80d661a612a4a0bf48a2547516b12030efd8a224548636d2',
        'X-WECHAT-UIN': 'MTU2MzIxNjQwMQ%3D%3D',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
        'Accept-Language': 'zh-cn',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Cookie': 'wxuin=1563216401; pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU; ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=; pgv_pvid=7103943278; sd_cookie_crttime=1501115135519; sd_userid=8661501115135519; 3g_guest_id=-8872936809911279616; tvfe_boss_uuid=8ed9ed1b3a838836; mobileUV=1_15c8d374ca8_da9c8; pgv_pvi=8005854208',
        'Referer': "https://mp.weixin.qq.com/mp/getmasssendmsg?__biz=MjM5MzI5MTQ1Mg==&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&ascene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"
    }
    # URL used to obtain wap_sid2
    url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={biz}&scene=124&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&a8scene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"

    def start_requests(self):
        # Read the stored official account IDs from MongoDB
        MongoObj = MongoOperate(MONGO_URI, MONGO_DATABASE, MONGO_USER, MONGO_PASS, WECHATID)
        MongoObj.connect()
        items = MongoObj.finddata()
        for item in items:
            biz = item["wechatID"]
            # Requests go through the local Fiddler proxy (127.0.0.1:8888)
            yield Request(
                url=self.url.format(biz=biz),
                dont_filter=True,
                headers=self.headers,
                callback=self.parse,
                meta={"proxy": "http://127.0.0.1:8888", "biz": biz},
            )

    def parse(self, response):
        item = GetsessionspiderItem()
        # The response carries several Set-Cookie headers; pick wap_sid2 by name
        wap_sid2 = None
        for raw in response.headers.getlist("Set-Cookie"):
            pair = raw.decode("utf-8").split('; ')[0]
            if pair.startswith("wap_sid2="):
                wap_sid2 = pair.split('=', 1)[1]
        if wap_sid2 is None:
            return
        print(wap_sid2)
        item["biz"] = response.request.meta["biz"]
        item["wap_sid2"] = str(wap_sid2)
        yield item
        # print(item)
Get the article list data
The ID of each official account and its wap_sid2 value are saved in MongoDB. We then build the article request from them, that is, the URL of the official account's article list.
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from .mongo import MongoOperate
import json
from .settings import *


class DataSpider(scrapy.Spider):
    name = "data"
    allowed_domains = ["mp.weixin.qq.com"]
    start_urls = ['https://mp.weixin.qq.com/']
    count = 10  # page size, matching count=10 in the URL
    url = "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={biz}&f=json&offset={index}&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket=ULeI%2BILkTLA2IpuIDqbIla4jG6zBTm1jj75UIZCgIUAFzOX29YQeTm5UKYuXU6JY&wxtoken=&appmsg_token=925_%252B4oEmoVo6AFzfOotcwPrPnBvKbEdnLNzg5mK8Q~~&x5=0&f=json"

    def start_requests(self):
        # Read the stored biz / wap_sid2 pairs from MongoDB
        MongoObj = MongoOperate(MONGO_URI, MONGO_DATABASE, MONGO_USER, MONGO_PASS, RESPONSE)
        MongoObj.connect()
        items = MongoObj.finddata()
        for item in items:
            headers = {
                'Accept-Encoding': 'gzip, deflate',
                'Connection': 'keep-alive',
                'Accept': '*/*',
                'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
                'Accept-Language': 'zh-cn',
                'X-Requested-With': 'XMLHttpRequest',
                'X-WECHAT-KEY': '62526065241838a5d44f7e7e14d5ffa3e87f079dc50a66e615fe9b6169c8fdde0f7b9f36f3897212092d73a3a223ffd21514b690dd8503b774918d8e86dfabbf46d1aedb66a2c7d29b8cc4f017eadee6',
                'X-WECHAT-UIN': 'MTU2MzIxNjQwMQ%3D%3D',
                'Cookie': '; wxuin=1563216401; pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU; ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=; pgv_pvid=7103943278; sd_cookie_crttime=1501115135519; sd_userid=8661501115135519; 3g_guest_id=-8872936809911279616; tvfe_boss_uuid=8ed9ed1b3a838836; mobileUV=1_15c8d374ca8_da9c8; pgv_pvi=8005854208'
            }
            biz = item["biz"]
            # The main check is wap_sid2; it does not matter if pass_ticket differs
            headers["Cookie"] = "wap_sid2=" + item["wap_sid2"] + headers["Cookie"]
            yield Request(
                url=self.url.format(biz=biz, index="10"),
                headers=headers,
                callback=self.parse,
                dont_filter=True,
                meta={"biz": biz, "headers": headers, "offset": 10},
            )

    def parse(self, response):
        biz = response.request.meta["biz"]
        headers = response.request.meta["headers"]
        offset = response.request.meta["offset"]
        resText = json.loads(response.text)
        print(resText)
        # general_msg_list is itself a JSON string, so it is decoded a second time
        msg_list = json.loads(resText["general_msg_list"])
        print(msg_list)
        yield msg_list
        if resText["can_msg_continue"] == 1:
            # Keep the offset per request so several accounts do not share one counter
            next_offset = offset + self.count
            yield Request(
                url=self.url.format(biz=biz, index=str(next_offset)),
                headers=headers,
                callback=self.parse,
                dont_filter=True,
                meta={"biz": biz, "headers": headers, "offset": next_offset},
            )
        else:
            print("end")
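The paging logic of the spider can be tried without a network connection. The sketch below assumes only what the spider relies on: the response is JSON, `general_msg_list` is a JSON string nested inside it, and `can_msg_continue == 1` means another page exists. The function name `parse_article_page` and the sample payload are invented for illustration.

```python
import json


def parse_article_page(response_text, offset, page_size=10):
    """Decode one list page and return (messages, next_offset or None)."""
    payload = json.loads(response_text)
    # general_msg_list is a JSON document stored as a string, so decode it again
    messages = json.loads(payload["general_msg_list"])
    has_more = payload["can_msg_continue"] == 1
    next_offset = offset + page_size if has_more else None
    return messages, next_offset


# Invented sample payload; real responses contain more fields
sample = json.dumps({
    "can_msg_continue": 1,
    "general_msg_list": json.dumps({"list": [{"placeholder": "entry"}]}),
})
messages, next_offset = parse_article_page(sample, offset=10)
print(messages, next_offset)
```

Passing the offset in and out of the function, rather than storing it on the spider, keeps the paging state separate for each official account.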
The obtained data is shown in the figure below:
Another way to obtain wap_sid2
During crawling, you sometimes want the response headers of a redirected page after capturing packets, but the cookies in the response headers have been set to read-only. If we want to obtain the permission value in that case, we can set a Fiddler rule that saves the response to a file. While crawling WeChat articles I also tried to get the permission value this way, but I could not, because I had overlooked the request headers X-WECHAT-KEY and X-WECHAT-UIN. So this approach is not needed in this project. It is still offered here as a way to capture the response headers of a page that sets cookies dynamically and then redirects to a new page, such as when requesting mp.weixin.qq.com/mp/profile_…
Add the following code to Fiddler's rules; it generates a 2.txt file on your desktop that holds the returned response headers:
static function OnBeforeResponse(oSession: Session) {
    if (oSession.HostnameIs("mp.weixin.qq.com") && oSession.uriContains("/mp/profile_ext?action=home")) {
        oSession["ui-color"] = "orange";
        oSession.SaveResponse("C:\\Users\\Administrator\\Desktop\\2.txt", false);
        //oSession.SaveResponseBody("C:\\Users\\Administrator\\Desktop\\1.txt")
    }
    if (m_Hide304s && oSession.responseCode == 304) {
        oSession["ui-hide"] = "true";
    }
}
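Once the response has been saved, the Set-Cookie headers can be read back from the file. This is a sketch under an assumption: the saved file is taken to hold the raw response, that is a status line, the header lines, a blank line, and then the body. `set_cookies_from_raw_response` is a hypothetical helper and the sample text is a placeholder.

```python
def set_cookies_from_raw_response(raw_text):
    """Collect Set-Cookie header values from a raw HTTP response dump."""
    normalized = raw_text.replace("\r\n", "\n")
    head = normalized.split("\n\n", 1)[0]      # everything before the body
    values = []
    for line in head.split("\n")[1:]:          # skip the status line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "set-cookie":
            values.append(value.strip())
    return values


sample_dump = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Set-Cookie: wxuin=1563216401; Path=/; HttpOnly\r\n"
    "Set-Cookie: wap_sid2=EXAMPLE_VALUE; Path=/; HttpOnly\r\n"
    "\r\n"
    "<html>Set-Cookie: not-a-header</html>"
)
cookies = set_cookies_from_raw_response(sample_dump)
wap = next(c for c in cookies if c.startswith("wap_sid2="))
print(wap.split(";", 1)[0])
```

When reading the real file, opening it with `open(path, encoding="utf-8", errors="replace")` avoids decode errors if the body is not plain text.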
GitHub address: github.com/Harhao/wech…
Reference article: WeChat client official account crawler