I have always found it inconvenient to read WeChat official accounts on my phone. For the higher-quality accounts it is hard to settle down and read carefully on a phone; after a while something else interrupts you, or you drift off to short videos. There is a desktop version of WeChat, but having to log in every time feels inconvenient.

What I really want is to read WeChat official accounts on my computer, aggregated through RSS. RSS was quite popular a few years ago, but nowadays every major platform builds its own walled garden, using recommendation algorithms to wrap users in information cocoons. RSS has gradually declined, and high-quality public RSS feeds are fewer and fewer. Fortunately, there is an open-source RSS-building tool, Huginn, that we can use. Most of the articles on Baidu about turning WeChat official accounts into RSS feeds are outdated and no longer work if you follow their steps, so in this article I record my process and reasoning for building an RSS feed of WeChat official accounts.

Huginn website: github.com/huginn/hugi…

The RSS reader I use is QuiteRSS: quiterss.org/ Different desktop RSS readers parse the XML from RSS feeds differently; I tested QuiteRSS and confirmed it is compatible with Huginn's RSS output.

1 Prerequisites

You need a cloud server of your own. Mine is a 1-core, 2 GiB Alibaba Cloud server.

2 Install Huginn

For Huginn installation you can refer to this article: www.cnblogs.com/liujiangblo… I recommend installing with Docker, which saves a lot of trouble.

3 Difficulties in making an RSS feed for WeChat official accounts

Implementing your own official account RSS feed involves the following problems:

  1. How do we obtain article data from a WeChat official account?
  2. How do we collect that data?
  3. How do we process the collected data and convert it into an RSS feed?

3.1 How to obtain article data of a WeChat official account?

As far as I know, there are two ways to obtain official account article data:

  1. You can check an official account's latest article through Sogou WeChat Search: weixin.sogou.com/. Clicking the latest article redirects you to the article page on WeChat. This way you can get the latest article, but only one article at a time.

  2. You can see all of an account's articles in the WeChat client. I suppose a Python script could simulate mouse clicks to obtain the article URLs, but the technical difficulty is too high.

The first method meets my needs, so that is the one I use for collection.

3.2 How to collect the data?

This is where Huginn comes in. The second half of this article, www.cnblogs.com/liujiangblo…, is a Huginn getting-started tutorial you can refer to. In short, Huginn is a tool that can collect and process web information. (For the rest of this article, I assume the reader has mastered the basics of Huginn.)

3.3 How to process the collected data and convert it into an RSS feed?

Huginn has the ability to convert collected data into RSS feeds.

4 Production Process

Every problem has a solution, so let's get started. The finished Huginn Scenario is shown below; I will go through its agents one by one.

4.1 Capturing official accounts: analyzing the Sogou WeChat page

Huginn collects data by grabbing page elements, so we first need to analyze the page and locate the elements we want with CSS selectors. I use the "why" official account as the example. I need three pieces of information: the author, the article title, and the article URL.

You can see that the uigs attribute can locate the desired information, so the CSS selector is a[uigs=account_article_0]. You can also get a selector by right-clicking the <a> element in the browser devtools -> Copy -> Copy selector.

CSS selector reference: www.runoob.com/cssref/css-…
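To make concrete what `a[uigs=account_article_0]` actually matches, here is a small standard-library sketch. The HTML snippet is a simplified stand-in for the Sogou result page, not its real markup; only the `uigs` attributes mirror what the selectors key on.

```python
from html.parser import HTMLParser

# Simplified stand-in for the Sogou result page; the real markup differs,
# but the uigs attributes are what our CSS selectors key on.
SAMPLE_HTML = """
<div>
  <a uigs="account_name_0" href="/profile">why</a>
  <a uigs="account_article_0" href="/link?url=abc123">Latest article title</a>
</div>
"""

class UigsExtractor(HTMLParser):
    """Collects the text and href of <a> tags, keyed by their uigs attribute."""
    def __init__(self):
        super().__init__()
        self.results = {}     # uigs value -> {"href": ..., "text": ...}
        self._current = None  # uigs value of the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        uigs = attrs.get("uigs")
        if uigs:
            self._current = uigs
            self.results[uigs] = {"href": attrs.get("href"), "text": ""}

    def handle_data(self, data):
        if self._current:
            self.results[self._current]["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None

parser = UigsExtractor()
parser.feed(SAMPLE_HTML)
print(parser.results["account_article_0"])
```

This mimics what Huginn's `"css": "a[uigs=account_article_0]"` plus `"value": "text()"` / `"value": "@href"` pair does for us.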

4.2 Writing the page-fetching Agent

Having located the element in the previous section, we can start writing the Huginn Agent. Create a new WebsiteAgent (a WebsiteAgent grabs web page information; Huginn has many other agent types, which you can find in the official documentation).

The options of this Agent are written as follows (if you paste mine, you need to remove the comments, since JSON does not allow them).

{
  "expected_update_period_in_days": "2",
  // The list of official accounts to follow; using an array lets you grab several accounts at once
  "url": [
    "https://weixin.sogou.com/weixin?type=1&s_from=input&query=hello_hi_why&ie=utf8&_sug_=n&_sug_type_=",
    "https://weixin.sogou.com/weixin?type=1&s_from=input&query=pangmenzd&ie=utf8&_sug_=n&_sug_type_="
  ],
  "type": "html",
  "mode": "on_change",
  "extract": {
    // The targets we want to extract; title, urlcontent, author are custom field names
    "title": {
      // Huginn locates elements with CSS selectors
      "css": "a[uigs=account_article_0]",
      // text() means: take the text inside the tag
      "value": "text()"
    },
    "urlcontent": {
      "css": "a[uigs=account_article_0]",
      "value": "@href"
    },
    "author": {
      "css": "a[uigs=account_name_0]",
      "value": "text()"
    }
  },
  // template is a way to reprocess the collected data
  "template": {
    // Sometimes the href doesn't include the domain; to_uri with _response_.url makes it absolute
    "urlcontent": "{{ urlcontent | to_uri: _response_.url }}",
    "url": "{{ url | to_uri: _response_.url }}"
  }
}

Once you’ve written it, you can test it by clicking Dry Run. My test results are as follows:

You can see that all three pieces of information have been extracted.

4.3 RSS scheme 1

In fact, at this point, if we are not aiming for perfection, we can already publish an RSS feed through Huginn's DataOutputAgent; each item in the feed is just an <a> tag.

Open the feed in the RSS reader and click a link to jump to the article page.

After the jump, the result is as follows (sometimes there is a human-verification step; entering the captcha once, rather than every time, is enough).

It looks pretty good, and I was happy with it when I built it. The jump isn't pretty, but as with products, a small workaround can sometimes save a lot of code changes. It wasn't until the next day that I discovered a huge flaw in this approach: the urlcontent links expire, after perhaps a dozen hours. If you don't get around to reading today's links promptly, by the next day they all lead to expired pages. This was unacceptable, so I rejected RSS scheme 1.

4.4 RSS scheme 2

After rejecting scheme 1, I realized that since the URL expires, the only way to meet my needs was to grab the article content itself and store it on the server. Through my research I found Selenium, a tool that retrieves web information by mimicking user actions. I tried it locally with Chrome and it worked. So my idea for scheme 2 is: Huginn collects urlcontent and then calls a Selenium service set up on the server; Selenium captures the page content, which is then stored and output as an RSS feed.

I encountered the following problems with this scheme:

  1. How do I run Chrome on Linux without a graphical user interface?

I did some research and learned about Xvfb, which simulates a graphical display on Linux. Installation is simple:

yum install Xvfb -y 
yum install xorg-x11-fonts* -y

After installing Xvfb, you also need to install Chrome, the matching ChromeDriver, and Python 3.10 (the Python version should not be too old; my initial attempt with 3.5 failed).

For details about the installation process, see blog.csdn.net/weixin_3086…

  2. How do I expose the Selenium service to Huginn?

It's easy: set up a RESTful service on the server that both Huginn and the Selenium code talk to, transferring JSON over HTTP. Python has the FastAPI framework, which is quite simple, so I used it. FastAPI is deployed with Uvicorn and Gunicorn, and exposed externally through an Nginx reverse proxy.
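The service contract here is nothing more than JSON over an HTTP POST. To show the exchange without any of the FastAPI/Uvicorn/Nginx machinery, here is a standard-library-only sketch: the `/get_html` path and the `url`/`urlcontent` field names follow the PostAgent payload used later in this article, while the response body is a placeholder for what the real Selenium fetch would return.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class GetHtmlHandler(BaseHTTPRequestHandler):
    """Accepts the {"url": ..., "urlcontent": ...} JSON that Huginn POSTs and
    answers with JSON. The real service fetches the page with Selenium; this
    placeholder just echoes the target URL back."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        req = json.loads(self.rfile.read(length))
        # Placeholder for the Selenium scrape: text + image URL list.
        body = json.dumps(
            {"data": {"text": "fetched " + req["urlcontent"], "imgs": []}}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), GetHtmlHandler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# What the Huginn PostAgent would send (hypothetical article links):
payload = json.dumps({
    "url": "https://example.com",
    "urlcontent": "https://example.com/article",
}).encode()
req = Request(
    "http://127.0.0.1:%d/get_html" % server.server_address[1],
    data=payload,
    headers={"Content-Type": "application/json"},
)
resp = json.loads(urlopen(req).read())
server.shutdown()
print(resp["data"]["text"])  # fetched https://example.com/article
```

In production this request comes from Huginn and the handler body is the Selenium logic described below; FastAPI just makes the routing and validation nicer.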

4.5 Writing Python services

My Python service code is at gitee.com/kagami1/hug… I learned Python on the fly, so the code is a mess. Points to note:

1. Start Chrome in no-sandbox mode, otherwise you get "DevToolsActivePort file doesn't exist":

chrome_options.add_argument('--no-sandbox')

2. Do not start in headless mode. Headless Chrome gets detected and redirected to the human-verification page, so the article page cannot be fetched.

3. At first I accessed the urlcontent page directly and was also redirected to the human-verification page. So I collect in a way that more fully mimics user behavior: visit the url page first, then the urlcontent page:

driver.get(req.url) 
driver.get(req.urlcontent)

4. Image information is loaded asynchronously, so wait a short time for the images to finish loading before reading their URLs.
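In the real service this waiting is Selenium's job (an explicit wait on the image elements). The standard-library sketch below just illustrates the poll-until-loaded pattern behind it; `fake_image_loader` is a hypothetical stand-in for "ask the browser whether the article's img src attributes are populated yet".

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or the timeout expires.
    Returns the truthy value, or None on timeout; this mirrors the shape of
    Selenium's explicit waits for asynchronously loaded elements."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    return None

# Hypothetical stand-in for the browser: the first two polls see no
# images, the third sees the loaded image URL.
_calls = {"n": 0}
def fake_image_loader():
    _calls["n"] += 1
    return ["https://example.com/1.png"] if _calls["n"] >= 3 else []

imgs = wait_until(fake_image_loader, timeout=2.0, interval=0.01)
print(imgs)  # ['https://example.com/1.png']
```

A fixed `time.sleep()` also works (and is what many tutorials do), but polling returns as soon as the images are ready instead of always paying the worst-case delay.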

5. Remember to release resources in a finally block. I once failed to release them and the server nearly ground to a halt:

driver.close()
driver.quit() 
display.stop()

6. Install the Python libraries: pyvirtualdisplay, fastapi, pydantic, selenium, uvicorn, and gunicorn.

4.6 Writing the de-duplication Agent

Because we collect every two hours, many pages will inevitably be collected repeatedly, so the events need de-duplication. Create a new DeDuplicationAgent and set its Sources to the fetching Agent from before.

We use title as the de-duplication key. Some readers may wonder why not urlcontent: because urlcontent expires and therefore changes, it cannot uniquely identify an article. Titles can also repeat, but that probability is small enough that I chose to ignore it.
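For reference, the DeDuplicationAgent options end up looking roughly like this (a sketch based on the agent's standard option names; the lookback depth is a matter of taste):

    {
      "property": "{{title}}",
      "lookback": 100,
      "expected_update_period_in_days": "2"
    }

Here `property` is the value used as the de-duplication key (the title, per the discussion above), and `lookback` is how many recent events are remembered for comparison.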

4.7 Fetching the official account content

This is the Agent that talks to the FastAPI service. Its type is PostAgent, whose job is to send POST requests:

{
  "post_url": "http://your-ip:8000/get_html",
  "expected_receive_period_in_days": "1",
  "content_type": "json",
  "method": "post",
  "payload": {
    "url": "{{url}}",
    "urlcontent": "{{urlcontent}}"
  },
  "headers": {},
  "emit_events": "true",
  "no_merge": "true",
  "output_mode": "merge"
}

You can see that the message returned by FastAPI is placed in the body field.

4.8 Parsing the returned JSON

The body field from the previous step is a JSON string, so this Agent is a JsonParseAgent, which parses that string into structured data. Its Sources is the PostAgent from the previous section.

The parsed result:

4.9 Assembling the final official account content

This Agent traverses the imgs from the previous Agent and assembles them into <img> tags, and also assembles a link to the original article (some articles are very long and have many images, so it is better to go to the original if the urlcontent is still valid). This Agent is a JavaScriptAgent, in which you can write JS freely:

Agent.receive = function() {
  var events = this.incomingEvents();
  var imgs = events[0].payload.data.imgs;
  var list = [];
  for (var key in imgs) {
    list.push("<img src='" + imgs[key] + "'/><br/>");
  }
  var text = events[0].payload.data.text.replace(/#/g, "<br/>");
  var url = events[0].payload.url;
  // Anchor text of the original-article link (my choice of wording)
  var atag = "<a href='" + url + "'>Original article</a>";

  this.createEvent({
    'atag': atag,
    'url': url,
    'author': events[0].payload.author,
    'title': events[0].payload.title,
    'text': text,
    'imgs': list
  });
}

The running result of this Agent looks like this:

4.10 Outputting the official account feed

Finally, we have all the information we want and can output the RSS feed. This Agent is a DataOutputAgent, and its options are written like this:

In the title field, I concatenate the author name and article title for easy identification. In the description field, I put the link to the original article at the beginning, followed by the article text and images.
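For readers without the screenshot, options along these lines produce that layout (a sketch using the DataOutputAgent's standard template structure; my actual secret name, feed title, and field order differed in details):

    {
      "secrets": ["wechat-rss"],
      "expected_receive_period_in_days": 2,
      "template": {
        "title": "WeChat official account feed",
        "description": "Articles aggregated from WeChat official accounts",
        "item": {
          "title": "{{author}} - {{title}}",
          "description": "{{atag}}<br/>{{text}}<br/>{{imgs}}",
          "link": "{{url}}"
        }
      }
    }

The feed then becomes reachable at the agent's /users/…/web_requests/… URL with the chosen secret, which is the address you subscribe to in QuiteRSS.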

4.11 Final Effect

The images appear after the article text.

This isn't perfect either, because the text and images should appear interleaved in their original order. But I'm not very good at Python and the Spring Festival holiday was short, so I used Selenium's text property, which fetches all the text in an element at once. If you traversed the element's children instead, you could certainly make the images and text appear in order.

return resp_200(data={
    "imgs": list,
    "text": article.text.replace("\n", "\\n")
})

But for now it's all I need. At least there is a link to the original.

4.12 RSS scheme 3

While browsing the Selenium API, I found that Selenium has an API for saving an element as an image.

That gave me the idea for scheme 3: by simulating scrolling, we could stitch together screenshots of the article page by page. This preserves not only the order of text and images but also the styling of the text. It's more work, but I think it's better. Since my server has only 1 Mbps of bandwidth and loading images would be slow, I didn't pursue scheme 3.

One more point: neither scheme 2 nor scheme 3 can handle video links, which needs to be improved in the future.

5 Final thoughts

I didn't go home during the Spring Festival, so I learned Huginn to pass the idle time. Although it isn't especially useful, I enjoyed the learning process a lot, far more than being a CRUD boy at work every day. I'm not really good at JS, Python, or CSS, but I kept pushing through the things I didn't understand, and finally, little by little, I achieved my goal. It was a lot of fun.

Selenium is widely used in Python crawlers, and I actually think it would also be great for front-end testing. My impression is that most companies' front-end testing is manual black-box testing (grunt work, frankly), with little automation.

The purpose of this article is to show that Huginn can do many things; see the examples on the official website for more.