Python library feedparser + Atom feeds

Background

Recently, while sprucing up my GitHub profile page, I built a few fun things that may interest some of you.

Here is my profile address: github.com/JS-banana. Take a look if you're interested ~

While editing my profile, I had an idea: could I sync my blog updates to my GitHub profile page?

That is, whenever I update my blog, my GitHub profile page automatically shows the latest posts.

It was just a passing idea at the time, and I looked into it later. At first I wanted to write a crawler in NodeJS. That would work, but it has plenty of drawbacks: I could only build a half-finished crawler by myself, and it would not be very reusable, so I ruled it out.

Then I came across Python's feedparser library and thought it was a good fit. (feedparser is the most commonly used RSS library in Python; it makes it easy to get titles, links, and article entries from any RSS or Atom feed.)

I also tried it out and liked the result, so we only need to do two things:

Implement an Atom feed (for the feedparser library to consume)
Implement dynamic updating of the README.md file (update the profile page after fetching the feed)

RSS and Atom feeds

RSS feeds should be familiar to most of us. Many large blogs and popular websites and services offer RSS/Atom feeds. So what is RSS? What is Atom?

What is RSS?

RSS stands for Really Simple Syndication
It lets you syndicate the content of your website
It defines a very simple way to share and view titles and content
RSS files can be updated automatically
It allows views to be personalized for different websites
It is written in XML

Why use RSS?

RSS is designed to display selected data.

Without RSS, users would have to come to your site every day to check for new content, which is too time-consuming for many of them. With an RSS feed (often called a news feed), users can use RSS aggregators (sites or software that collect and categorize RSS feeds) to check your site's updates more quickly.

The future of RSS (the birth of Atom)

Because of copyright issues with RSS 2.0, the future of the protocol became uncertain.

With RSS's future uncertain and the development of the RSS standard showing many problems and shortcomings, Atom appeared. It can be understood as an alternative to RSS.

What is a feed

A feed is essentially the "middleman" between the RSS (or Atom) file and its subscribers, delivering information in bulk. The common feed formats are RSS and Atom, and on the web, feed subscriptions are still better known as RSS or Atom subscriptions.

What is a subscription

Subscribing is similar to subscribing to a newspaper or magazine, except that almost all sites offer their RSS/Atom subscriptions for free (a few "non-mainstream" feeds do charge). A feed only transmits information over the network and generally involves no physical delivery. So if you find a site you like and prefer to read it in an online or offline reader, you can subscribe, and you can unsubscribe at any time.

Summary

RSS and Atom are both XML-based formats. The basic structure is the same, with slight differences in how the nodes are named. All we need to know is that Atom is an improvement on RSS 2.0.
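To make the "same structure, slightly different nodes" point concrete, here is a minimal sketch that reads one post from a tiny RSS 2.0 document and from a tiny Atom document using only the standard library. Both sample feeds, the example.com URLs, and the helper names `rss_posts` and `atom_posts` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal feeds, written only for this comparison.
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first/</link>
      <pubDate>Sat, 28 Aug 2021 16:25:56 GMT</pubDate>
    </item>
  </channel>
</rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <link href="https://example.com/"/>
  <entry>
    <title>First post</title>
    <link href="https://example.com/first/"/>
    <updated>2021-08-28T16:25:56Z</updated>
  </entry>
</feed>"""

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def rss_posts(xml_text):
    """RSS 2.0: posts are <item> elements and the link is the element text."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


def atom_posts(xml_text):
    """Atom: posts are namespaced <entry> elements and the link is an href attribute."""
    root = ET.fromstring(xml_text)
    return [
        (
            entry.findtext("atom:title", namespaces=ATOM_NS),
            entry.find("atom:link", ATOM_NS).get("href"),
        )
        for entry in root.findall("atom:entry", ATOM_NS)
    ]


print(rss_posts(RSS_SAMPLE))
print(atom_posts(ATOM_SAMPLE))
```

Both calls return the same title and link. Only the element names (`item` vs `entry`), the namespace, and the way the link is stored differ, and a library like feedparser hides exactly these differences.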

Generate an Atom feed for your site

Basic structure of an Atom feed

To understand the basic format and syntax of atom.xml, let's look at a simple demo

<?xml version="1.0" encoding="utf-8"?>
<!-- The line above is the header information; the main body starts here -->
<feed xmlns="http://www.w3.org/2005/Atom">
  <!-- Basic information -->
  <title>Small handsome technology blog</title>
  <link href="https://ssscode.com/atom.xml" rel="self"/>
  <link href="https://ssscode.com/"/>
  <updated>2021-08-28 16:25:56</updated>
  <id>https://ssscode.com/</id>
  <author>
    <name>JS-banana</name>
    <email>[email protected]</email>
  </author>

  <!-- Content area -->
  <entry>
    <title>Webpack + React + TypeScript builds a standardized application</title>
    <link href="https://ssscode.com/pages/c3ea73/" />
    <id>https://ssscode.com/pages/c3ea73/</id>
    <published>2021-08-28 16:25:56</published>
    <update>2021-08-28 16:25:56</update>
    <content type="html"></content>
    <summary type="html"></summary>
    <category term="webpack" scheme="https://ssscode.com/categories/?category=JavaScript"/>
  </entry>

  <entry>...</entry>
  ...
</feed>

The basic information part can be customized. Reading on to the end, we find that what we really care about is the content of the <entry>...</entry> tags, that is, the basic information of each blog post ~

Therefore, as long as we follow the specification, format, and syntax, we can generate atom.xml ourselves, nice😎~

If you don't want to write it yourself, you can try the feed package

Write a function to generate the atom.xml file

Since my blog is built on Vuepress (webpack + vue2.x), I'll use NodeJS as the example

I won't go into how all the markdown files are read. Once we have the full list of posts, we do some simple processing. Here I only fill in the data we need; if you want the feed to be read in a feed reader, you can enrich the content further ~

const fs = require('fs');
const path = require('path');
const dayjs = require('dayjs');

const DATA_FORMAT = 'YYYY-MM-DD HH:mm:ss';

// posts holds the information for all blog posts
// "&" in XML text must be replaced with "&amp;", otherwise the file has syntax errors
function toXml(posts) {
  // The XML declaration must be the very first thing in the file
  const feed = `<?xml version="1.0" encoding="utf-8"?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>small handsome の technology blog</title>
    <link href="https://ssscode.com/atom.xml" rel="self"/>
    <link href="https://ssscode.com/"/>
    <updated>${dayjs().format(DATA_FORMAT)}</updated>
    <id>https://ssscode.com/</id>
    <author>
      <name>JS-banana</name>
      <email>[email protected]</email>
    </author>
    ${posts
      .map(item => {
        return `
        <entry>
          <title>${item.title.replace(/&/g, '&amp;')}</title>
          <link href="https://ssscode.com${item.permalink}" />
          <id>https://ssscode.com${item.permalink}</id>
          <published>${item.date.slice(0, 10)}</published>
          <update>${item.date}</update>
        </entry>`;
      })
      .join('\n')}
  </feed>`;

  // Write atom.xml into the current working directory
  fs.writeFile(path.resolve(process.cwd(), './atom.xml'), feed, function(err) {
    if (err) return console.log(err);
    console.log('File write succeeded!');
  });
}

Run this file with Node and it should generate an atom.xml file in the directory you run it from

OK, the Atom feed source is done ~
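For readers who would rather stay in Python, the same idea can be sketched with the standard library. This is not the code my blog uses, only a rough equivalent: the `SITE` placeholder, the `to_atom` name, and the sample post are assumptions, and `xml.sax.saxutils` does the escaping of `&` and friends.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape, quoteattr

SITE = "https://example.com"  # assumption: placeholder site root


def to_atom(posts, updated=None):
    """Render a list of {'title', 'permalink', 'date'} dicts as an Atom document."""
    # Default the feed timestamp to the current UTC time.
    updated = updated or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entries = []
    for post in posts:
        url = SITE + post["permalink"]
        entries.append(
            "  <entry>\n"
            f"    <title>{escape(post['title'])}</title>\n"  # & becomes &amp;
            f"    <link href={quoteattr(url)}/>\n"  # quoteattr adds the quotes
            f"    <id>{escape(url)}</id>\n"
            f"    <updated>{escape(post['date'])}</updated>\n"
            "  </entry>"
        )
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'  # must be the first bytes
        '<feed xmlns="http://www.w3.org/2005/Atom">\n'
        "  <title>Example blog</title>\n"
        f"  <link href={quoteattr(SITE + '/')}/>\n"
        f"  <id>{escape(SITE + '/')}</id>\n"
        f"  <updated>{escape(updated)}</updated>\n"
        + "\n".join(entries)
        + "\n</feed>\n"
    )


sample_posts = [
    {"title": "Webpack & React", "permalink": "/pages/demo/", "date": "2021-08-28T16:25:56Z"},
]
print(to_atom(sample_posts, updated="2021-08-28T16:25:56Z"))
```

Letting a helper do the escaping avoids having to remember every special character by hand, which is the part that is easiest to get wrong when building XML from strings.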

Simple usage of feedparser

Python feedparser (a Node version of feedparser is also available online)

Copy the demo snippet into atom.xml and briefly test the usage. Let's look at the format of the return value; to show the structure more clearly, I have tidied up the output of the Python run

The atom.xml source file

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Small handsome technology blog</title>
  <link href="https://ssscode.com/atom.xml" rel="self"/>
  <link href="https://ssscode.com/"/>
  <updated>2021-08-28 16:25:56</updated>
  <id>https://ssscode.com/</id>
  <author>
    <name>JS-banana</name>
    <email>[email protected]</email>
  </author>
  <entry>
    <title>Webpack + React + TypeScript builds a standardized application</title>
    <link href="https://ssscode.com/pages/c3ea73/" />
    <id>https://ssscode.com/pages/c3ea73/</id>
    <published>2021-08-28 16:25:56</published>
    <update>2021-08-28 16:25:56</update>
  </entry>
</feed>

The main Python script

import feedparser

blog_feed_url = "./atom.xml"

feeds = feedparser.parse(blog_feed_url)

print(feeds)

The general structure of the output is as follows

{
  bozo: 1,
  // entries
  entries: [
    {
      title: "Webpack + React + TypeScript builds a standardized application",
      title_detail: {
        type: "text/plain",
        language: None,
        base: "",
        value: "Webpack + React + TypeScript builds a standardized application",
      },
      links: [{ href: "https://ssscode.com/pages/c3ea73/", rel: "alternate", type: "text/html" }],
      link: "https://ssscode.com/pages/c3ea73/",
      id: "https://ssscode.com/pages/c3ea73/",
      guidislink: False,
      published: "2021-08-28 16:25:56",
      published_parsed: time.struct_time(), // the parsed date value
      update: "2021-08-28 16:25:56",
    },
  ],
  // feed
  feed: {
    title: "Small handsome technology blog",
    title_detail: { type: "text/plain", language: None, base: "", value: "Small handsome technology blog" },
    links: [
      { href: "https://ssscode.com/atom.xml", rel: "self", type: "application/atom+xml" },
      { href: "https://ssscode.com/", rel: "alternate", type: "text/html" },
    ],
    link: "https://ssscode.com/",
    updated: "2021-08-28 16:25:56",
    updated_parsed: time.struct_time(),
    id: "https://ssscode.com/",
    guidislink: False,
    authors: [{ name: "JS-banana", email: "[email protected]" }],
    author_detail: { name: "JS-banana", email: "[email protected]" },
    author: "JS-banana ([email protected])",
  },
  headers: {},
  encoding: "utf-8",
  version: "atom10",
  bozo_exception: SAXParseException("XML or text declaration not at start of entity"),
  namespaces: { "": "http://www.w3.org/2005/Atom" },
}
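A side note on `bozo: 1` in the output: feedparser sets this flag when the feed is not well-formed XML, and `bozo_exception` carries the parser's complaint, even though the entries were still extracted. The message shown here usually means something (even whitespace) comes before the `<?xml ...?>` declaration. The sketch below reproduces that situation with the standard library parser rather than feedparser itself; the `check_well_formed` helper and both byte strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

CLEAN = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom"></feed>'
)
# Same document, but a newline and two spaces come before the declaration.
SHIFTED = b"\n  " + CLEAN


def check_well_formed(xml_bytes):
    """Return (True, '') if the bytes parse as XML, else (False, parser message)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


print(check_well_formed(CLEAN))
print(check_well_formed(SHIFTED))
```

If your own feed reports the same exception, a likely first thing to check is whether the generated file starts with the declaration at byte zero.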

As you can see, we only need to take all the entries, so let's write a function that extracts the content we need

def fetch_blog_entries():
    entries = feedparser.parse(blog_feed_url)["entries"]
    return [
        {
            "title": entry["title"],
            # Drop any "#fragment" from the link
            "url": entry["link"].split("#")[0],
            # Keep only the date part of an ISO timestamp
            "published": entry["published"].split("T")[0],
        }
        for entry in entries
    ]
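To see what the two `split` calls actually do, here is a small sketch that applies the same transformation to plain dictionaries instead of real feedparser entries. The `simplify` name and the sample data are hypothetical.

```python
def simplify(entries):
    """Keep the title, the URL without its fragment, and the date part of 'published'."""
    return [
        {
            "title": entry["title"],
            "url": entry["link"].split("#")[0],
            "published": entry["published"].split("T")[0],
        }
        for entry in entries
    ]


# Hypothetical entries: one ISO-style timestamp, one space-separated timestamp.
sample = [
    {
        "title": "Post A",
        "link": "https://example.com/a/#comments",
        "published": "2021-08-28T16:25:56Z",
    },
    {
        "title": "Post B",
        "link": "https://example.com/b/",
        "published": "2021-08-28 16:25:56",
    },
]

for row in simplify(sample):
    print(row)
```

Note that splitting on "T" only trims ISO-style timestamps. A space-separated value such as "2021-08-28 16:25:56" contains no "T", so it passes through unchanged and the time stays in the output.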

Replace the content of a specified area in the markdown file

The last step: how do we replace only the specified area in our README.md profile file, and then push it to GitHub to complete the update?

### Hello, I'm Xiao Shuai! 👋

...
Other information

<!-- start -->
This displays the blog information
<!-- end -->

As shown above, nothing needs to change except the designated area that is updated

For this, we can use Python to read the file, locate the comments, and do the replacement with the re module

We mark the area in README.md with comments

<!-- blog starts -->
...
<!-- blog ends -->

Code:

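def replace_chunk(content, marker, chunk, inline=False):
    # Match everything from the "starts" comment to the "ends" comment,
    # including newlines (re.DOTALL)
    r = re.compile(
        r"<!-- {} starts -->.*<!-- {} ends -->".format(marker, marker),
        re.DOTALL,
    )
    if not inline:
        chunk = "\n{}\n".format(chunk)
    # Re-add the marker comments so the next run can find the area again
    chunk = "<!-- {} starts -->{}<!-- {} ends -->".format(marker, chunk, marker)
    return r.sub(chunk, content)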
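A quick way to convince yourself that the marker replacement works is to run it on an in-memory string. The sketch below repeats `replace_chunk` so it runs on its own; the sample README text is made up.

```python
import re


def replace_chunk(content, marker, chunk, inline=False):
    """Replace the text between '<!-- marker starts -->' and '<!-- marker ends -->'."""
    pattern = re.compile(
        r"<!-- {} starts -->.*<!-- {} ends -->".format(marker, marker),
        re.DOTALL,
    )
    if not inline:
        chunk = "\n{}\n".format(chunk)
    chunk = "<!-- {} starts -->{}<!-- {} ends -->".format(marker, chunk, marker)
    return pattern.sub(chunk, content)


# Hypothetical README content with one marked area.
readme = (
    "### Hello\n"
    "<!-- blog starts -->\n"
    "old list\n"
    "<!-- blog ends -->\n"
    "footer\n"
)

updated = replace_chunk(readme, "blog", "* new post")
print(updated)
```

Because the markers are written back, running the function again with the same chunk leaves the text unchanged, which is what lets the scheduled job skip the commit when nothing is new. One caveat: `sub` interprets backslashes in the replacement string, so a post title containing a backslash may need `pattern.sub(lambda _: chunk, content)` instead.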

Finally, combining the feed request with reading and writing the file, the complete code is as follows

import feedparser
import json
import pathlib
import re
import os
import datetime

blog_feed_url = "https://ssscode.com/atom.xml"

# Directory that contains this script (and README.md)
root = pathlib.Path(__file__).parent.resolve()


def replace_chunk(content, marker, chunk, inline=False):
    # Match everything between the marker comments, including newlines
    r = re.compile(
        r"<!-- {} starts -->.*<!-- {} ends -->".format(marker, marker),
        re.DOTALL,
    )
    if not inline:
        chunk = "\n{}\n".format(chunk)
    chunk = "<!-- {} starts -->{}<!-- {} ends -->".format(marker, chunk, marker)
    return r.sub(chunk, content)


def fetch_blog_entries():
    entries = feedparser.parse(blog_feed_url)["entries"]
    return [
        {
            "title": entry["title"],
            "url": entry["link"].split("#")[0],
            "published": entry["published"].split("T")[0],
        }
        for entry in entries
    ]


if __name__ == "__main__":
    readme = root / "README.md"
    readme_contents = readme.open(encoding='UTF-8').read()

    # Show only the five most recent posts
    entries = fetch_blog_entries()[:5]
    entries_md = "\n".join(
        ["* <a href='{url}' target='_blank'>{title}</a> - {published}".format(**entry) for entry in entries]
    )
    rewritten = replace_chunk(readme_contents, "blog", entries_md)

    readme.open("w", encoding='UTF-8').write(rewritten)
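The list-building step can also be tried on its own. Here is a minimal sketch of it as a separate function, with one optional addition that is not in the script above: the title is passed through `html.escape` so that characters like `&` or `<` in a post title do not break the HTML link. The `render_entries` name and the sample entry are assumptions.

```python
from html import escape


def render_entries(entries, limit=5):
    """Turn simplified entries into a markdown bullet list of HTML links."""
    lines = [
        "* <a href='{url}' target='_blank'>{title}</a> - {published}".format(
            url=entry["url"],
            title=escape(entry["title"]),  # optional hardening, see note above
            published=entry["published"],
        )
        for entry in entries[:limit]
    ]
    return "\n".join(lines)


# Hypothetical entry in the shape returned by fetch_blog_entries().
sample_entries = [
    {"title": "A & B", "url": "https://example.com/a/", "published": "2021-08-28"},
]
print(render_entries(sample_entries))
```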

I'm not very familiar with Python either, but by following in others' footsteps I can still use it to get the result I want

Recently I've been playing with a few Python scripts and libraries and found them quite interesting. I think Python is worth learning and very helpful in day-to-day work. After all, it is very popular now, and even used purely as a tool it feels powerful

Configure a GitHub Actions scheduled task

The script that implements the feature is done; now we want it to run automatically after we update the blog

Here we use a GitHub Actions scheduled task directly

Add the file .github/workflows/ci.yml to the project

name: Build README

on:
  workflow_dispatch:
  schedule:
    - cron: "30 0 * * *" # Runs at 00:30 UTC every day; add 8 hours for Beijing time

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repo # get the code
        uses: actions/checkout@v2

      - name: Set up Python # Python environment
        uses: actions/setup-python@v2
        with:
          python-version: 3.8

      - uses: actions/cache@v2 # dependency cache
        name: Configure pip caching
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Install Python dependencies # install dependencies
        run: |
          python -m pip install -r requirements.txt

      - name: Update README # run the script
        run: |-
          python build_readme.py
          cat README.md

      - name: Commit and push if changed # git commit
        run: |-
          git diff
          git config --global user.email "[email protected]"
          git config --global user.name "JS-banana"
          git pull
          git add -A
          git commit -m "Updated README content" || exit 0
          git push

Done ~

Take a look at the effect:

The script will run once a day to synchronize information about the blog
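Since GitHub Actions evaluates `schedule` cron expressions in UTC, it is easy to be off by eight hours when thinking in Beijing time. A small sketch of the conversion, using only the standard library (the `utc_cron_to_beijing` helper is hypothetical):

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8), "UTC+8")


def utc_cron_to_beijing(hour, minute):
    """Convert a daily UTC cron time to the matching Beijing wall-clock time."""
    # The date itself is arbitrary; only the time of day matters here.
    utc_time = datetime(2021, 8, 28, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(BEIJING).strftime("%H:%M")


print(utc_cron_to_beijing(0, 30))   # cron "30 0 * * *"
print(utc_cron_to_beijing(16, 30))  # cron "30 16 * * *"
```

So the workflow above fires at 08:30 Beijing time, and a run at 00:30 Beijing time would need the cron expression "30 16 * * *".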

Conclusion

Before this, I only knew that RSS feeds existed and none of the details. This time I sorted some of them out and tried building things myself, which was pretty fun

It feels great to know more than one language. Sometimes it gives you a completely different way of thinking, and maybe a better solution

Help me up, I can still keep learning, haha

