This article is participating in Python Theme Month. See the link to the event for more details

Preface

I have been reading a lot of good columns on Juejin (Nuggets), and I’ve been thinking about how to gather a whole column into one place.

The answer is to generate a PDF file: it aggregates the data for you, me, and everyone else. Don’t worry, the method is simple and takes fewer than 200 lines of code.

Source: JueJinColToPdf

Online preview: My Juejin front-end weekly column (PDF)

Results

Code execution

PDF document output

The basic idea

  1. Collect API information and associations
  2. Download column data
  3. Generate HTML code (Markdown to HTML)
  4. Generate PDF files

Collect API information and associations

Data is the root of everything, so let’s start by analyzing how to get the data for a Juejin column.

In fact, three APIs suffice:

  1. Column summary: returns the column’s summary information, mainly the column ID and column name. API address: api.juejin.cn/content_api…

  2. Column article list: returns the list of articles in the column, mainly the article IDs, which link the other two calls together. API address: api.juejin.cn/content_api…

  3. Article details: returns the details of one article, mainly the article title and the article content in Markdown format. API address: api.juejin.cn/content_api…

The overall process is as follows:

Get column information

API address: api.juejin.cn/content_api…

Request parameters (query string):

  1. column_id: the column ID

The one useful field in the response is title, which is needed to generate the HTML and PDF.

{
    "column_id": "6979380367216082957",
    "data": {
        "column_version": {
            "title": "Front-end basic progression" // Column title
        }
    }
}
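
Reading that field can be sketched as follows. This is only an illustration, assuming the response has already been parsed into a Python dict shaped like the sample above; the helper name column_title and its fallback value are hypothetical and not part of the original source.

```python
def column_title(response, default="untitled"):
    """Return data.column_version.title from a parsed column response.

    Assumption: `response` is a dict shaped like the sample above.
    `default` is a hypothetical fallback used when a field is missing.
    """
    data = response.get("data") or {}
    version = data.get("column_version") or {}
    return version.get("title") or default


sample = {"data": {"column_version": {"title": "Front-end basic progression"}}}
print(column_title(sample))
```

Falling back to a default instead of raising keeps the later HTML and PDF file names usable even when a response is incomplete.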

Get the column’s article list

API address: https://api.juejin.cn/content_api/v1/column/articles_cursor

Request parameters

{
    "column_id": "6979380367216082957", // Column ID
    "cursor": "0", // Current offset
    "limit": 20, // Page size
    "sort": 0 // 0: by publish time, newest first; otherwise the reverse
}

The only useful field in the response is article_id.

{
    "err_no": 0,
    "err_msg": "success",
    "cursor": "2",
    "count": 5,
    "has_more": false,
    "data": [
        {
            "article_id": "6989391487200919566"
        }
    ]
}

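Because the request carries cursor and limit and the response carries cursor and has_more, a column with more than one page of articles needs a loop. The following is a minimal sketch rather than the author’s code. It assumes the endpoint accepts a JSON POST body and that the cursor returned in a response is the value to send for the next page; the names post_json and fetch_article_ids are hypothetical.

```python
ARTICLES_URL = "https://api.juejin.cn/content_api/v1/column/articles_cursor"


def post_json(url, payload):
    """POST a JSON body and return the parsed JSON response.

    Assumption: the endpoint accepts a JSON POST body. requests is
    imported lazily so the rest of the sketch works without it.
    """
    import requests
    return requests.post(url, json=payload, timeout=10).json()


def fetch_article_ids(column_id, post=post_json, limit=20):
    """Collect every article_id of a column by following cursor/has_more."""
    ids, cursor = [], "0"
    while True:
        page = post(ARTICLES_URL, {
            "column_id": column_id,
            "cursor": cursor,
            "limit": limit,
            "sort": 0,
        })
        ids.extend(item["article_id"] for item in page.get("data") or [])
        if not page.get("has_more"):
            return ids
        # Assumption: the returned cursor is the offset of the next page.
        cursor = page["cursor"]
```

Passing the request function in as a parameter makes it possible to try the loop against canned responses before touching the network.
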
Get article details

API address: https://api.juejin.cn/content_api/v1/article/detail

Request parameters:

{
    "article_id": "6989391487200919566" // Article ID
}

The result: the mark_content field under data.article_info holds the article’s original content in Markdown syntax.

{
    "data": {
        "article_id": "6989391487200919566",
        "article_info": {
            "mark_content": "---\ntheme: channing-cyan\nhighlight: a11y-dark\n---\n\n\n## Preface\n\nThis article is included in the **[Front-end basic progression](https://juejin.cn/column/6979380367216082957)** column"
        }
    },
    "err_no": 0,
    "err_msg": "success"
}

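Pulling the Markdown out of such a response can be sketched like this. It assumes that err_no equal to 0 means success, as the sample suggests, and that the response is already a parsed dict; the function name markdown_source is hypothetical.

```python
def markdown_source(detail_response):
    """Return data.article_info.mark_content from an article detail response.

    Assumption: err_no == 0 signals success, as in the sample response.
    """
    if detail_response.get("err_no") != 0:
        raise ValueError("API error: %s" % detail_response.get("err_msg"))
    info = (detail_response.get("data") or {}).get("article_info") or {}
    return info.get("mark_content", "")


sample = {
    "err_no": 0,
    "err_msg": "success",
    "data": {"article_info": {"mark_content": "## Preface\n\nHello"}},
}
print(markdown_source(sample))
```
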
Download column data

We chose Python’s well-known requests library. It doesn’t support asynchronous requests, but that’s fine here.

Basic order:

  1. Download column information
  2. Download column list information
  3. Download individual article information

Note:

  1. We create a folder named after the column ID and save the downloaded JSON files in it.

    Why? What matters most in the 21st century? Data! With the data saved, you can generate your own web pages, or produce Word, PDF, or any other format; the options are unlimited.
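
A minimal sketch of that storage layout, assuming one JSON file per downloaded object inside a folder named after the column ID. The helper names save_json and load_json and the file naming are illustrative; the save functions in the author’s source may be organized differently.

```python
import json
from pathlib import Path


def save_json(column_id, name, data, root="."):
    """Write `data` to <root>/<column_id>/<name>.json and return the path."""
    folder = Path(root) / str(column_id)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / ("%s.json" % name)
    # ensure_ascii=False keeps Chinese text readable in the saved file.
    path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return path


def load_json(path):
    """Read a JSON file written by save_json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Keeping the raw JSON on disk means the HTML and PDF steps can be rerun without requesting the articles again.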

Since this article is in the Python category, here is the main flow of the code. See the source, JueJinColToPdf, for more details.

import json

# getColumn, getArticleList, getArticleContent and the save* helpers
# are not shown here; see the source repository.
def downloadColumn(cid):
    # 1. Column summary: gives us the column title
    column = getColumn(cid)
    print("Success in obtaining column information, column name:",
          column["data"]["column_version"]["title"])
    saveColumn(cid, json.dumps(column["data"]))

    # 2. Article list: gives us the article IDs
    articleList = getArticleList(cid)
    saveArticleList(cid, json.dumps(articleList["data"]))

    articleIds = [item["article_id"] for item in articleList["data"]]
    print("Success in getting column list:", articleIds)

    # 3. Article details: one request per article ID
    for artId in articleIds:
        artContent = getArticleContent(artId)
        saveArticleContent(cid, artId, json.dumps(artContent["data"]))
        print("Article with ID %s downloaded" % artId)

Generate HTML code

Let’s go straight to the key points.

How to convert Markdown to HTML

Stand on the shoulders of giants: use the Markdown library.

html += markdown.markdown(
    removeScheme(article["article_info"]["mark_content"]), extensions=extensions)

How to highlight code

Use the Pygments library to generate the highlighting stylesheet:

pygmentize -S default -f html -a .codehilite > code.css

Then enable the matching extensions in the Markdown library:

extensions = [  # Markdown extensions to load
    'markdown.extensions.extra',
    'markdown.extensions.codehilite',  # Code highlighting extension
    'markdown.extensions.toc',
    'markdown.extensions.tables',
    'markdown.extensions.fenced_code',
]

html += markdown.markdown(
    removeScheme(article["article_info"]["mark_content"]), extensions=extensions)

Remove the Juejin theme and code-highlight settings

If the author selected a theme or a code-highlight style, the Markdown source starts with the following extra content.

---
theme: github
highlight: a11y-light
---

Solution: strip it with a regular-expression substitution or by slicing the string.
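
The regular-expression variant might look like the sketch below. It is not the author’s removeScheme, whose implementation is not shown here; the name strip_front_matter and the exact pattern are assumptions. The pattern only removes a block that sits at the very start of the text and is delimited by two lines of three dashes.

```python
import re

# A leading block delimited by two "---" lines; [ \t]* tolerates trailing
# spaces on the delimiter lines without swallowing the blank lines after.
FRONT_MATTER = re.compile(r"\A---[ \t]*\n.*?\n---[ \t]*(?:\n|\Z)", re.DOTALL)


def strip_front_matter(text):
    """Remove the theme/highlight header if the text starts with one."""
    return FRONT_MATTER.sub("", text, count=1)


sample = "---\ntheme: github\nhighlight: a11y-light\n---\n\n## Preface\n"
print(repr(strip_front_matter(sample)))
```

Anchoring the pattern with \A matters: a horizontal rule written as three dashes later in the article is left alone.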

How to customize styles

Create a new extend.css:

* {
    font-size: 30px; /* Default font size */
}
img {
    display: block;
    min-width: 50%; /* Images are at least 50% wide */
}

How to generate a large HTML file

  1. Create an HTML file.
  2. Put the contents of core.css and extend.css inside the HTML style tag.
  3. Write the opening part of the HTML.
  4. Write the HTML converted from Markdown.
  5. Finally, write the closing part of the HTML.
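
The steps above can be sketched as a single function. This is an illustration under assumptions, not the author’s code: the name build_html, the minimal page skeleton, and passing the CSS and article HTML in as strings are all choices made for the example.

```python
import html


def build_html(title, css_texts, body_fragments):
    """Join CSS and converted article HTML into one standalone document.

    css_texts: contents of the stylesheets, already read from disk.
    body_fragments: HTML strings produced by the Markdown conversion.
    """
    style = "\n".join(css_texts)
    head = (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        "<title>%s</title>\n<style>\n%s\n</style>\n</head>\n<body>\n"
        % (html.escape(title), style)
    )
    tail = "\n</body>\n</html>\n"
    return head + "\n".join(body_fragments) + tail


page = build_html("Demo", ["img { display: block; }"], ["<h1>One</h1>", "<h1>Two</h1>"])
print(page)
```

Inlining the CSS in a style tag keeps the HTML file self-contained, so the PDF converter needs only that one file.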

What if there are too many articles

Split them into batches by a fixed number of articles, such as 20, or by a fixed number of words, such as 50,000.

This is not implemented in my code, but that’s not a big deal: I have generated PDFs of 400+ pages from the HTML.
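
For readers who do want the split, one possible approach is sketched below. It is hypothetical and not part of the author’s code; it uses character count as a rough stand-in for word count, and the name split_articles and both limits are assumptions taken from the numbers in the text.

```python
def split_articles(articles, max_articles=20, max_chars=50000):
    """Group Markdown strings into batches bounded by count and total length.

    A single article longer than max_chars still gets a batch of its own.
    """
    batches, current, size = [], [], 0
    for text in articles:
        too_many = len(current) >= max_articles
        too_long = len(current) > 0 and size + len(text) > max_chars
        if too_many or too_long:
            batches.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        batches.append(current)
    return batches


print([len(batch) for batch in split_articles(["x" * 10] * 45)])
```

Each batch could then be turned into its own HTML file and PDF.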

Conversion to PDF

This relies on wkhtmltopdf.exe, a remarkable tool that converts web URLs or local HTML files directly into PDF files. Most importantly, it fully supports generating a table of contents (outline).

Download and install it from the official site, then add it to your environment variables. The code is as follows.

import subprocess

def genPdf(title):
    # TODO: configure the full path if wkhtmltopdf is not on PATH
    exePath = "wkhtmltopdf.exe"
    sourcePath = "./htmls/%s.html" % title
    targetPath = "./pdfs/%s.pdf" % title
    # Passing the arguments as a list avoids manual quoting of paths
    cmd = [exePath, "--outline-depth", "2", "--footer-center", "[page]",
           sourcePath, targetPath]
    print(" ".join(cmd))
    subprocess.call(cmd)
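
Since the conversion depends on an external program and on the ./htmls and ./pdfs folders, a small check before calling it can save some confusion. The sketch below is an assumption-laden illustration: the names find_wkhtmltopdf and prepare_paths are hypothetical, and the folder layout is taken from the function above.

```python
import shutil
from pathlib import Path


def find_wkhtmltopdf(explicit_path=None):
    """Return a usable wkhtmltopdf path, or None if it cannot be found."""
    if explicit_path and Path(explicit_path).is_file():
        return str(explicit_path)
    # shutil.which searches PATH (and honours PATHEXT on Windows).
    return shutil.which("wkhtmltopdf")


def prepare_paths(title, html_dir="./htmls", pdf_dir="./pdfs"):
    """Check that the HTML exists, create the PDF folder, return both paths."""
    source = Path(html_dir) / ("%s.html" % title)
    if not source.is_file():
        raise FileNotFoundError("HTML not found: %s" % source)
    Path(pdf_dir).mkdir(parents=True, exist_ok=True)
    return source, Path(pdf_dir) / ("%s.pdf" % title)
```

If find_wkhtmltopdf returns None, the tool is either not installed or not on PATH, which is the situation the TODO in the function above refers to.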

Final words

Writing is not easy; your likes and comments are my biggest motivation.