The reason

My colleague found a Tencent microservice development document, but because of his work he cannot be online all the time, so he asked me to crawl the document and generate a PDF from it. At first this looked very simple: the document is a GitBook, and there are plenty of online tools for that.

The link, however, is not simple: tsf-gitbook-1257356411.cos.ap-chengdu.myqcloud.com/1.12.4/usag…

Why the link is not simple

The root link (tsf-gitbook-1257356411.cos.ap-chengdu.myqcloud.com/1.12.4/usag…
The links are cluttered: the number and content of the links in the HTML source differ between the first and second chapters
Some links cannot be clicked; those articles may not be finished yet
Because the root link is inaccessible, the conversion has to start from a Chinese-language path, which rules out many ready-made tools

(github.com/TruthHun/co…).

For a non-standard GitBook like this one, the only option is to write the crawler yourself.

Train of thought

Find a tool that converts HTML to PDF
Fetch all the HTML, package it into a ZIP file, and use the tool to convert it directly to PDF

Use off-the-shelf tools

Download tool

Calibre (HTML to PDF)

Installing calibre
Download it at calibre-ebook.com/download
Make sure calibre is installed correctly on your system. Note that calibre should be version 3.x or later, since older versions are not very powerful; simply installing the latest release is fine. After installing calibre, add it to the system PATH environment variable and run the following command, which should report a 3.x (or later) version.

ebook-convert --version
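The same check can be automated from Go before starting a conversion. The sketch below looks for ebook-convert on PATH and reads the major version from its output. It assumes the output contains a token such as `calibre 3.48.0`; that format and the helper names `parseMajor` and `calibreMajor` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
)

// versionRe matches a token such as "calibre 3.48.0" (assumed output shape).
var versionRe = regexp.MustCompile(`calibre (\d+)\.(\d+)`)

// parseMajor extracts the major version number from the text printed by
// "ebook-convert --version". It returns an error if no version is found.
func parseMajor(out string) (int, error) {
	m := versionRe.FindStringSubmatch(out)
	if m == nil {
		return 0, fmt.Errorf("no calibre version found in %q", out)
	}
	return strconv.Atoi(m[1])
}

// calibreMajor locates ebook-convert on PATH and returns its major version.
func calibreMajor() (int, error) {
	path, err := exec.LookPath("ebook-convert")
	if err != nil {
		return 0, fmt.Errorf("ebook-convert is not on PATH: %w", err)
	}
	out, err := exec.Command(path, "--version").CombinedOutput()
	if err != nil {
		return 0, fmt.Errorf("running %s --version: %w", path, err)
	}
	return parseMajor(string(out))
}

func main() {
	major, err := calibreMajor()
	if err != nil {
		fmt.Println("calibre check failed:", err)
		return
	}
	if major < 3 {
		fmt.Printf("calibre %d.x found; 3.x or later is expected\n", major)
		return
	}
	fmt.Printf("calibre %d.x found\n", major)
}
```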

Google Chrome (save current HTML)

Usage:

I have not yet found a tool that can save all of the pages as HTML with one click

Usage

Generate an .epub file. The steps are not described here; see the code implementation below for the details
Run the command ebook-convert demo.epub demo.pdf

Coding

This approach evolved from the tool-based one above, and the underlying idea is the same

Get everything and save it as HTML
1. Obtain all pages to be crawled based on the url configured in the configuration file
2. Each HTML page contains a directory-like list of links on the left. HTML of this kind is very unfriendly to the PDF table of contents
3. So we take only the book body (the page-inner div) from each GitBook page and synthesize the HTML ourselves

body := htmlquery.Find(doc, "//div[@class='page-inner']")
if len(body) != 0 {
    pdfBody := body[0]
    htmlBody := htmlquery.OutputHTML(pdfBody, true)
    htmlTempleta := `<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
    <title>%v</title>
    <link href="gitbook.css" rel="stylesheet">
</head>
<body>%v
</body>
</html>`
    htmlTempleta = fmt.Sprintf(htmlTempleta, book.Title, htmlBody)
}
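Step 1 above, collecting the pages to crawl, is complicated by the cluttered links and the Chinese-language paths mentioned earlier. The following is a minimal sketch of one way to normalize the href values found on a page, using only the standard library. The function name `resolveLinks`, the host `example.com`, and the sample paths are hypothetical and do not come from the original project.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLinks turns the href values found on a page into absolute URLs,
// dropping fragments, duplicates, and links that leave the book's host.
func resolveLinks(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, fmt.Errorf("parsing base URL: %w", err)
	}
	seen := make(map[string]bool)
	var pages []string
	for _, href := range hrefs {
		ref, err := url.Parse(href)
		if err != nil {
			continue // skip malformed links instead of aborting the crawl
		}
		abs := baseURL.ResolveReference(ref)
		abs.Fragment = "" // "#section" anchors point at the same page
		if abs.Host != baseURL.Host {
			continue // external, mailto:, and javascript: links are ignored
		}
		key := abs.String() // String() percent-encodes non-ASCII path segments
		if !seen[key] {
			seen[key] = true
			pages = append(pages, key)
		}
	}
	return pages, nil
}

func main() {
	// Hypothetical base URL and links, for illustration only.
	pages, err := resolveLinks(
		"http://example.com/1.12.4/usage/index.html",
		[]string{"intro.html", "intro.html#top", "../概述/a.html"},
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range pages {
		fmt.Println(p)
	}
}
```

Deduplicating on the fully resolved URL means the same page is fetched only once, even when the sidebar lists it several times under different relative paths or anchors.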

Generate directory HTML

htmlTempleta := `<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
    <title>%v</title>
    <link href="gitbook.css" rel="stylesheet">
</head>
<body>%v
</body>
</html>`
htmlTempleta = fmt.Sprintf(htmlTempleta, value, value)

Assemble the EPUB file (see the reference link)
1. Generate the mimetype file
2. Generate the container.xml file
3. Generate the directory files
4. Generate the home page log
5. Generate the content.opf file
6. Package the generated files into a ZIP archive, then change the extension to .epub
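The packaging step can be sketched with Go's standard `archive/zip` package. This is an illustration under stated assumptions, not the project's actual code: the names `buildEPUB` and `listEntries` are hypothetical, and the sketch assumes content.opf sits at the root of the archive. The EPUB container format expects the mimetype entry to come first and to be stored uncompressed, which is why it is written separately from the other files.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"sort"
)

// containerXML points readers at the package document. The root-level
// "content.opf" path is an assumption about where that file is placed.
const containerXML = `<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
`

// buildEPUB returns the bytes of an EPUB archive. files maps archive paths
// (for example "content.opf" or "chapter1.html") to their contents.
func buildEPUB(files map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)

	// The mimetype entry goes first and is stored without compression.
	mt, err := zw.CreateHeader(&zip.FileHeader{Name: "mimetype", Method: zip.Store})
	if err != nil {
		return nil, err
	}
	if _, err := mt.Write([]byte("application/epub+zip")); err != nil {
		return nil, err
	}

	// add writes one compressed entry to the archive.
	add := func(name string, data []byte) error {
		w, err := zw.Create(name)
		if err != nil {
			return err
		}
		_, err = w.Write(data)
		return err
	}
	if err := add("META-INF/container.xml", []byte(containerXML)); err != nil {
		return nil, err
	}

	// Sort the remaining names so the archive layout is deterministic.
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if err := add(name, files[name]); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// listEntries reads an archive back and returns its entry names in order.
func listEntries(epub []byte) ([]string, error) {
	r, err := zip.NewReader(bytes.NewReader(epub), int64(len(epub)))
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(r.File))
	for _, f := range r.File {
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	// Placeholder contents; a real book needs a valid content.opf.
	data, err := buildEPUB(map[string][]byte{
		"content.opf":   []byte("<package/>"),
		"chapter1.html": []byte("<p>chapter 1</p>"),
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	names, _ := listEntries(data)
	fmt.Println(names)
}
```

Writing the returned bytes to a file with the .epub extension (for example with `os.WriteFile`) gives the same result as zipping the files and renaming the archive.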

Convert the assembled EPUB to PDF using the calibre command (PDF optional)

args := []string{
    this.BasePath + "/content.epub",
    this.BasePath + "/" + output + "/book.pdf",
}
if len(this.Config.PaperSize) > 0 {
    args = append(args, "--paper-size", this.Config.PaperSize)
}
if len(this.Config.FontSize) > 0 {
    args = append(args, "--pdf-default-font-size", this.Config.FontSize)
}
// header template
if len(this.Config.Header) > 0 {
    args = append(args, "--pdf-header-template", this.Config.Header)
}
// footer template
if len(this.Config.Footer) > 0 {
    args = append(args, "--pdf-footer-template", this.Config.Footer)
}
if len(this.Config.MarginLeft) > 0 {
    args = append(args, "--pdf-page-margin-left", this.Config.MarginLeft)
}
if len(this.Config.MarginTop) > 0 {
    args = append(args, "--pdf-page-margin-top", this.Config.MarginTop)
}
if len(this.Config.MarginRight) > 0 {
    args = append(args, "--pdf-page-margin-right", this.Config.MarginRight)
}
if len(this.Config.MarginBottom) > 0 {
    args = append(args, "--pdf-page-margin-bottom", this.Config.MarginBottom)
}
if len(this.Config.More) > 0 {
    args = append(args, this.Config.More...)
}
fmt.Println(args)
cmd := exec.Command(ebookConvert, args...)
return cmd.Run()
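Because `cmd.Run()` is called without setting `Stdout` or `Stderr`, anything calibre prints is discarded, which can make a failed conversion hard to diagnose. One possible variation, shown here only as a sketch, captures the output and adds a timeout. The function name `runConvert`, the file names, and the ten-minute limit are assumptions for illustration, not part of the original code.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runConvert runs bin with args and stops it once timeout has elapsed.
// On failure the returned error includes everything the command printed.
func runConvert(bin string, args []string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, bin, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("%s timed out after %s", bin, timeout)
	}
	if err != nil {
		return fmt.Errorf("%s failed: %w\n%s", bin, err, out)
	}
	return nil
}

func main() {
	// Hypothetical file names; replace them with real paths.
	args := []string{"content.epub", "book.pdf"}
	if err := runConvert("ebook-convert", args, 10*time.Minute); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("conversion finished")
}
```

The argument slice built in the code above could be passed to such a helper unchanged.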

Matters needing attention

This approach does not work for GitBooks in general. To use it on another book, you need to modify the crawl logic; the relevant code is in crawl/htmlspider, where you change what gets captured
You also need to edit the automatically generated JSON file, because some links cannot be followed. For details, see the documentation in the GitHub repository
The source code


Go Language Development Tools (Gitbook to PDF)

