Background
A colleague found a Tencent microservice development document, but his work does not always allow Internet access, so he asked me to crawl the document and generate a PDF. It looked simple at first: the document is a GitBook, and there are plenty of online tools for that. It turned out not to be.
The tricky link: tsf-gitbook-1257356411.cos.ap-chengdu.myqcloud.com/1.12.4/usag…
What makes this link tricky
- The root link (tsf-gitbook-1257356411.cos.ap-chengdu.myqcloud.com/1.12.4/usag…) is inaccessible
- The links are inconsistent: the number and content of the links in the HTML source differ between the first and second chapters
- Some links cannot be clicked; the document may be unfinished
- Because the root link is inaccessible, conversion has to go through the Chinese-language paths, which rules out many ready-made tools (github.com/TruthHun/co…).
A non-standard GitBook like this can only be crawled with custom code.
Approach
- Find a tool that converts HTML to PDF
- Get all the HTML, package it into a ZIP file, and use the tool to convert it directly to PDF
Use off-the-shelf tools
Download the tools
Calibre (HTML to PDF)
- Install calibre
- Download it at calibre-ebook.com/download
- Make sure calibre is properly installed on your system. (Note: use calibre 3.x or later; older versions are less capable, so just install the latest release.) After installing calibre, add it to the system PATH environment variable and run the following command, which should report version 3.x or later:
ebook-convert --version
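The same check can be automated before starting a long crawl. The following is a minimal sketch, not part of the original project: the helper names `parseCalibreMajor` and `checkCalibre` are hypothetical, and the regular expression assumes the version banner contains text of the form `calibre X.Y`.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strconv"
)

// versionRe matches the "calibre X.Y" part of the version banner.
// The banner format is an assumption, not something the article specifies.
var versionRe = regexp.MustCompile(`calibre (\d+)\.(\d+)`)

// parseCalibreMajor extracts the major version number from the text
// printed by `ebook-convert --version`.
func parseCalibreMajor(output string) (int, error) {
	m := versionRe.FindStringSubmatch(output)
	if m == nil {
		return 0, fmt.Errorf("no calibre version found in %q", output)
	}
	return strconv.Atoi(m[1])
}

// checkCalibre runs ebook-convert and returns an error if the command
// cannot be run (for example, not on PATH) or is older than minMajor.
func checkCalibre(minMajor int) error {
	out, err := exec.Command("ebook-convert", "--version").Output()
	if err != nil {
		return fmt.Errorf("ebook-convert not runnable: %w", err)
	}
	major, err := parseCalibreMajor(string(out))
	if err != nil {
		return err
	}
	if major < minMajor {
		return fmt.Errorf("calibre %d.x found, need %d.x or later", major, minMajor)
	}
	return nil
}

func main() {
	if err := checkCalibre(3); err != nil {
		fmt.Println("calibre check failed:", err)
		return
	}
	fmt.Println("calibre is ready")
}
```

Keeping the parsing separate from the command execution makes the version logic easy to test without calibre installed.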
Google Chrome (save current HTML)
Usage:
I have not yet found a tool that saves all of the pages as HTML with one click
Usage
- Generate an .epub file. The details are not covered here; see the code implementation below
- Run the command
ebook-convert demo.epub demo.pdf
Coding it yourself
This approach grew out of the tool-based one; the overall idea is the same
- Get all the pages and save them as HTML
- Collect every page to crawl, starting from the URL set in the configuration file
- Each page's HTML contains a table-of-contents-like list of links on the left; HTML like this works badly with the PDF table of contents
- So we take only the book body content of each GitBook page and synthesize the HTML ourselves
    body := htmlquery.Find(doc, "//div[@class='page-inner']")
    if len(body) != 0 {
        // Keep only the page body and wrap it in a minimal HTML shell.
        pdfBody := body[0]
        htmlBody := htmlquery.OutputHTML(pdfBody, true)
        htmlTempleta := `<!DOCTYPE html>
    <html lang="zh-CN">
    <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
    <title>%v</title>
    <link href="gitbook.css" rel="stylesheet">
    </head>
    <body>%v
    </body>
    </html>`
        htmlTempleta = fmt.Sprintf(htmlTempleta, book.Title, htmlBody)
    }
- Generate directory HTML
    htmlTempleta := `<!DOCTYPE html>
    <html lang="zh-CN">
    <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
    <title>%v</title>
    <link href="gitbook.css" rel="stylesheet">
    </head>
    <body>%v
    </body>
    </html>`
    htmlTempleta = fmt.Sprintf(htmlTempleta, value, value)
- Assemble the EPUB file (Refer to the link)
- Generate the mimetype
- Generate the container.xml file
- Generate directory files
- Generate the home page log
- Generate the content.opf file
- Package the generated files above into a zip archive, then change the extension to .epub
- Convert the EPUB to PDF using the calibre command (PDF optional)
    // Input EPUB and output PDF paths come first.
    args := []string{
        this.BasePath + "/content.epub",
        this.BasePath + "/" + output + "/book.pdf",
    }
    // Each optional setting is appended only when it is configured.
    if len(this.Config.PaperSize) > 0 {
        args = append(args, "--paper-size", this.Config.PaperSize)
    }
    if len(this.Config.FontSize) > 0 {
        args = append(args, "--pdf-default-font-size", this.Config.FontSize)
    }
    // header template
    if len(this.Config.Header) > 0 {
        args = append(args, "--pdf-header-template", this.Config.Header)
    }
    // footer template
    if len(this.Config.Footer) > 0 {
        args = append(args, "--pdf-footer-template", this.Config.Footer)
    }
    if len(this.Config.MarginLeft) > 0 {
        args = append(args, "--pdf-page-margin-left", this.Config.MarginLeft)
    }
    if len(this.Config.MarginTop) > 0 {
        args = append(args, "--pdf-page-margin-top", this.Config.MarginTop)
    }
    if len(this.Config.MarginRight) > 0 {
        args = append(args, "--pdf-page-margin-right", this.Config.MarginRight)
    }
    if len(this.Config.MarginBottom) > 0 {
        args = append(args, "--pdf-page-margin-bottom", this.Config.MarginBottom)
    }
    // Any extra ebook-convert arguments are passed through unchanged.
    if len(this.Config.More) > 0 {
        args = append(args, this.Config.More...)
    }
    fmt.Println(args)
    cmd := exec.Command(ebookConvert, args...)
    return cmd.Run()
Notes
- This approach does not work for GitBooks in general. To make it work for another book, modify the crawl logic; the relevant code is in crawl/htmlspider
- You need to edit the automatically generated JSON file because some links cannot be redirected. For details, see the documentation in the GitHub repository
- The source code