Preface
This article explains how to add a table of contents (bookmarks) to a PDF that has none.
Nothing here is original. I have simply collected the methods of those who came before me so that I can look them up later, and I thank them for working this out. This is my first article on Juejin, so if you spot a mistake, please point it out!
What you need
- FreePic2Pdf, used to add the table of contents to the PDF.
- SysTools PDF Unlocker, used to unlock protected PDFs. A downloaded PDF is sometimes protected and cannot be edited; this software removes the protection. Optional.
- Regular expressions, though only occasionally. See the regular expression basics article listed in the resources.
- An editor that supports regular expressions in find and replace, such as Visual Studio Code.
Getting started
In this walkthrough I use a scanned PDF.
Extract the table of contents
For the example used here, the structure of the PDF's table of contents and its counterpart in the editor are shown below:
The easiest way to do it
Sometimes the download site provides a table of contents, as shown below.
In this case, copy the table of contents straight into the text editor, as shown in the following figure:
Note: this table of contents has no page numbers. Some websites do provide them; if so, copy the page numbers along with the titles. Sometimes simply searching Baidu for the book's table of contents works surprisingly well!
Text-based PDF
If the PDF is text-based, its text can be copied. Copy the book's table of contents from the PDF into the text editor and tidy up the formatting. The result is the same as with the easiest method above.
Scanned PDF (the remaining steps are based on this case)
With a scanned PDF, creating a table of contents is a bit trickier: the text cannot be copied, and if the download page does not provide a table of contents either, you have to type it into the editor by hand. Another option is OCR text recognition, which reduces the manual effort, especially for large PDF files. Since I used a scanned PDF here, I ran OCR to roughly recognize the text of the table of contents, copied it into the editor, and corrected it, as shown below:
There are no page numbers, because the OCR did not return them, so you have to add them yourself. I add them in the format title + space + page number.
Add page numbers to the extracted table of contents (skip this section if the page numbers were copied along with the titles)
The OCR tool I used did not recognize the page numbers, so I had to add them by hand. The final result is shown below (I'm exhausted):
Format the table of contents in the editor to meet the requirements of FreePic2Pdf
FreePic2Pdf uses the tab character (\t) to express the levels of the table of contents:
Section 1 (preceded by one \t)
Part 1 (preceded by two \t)
So, our goal is the following:
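As a rough sketch of that target format in Python, each line starts with one tab per heading level. The sample headings and the helper name `to_outline` are hypothetical and are not taken from the book used in this article.

```python
# Hypothetical sample entries as (depth, text); depth 0 is the top level.
entries = [
    (0, "Part 1 Foundations 1"),
    (1, "Chapter 1 Getting Started 3"),
    (2, "1, Why Bookmarks Matter 5"),
]


def to_outline(entries):
    """Prefix each entry with one tab character per level of depth."""
    return "\n".join("\t" * depth + text for depth, text in entries)


outline = to_outline(entries)
print(outline)
```

The page numbers are still separated from the titles by a space at this stage; that is dealt with in a later step.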
Process the first-level headings
We use regular expressions for this. In the table of contents of this PDF, the top level is "Part X", which needs no processing. Each "Chapter X" line needs one tab character in front of it, so the search expression is ^(chapter [0-9]{1,2}.*) and the replacement expression is \t$1. The operation is shown below:
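The same replacement can be tried outside the editor. Below is a minimal Python sketch, assuming hypothetical sample lines and case-insensitive matching (my assumption, so that "Chapter" matches the lowercase expression). Python writes the first capture group as \1 where the editor uses $1.

```python
import re

# Hypothetical sample lines; the real ones come from the book's table of contents.
toc = "\n".join([
    "Part 1 Foundations 1",
    "Chapter 1 Getting Started 3",
    "Chapter 12 Wrapping Up 210",
])

# Same expression as in the text. re.MULTILINE makes ^ match at every line start;
# re.IGNORECASE is an assumption so that "Chapter" matches the lowercase pattern.
pattern = re.compile(r"^(chapter [0-9]{1,2}.*)", re.MULTILINE | re.IGNORECASE)

# Put one tab in front of every matching line.
indented = pattern.sub(r"\t\1", toc)
print(indented)
```

The "Part" line is left alone because it does not match the expression.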
Handle the second-level headings
Second-level headings have the form "n, title". The search expression is ^(.,.+) and the replacement expression is \t\t$1, as shown below. Note that the leading . matches exactly one character, so this expression only covers a single-digit n.
If there are deeper heading levels, handle them in the same way.
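For several levels at once, the idea can be sketched in Python as a list of (expression, depth) rules. The first two expressions are the ones from this article; the third rule, the sample lines, and the names `RULES` and `indent_line` are hypothetical and only illustrate how a deeper level might be added.

```python
import re

# (expression, depth) rules; the first expression that matches decides the depth.
RULES = [
    (re.compile(r"^(chapter [0-9]{1,2}.*)", re.IGNORECASE), 1),  # from the article
    (re.compile(r"^(.,.+)"), 2),                                 # from the article
    (re.compile(r"^(\([0-9]+\).+)"), 3),                         # hypothetical third level
]


def indent_line(line):
    """Return the line prefixed with as many tabs as its matching rule asks for."""
    for pattern, depth in RULES:
        if pattern.match(line):
            return "\t" * depth + line
    return line  # top-level "Part X" lines and irregular headings stay untouched


sample = [
    "Part 1 Foundations 1",
    "Chapter 1 Getting Started 3",
    "1, Why Bookmarks Matter 5",
    "(1) A Sub-point 6",
]
result = [indent_line(line) for line in sample]
print("\n".join(result))
```

Lines that match no rule are returned unchanged, so irregular headings still have to be placed by hand afterwards.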
Handling special headings
The table of contents also contains some irregular titles, such as “What History tells us about today……”. These have to be moved to the appropriate level, either with regular expressions or by hand.
Set the page number to a suitable format
The page number after each title matters: clicking a title in the PDF jumps to the corresponding page only if the page number is set correctly.
- The title and the page number must be separated by exactly one \t
- So the space between the title and the page number has to be replaced with \t
We use a regular expression to find the page number and replace the space in front of it with \t. The search expression is (\s{1,1})([0-9]{1,3}) and the replacement expression is \t$2, as shown below.
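Here is a small Python sketch of this replacement on hypothetical sample lines. It also shows a stricter variant anchored to the end of the line; that variant is my own assumption, not the author's expression. It may help when a title itself contains a space followed by a digit, as in "Chapter 1".

```python
import re

# Hypothetical sample lines that already carry their leading tabs.
lines = [
    "\tChapter 1 Getting Started 3",
    "\t\t1, Why Bookmarks Matter 5",
]

# Expression from the text: any single whitespace character followed by digits.
loose = re.compile(r"(\s{1,1})([0-9]{1,3})")
# Stricter variant (an assumption): only the space before the digits that end the line.
strict = re.compile(r" ([0-9]{1,3})$")

loose_out = [loose.sub(r"\t\2", line) for line in lines]
strict_out = [strict.sub(r"\t\1", line) for line in lines]

print(loose_out)
print(strict_out)
```

On the first sample line the expression from the text also turns the space before the chapter number into a tab, while the anchored variant only touches the page number. Whether this matters depends on how the headings in your own table of contents are written.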
Once you have checked the result and found no errors, the table of contents is ready!
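Part of that check can be automated. The Python sketch below tests each line for the shape described in this article: leading tabs, a title, exactly one tab, then the page number. The rule that a heading may not jump more than one level deeper than the previous line is my own sanity check, not a documented FreePic2Pdf requirement, and the name `check_outline` is hypothetical.

```python
import re

# Leading tabs, a title without tabs, exactly one tab, then the page number.
LINE = re.compile(r"^(\t*)([^\t]+)\t([0-9]+)$")


def check_outline(text):
    """Return a list of (line_number, message); an empty list means no problems found."""
    problems = []
    prev_depth = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        match = LINE.match(line)
        if match is None:
            problems.append((number, "expected: tabs, title, one tab, page number"))
            continue
        depth = len(match.group(1))
        if depth > prev_depth + 1:
            # Assumed sanity rule: do not jump more than one level deeper at once.
            problems.append((number, "heading level skips a depth"))
        prev_depth = depth
    return problems


good = "Part 1\t1\n\tChapter 1 Intro\t3\n\t\t1, Topic\t5"
bad = "Part 1 1\n\t\tDeep\t4"
print(check_outline(good))
print(check_outline(bad))
```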
Add the table of contents using the software
With the steps above complete, you can use the software to add the table of contents. Below is a screenshot of the software in use.
After this step, open the interface file folder.
Open the .txt file shown above and put the formatted table of contents into it.
Save the modified .txt file, then open the .itf file, make any needed changes, and save it.
Check whether the bookmarks and page numbers match; if they do not, adjust BasePage.
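One way to work out a suitable value is sketched below in Python. It assumes BasePage means the page of the PDF file on which the book's printed page 1 falls. That reading is an assumption and should be checked against FreePic2Pdf's own documentation. The function name and the numbers are hypothetical.

```python
def base_page(printed_page, pdf_page):
    """PDF page that printed page 1 would fall on, given one known matching pair."""
    return pdf_page - printed_page + 1


# Hypothetical example: a heading printed on page 5 sits on page 17 of the PDF file.
value = base_page(5, 17)
print(value)  # 13
```

If the front matter is numbered separately (for example with Roman numerals), a single offset like this may not fit every bookmark.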
The final step is shown below.
Final result
Resources
- Adding a PDF table of contents in batch (the most complete and detailed method) – CSDN
- How to effectively decrypt encrypted PDF files? – Li Xiao’s answer – Zhihu
- Regular expression basics – Juejin