Preface

This article explains how to add a table of contents (bookmarks) to a PDF that does not have one.

Nothing here is original. I have simply collected methods that others worked out so that I can look them up later, and I am grateful to the people who came up with them. This is my first article on Juejin, so if you spot a mistake, please point it out!

What you need

  1. FreePic2Pdf, used to add the table of contents to the PDF.
  2. SysTools PDF Unlocker, used to unlock protected PDFs. A downloaded PDF is sometimes protected and cannot be edited; this tool removes the protection. Optional.
  3. Regular expressions, though only a little. See the regular expression tutorial listed in the resources at the end.
  4. An editor that supports regular expressions in find and replace, such as Visual Studio Code.

Getting started

Here I use a scanned PDF as the example.

Extract the table of contents

For the example PDF used here, the structure of its table of contents and the matching text in the editor are shown below:

The easiest way

Sometimes the download site provides the table of contents, as shown below:

In that case, copy the table of contents straight into a text editor, and the extraction is done, as shown in the following figure:

Note: this table of contents has no page numbers. Some websites do include them; if so, copy the page numbers as well. Sometimes simply searching Baidu for the book's table of contents works surprisingly well!

Text PDF

If the PDF is a text PDF, its text can be copied. Copy the book's table of contents from the PDF straight into the text editor and tidy up the formatting; the result is the same as with the easiest way above.

Scanned PDF (the remaining steps are based on this case)

With a scanned PDF, creating the table of contents is trickier: the text cannot be copied, and here the download page did not provide a table of contents, so it has to be typed into the editor by hand. Another option is OCR (optical character recognition), which cuts down the manual work, especially for large PDFs. In this example I ran OCR on the scanned PDF to get a rough version of the table of contents, copied it into the editor, and corrected it, as shown below:

The page numbers are missing because the OCR did not pick them up, so they have to be added by hand. I add them in the format title + space + page number.

Add page numbers to the extracted table of contents (skip this section if the page numbers were copied along with the titles)

The OCR tool I used did not recognize the page numbers, so I had to add them manually. The final result is shown below (it was exhausting):

Format the table of contents in the editor to meet FreePic2Pdf's requirements

FreePic2Pdf uses the \t (TAB) character to mark the levels of the table of contents:

Section 1 (one \t in front)
Part 1 (two \t in front)

So, our goal is the following:
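
As a rough sketch of that target layout, the Python snippet below builds a few bookmark lines with one TAB per level. The titles, page numbers, and the helper name to_bookmark_line are hypothetical placeholders, and the single TAB before the page number anticipates the page-number formatting covered later in this article.

```python
# Hypothetical outline entries: (depth, title, page). Depth 0 is the top level.
entries = [
    (0, "part 1 Example part", 1),
    (1, "chapter 1 Example chapter", 3),
    (2, "1, Example section", 5),
]


def to_bookmark_line(depth: int, title: str, page: int) -> str:
    """One TAB per nesting level, then the title, one TAB, and the page number."""
    return "\t" * depth + f"{title}\t{page}"


bookmark_text = "\n".join(to_bookmark_line(*entry) for entry in entries)
print(bookmark_text)
```

Printing the text in a terminal shows the nesting as indentation; in the editor, turning on whitespace rendering makes it easier to tell TABs from spaces.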

Process the first-level headings

We use regular expressions for this. In this article's PDF, the top level of the table of contents is "part X", which needs no processing. Each "chapter X" line needs one TAB character in front of it, so the search expression is ^(chapter [0-9]{1,2}.*) and the replacement is \t$1. The operation is shown below:
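
For readers who prefer a script to the editor dialog, the same substitution can be sketched with Python's re module. The sample lines are hypothetical, and this is only an equivalent of the editor operation under that assumption, not part of FreePic2Pdf.

```python
import re

# Hypothetical sample lines in "title + space + page number" form.
toc = "part 1 Foundations 1\nchapter 1 Getting started 3\nchapter 12 Wrapping up 210"

# re.MULTILINE makes ^ match at the start of every line, as the editor does.
# Python writes the captured group as \1 where Visual Studio Code uses $1.
indented = re.sub(r"^(chapter [0-9]{1,2}.*)", r"\t\1", toc, flags=re.MULTILINE)
print(indented)
```

Only the two "chapter" lines gain a leading TAB; the "part" line is left untouched because it does not match the pattern.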

Process the second-level headings

Second-level headings have the form "n, title". The search expression is ^(.,.+) and the replacement is \t\t$1, as shown below:
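
It can help to try the pattern on a few lines before running the replacement on the whole file. The Python sketch below does that with hypothetical sample lines; the helper name indent_second_level is also made up for illustration.

```python
import re

SECOND_LEVEL = re.compile(r"^(.,.+)")


def indent_second_level(line: str) -> str:
    """Add two TABs when the line looks like 'n, title'; otherwise return it unchanged."""
    return SECOND_LEVEL.sub(r"\t\t\1", line)


# Hypothetical lines, checked one at a time.
for sample in ["3, Early history 12", "\tchapter 2 Origins 10", "10, Later history 40"]:
    print(repr(indent_second_level(sample)))
```

The pattern expects exactly one character before the comma, so the already indented chapter line is left alone, and so is a heading numbered with two characters such as "10, ...". If the table of contents contains headings like that, the pattern needs to be widened.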

If there are deeper levels of headings, indent them the same way.
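
One way to handle several levels in a single pass is a small table of patterns, sketched here in Python. The first two patterns are the ones used above; the third pattern, for headings like "(1) title", and the names LEVEL_RULES, indent_line and indent_toc are hypothetical.

```python
import re

# (pattern, depth) pairs, tried in order. Adjust them to the book at hand.
LEVEL_RULES = [
    (re.compile(r"^chapter [0-9]{1,2}.*"), 1),
    (re.compile(r"^.,.+"), 2),
    (re.compile(r"^\([0-9]+\).*"), 3),  # hypothetical third level
]


def indent_line(line: str) -> str:
    """Prefix the line with one TAB per level of the first rule it matches."""
    for pattern, depth in LEVEL_RULES:
        if pattern.match(line):
            return "\t" * depth + line
    return line  # top-level "part X" lines and unrecognised lines stay as they are


def indent_toc(text: str) -> str:
    """Apply indent_line to every line of an unindented table of contents."""
    return "\n".join(indent_line(line) for line in text.splitlines())
```

Lines that match no rule are returned unchanged, which keeps irregular titles visible so they can be placed by hand afterwards.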

Handling special headings

The table of contents also contains some irregular titles, such as “What History tells us about today...”. These need to be moved to the appropriate level, either with a regular expression or by hand.

Put the page numbers into the required format

The page number after each title matters: clicking a title in the PDF jumps to the corresponding page only if the page number is set correctly.

  • Exactly one \t must separate a title from its page number
  • So replace the space between each title and its page number with \t

We use a regular expression to find the page number together with the space in front of it, and replace that space with \t. The search expression is (\s{1,1})([0-9]{1,3}) and the replacement is \t$2, as shown below:
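
The same replacement can be sketched in Python. Note one deliberate difference from the expression above: this sketch adds a trailing $ so that only the number at the end of each line is matched. That anchor is an assumption added here, not part of the original expression, and the sample lines are hypothetical.

```python
import re

toc = "part 1 Foundations 1\n\tchapter 1 Getting started 3\n\t\t1, Background 5"

# With re.MULTILINE, $ matches at the end of every line, so a space inside a
# title such as "chapter 1" is left alone and only the final number is affected.
with_tabs = re.sub(r"(\s)([0-9]{1,3})$", r"\t\2", toc, flags=re.MULTILINE)
print(with_tabs)
```

Without the anchor, any space followed by digits would be replaced, which is harmless only when titles never contain a space directly before a number.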

Once you have checked the result and found no errors, the table of contents is ready!
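
Part of that check can be automated. The Python sketch below flags lines that do not end in a single TAB plus a page number, and lines that are nested more than one level deeper than the line before. The second rule is an assumption about what a well-formed outline looks like, and the names LINE_FORMAT and check_toc are hypothetical.

```python
import re

# Optional leading TABs, a title without TABs, exactly one TAB, then the page number.
LINE_FORMAT = re.compile(r"^(\t*)([^\t]+)\t([0-9]+)$")


def check_toc(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    previous_depth = -1  # so that the first line is expected at depth 0
    for number, line in enumerate(text.splitlines(), start=1):
        match = LINE_FORMAT.match(line)
        if match is None:
            problems.append(f"line {number}: expected 'title<TAB>page', got {line!r}")
            continue
        depth = len(match.group(1))
        if depth > previous_depth + 1:
            problems.append(f"line {number}: more than one level deeper than the line before")
        previous_depth = depth
    return problems
```

A script like this only catches formatting slips; whether each page number is actually right still has to be checked against the PDF.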

Add the table of contents with the software

With the steps above complete, you can use FreePic2Pdf to add the table of contents. Below is a screenshot of the software in use:

After this step, open the interface file folder:

Open the .txt file shown above and replace its contents with the formatted table of contents.

Save the modified .txt file, then open the .itf file and save any changes you make to it.

Check whether the bookmarks and page numbers match; if they do not, adjust BasePage.
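
The arithmetic behind that adjustment can be sketched as follows, assuming BasePage is the PDF page on which the book's printed page 1 appears. That reading of BasePage is an assumption and should be confirmed against FreePic2Pdf's own documentation; both function names are hypothetical.

```python
def pdf_page_for(printed_page: int, base_page: int) -> int:
    """PDF page for a printed page, assuming base_page is the PDF page
    on which printed page 1 appears (an assumption about BasePage)."""
    return printed_page + base_page - 1


def base_page_from_sample(printed_page: int, actual_pdf_page: int) -> int:
    """Work backwards from one heading whose real PDF page was looked up by hand."""
    return actual_pdf_page - printed_page + 1
```

For example, if a heading printed on page 3 actually sits on PDF page 17, base_page_from_sample(3, 17) gives 15 under this assumption.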

The final step is shown below:

Final result

Resources

  • Batch-adding a PDF table of contents (the most complete and detailed method) – CSDN
  • How can encrypted PDF files be decrypted effectively? – Li Xiao’s answer – Zhihu
  • Regular expression basics – Juejin