Remove watermarks from PDF pages using Python. The idea is simple, and so is the code.

We first look at how to remove a watermark from an image with Python, then apply the same idea to a PDF.

This picture is a screenshot of a page from the PDF of Data Structures and Algorithms, carrying the watermark of an official account.

As the screenshot shows, the watermark is light in color so that it does not interfere with reading the text. We can therefore use this color difference to remove it: read the color of each pixel with Python and turn the light-colored pixels white.

PIL (Python Imaging Library) is a third-party image library, not part of the Python standard library. On Python 3 it is provided by the Pillow package, which has to be installed separately:

pip install pillow

Once the installation is complete, read the picture and get its dimensions (width and height):

from PIL import Image

img = Image.open('watermark_pic.png')
width, height = img.size

Before we move on, a few words about color in computers. The three primary colors of light are red, green and blue (RGB). They cannot be decomposed further, and other colors can be produced by mixing them. All three mixed at full intensity give white; no light at all gives black.

In a computer, an RGB color can be represented with three bytes, one per channel. The maximum value of one byte is 255, so (255, 0, 0) represents red, (0, 255, 0) represents green, and (0, 0, 255) represents blue. Accordingly, (255, 255, 255) represents white and (0, 0, 0) represents black. Every combination from (0, 0, 0) to (255, 255, 255) represents a different color.
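
As a small illustration of the three-byte representation, here is a sketch using only built-in types. The names `red`, `packed` and `total_colors` are made up for this example.

```python
# Each channel is stored in one byte, so it ranges from 0 to 255.
red = (255, 0, 0)

# Packing the tuple into a bytes object takes exactly three bytes.
packed = bytes(red)
print(len(packed))   # 3

# A channel value above 255 does not fit in a byte and is rejected.
try:
    bytes((256, 0, 0))
except ValueError as err:
    print("rejected:", err)

# Number of distinct colors that three bytes can represent.
total_colors = 256 ** 3
print(total_colors)  # 16777216
```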

Next, we can read the RGB values of the picture with the following code:

for i in range(width):
    for j in range(height):
        pos = (i, j)
        print(img.getpixel(pos)[:3])

The color of each pixel in this picture is represented by a 4-tuple. The first three values are RGB and the fourth is the alpha channel, which we can ignore here.

With RGB, we can modify it.

As can be seen from the figure, the RGB value of the watermark is #d9d9d9 in hexadecimal, which is (217, 217, 217) in decimal.
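
The conversion from the hexadecimal form to the decimal triple can be checked in a couple of lines. This is a minimal sketch; `hex_to_rgb` is a helper name invented for this example and is not part of Pillow.

```python
def hex_to_rgb(hex_color):
    """Convert a color such as '#d9d9d9' to an (R, G, B) tuple of ints."""
    digits = hex_color.lstrip('#')
    # bytes.fromhex turns each pair of hex digits into one byte (0-255).
    return tuple(bytes.fromhex(digits))


print(hex_to_rgb('#d9d9d9'))  # (217, 217, 217)
```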

The closer each of these values gets to 255, the lighter the color becomes, and when all three reach 255 the color is white. So any pixel at least as light as the watermark can be turned white. As a simple test, we treat a pixel as light when the sum of its three RGB values is greater than or equal to 651 (217 × 3).

if sum(img.getpixel(pos)[:3]) >= 651:
    img.putpixel(pos, (255, 255, 255))
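
Comparing the sum against 651 is a shortcut rather than an exact test: a pixel can reach that sum without every channel being at least 217. The sketch below contrasts the two checks. The names `is_light_by_sum` and `is_light_per_channel` are invented for this illustration, and the pale yellow sample color is an arbitrary example.

```python
WATERMARK_LEVEL = 217  # channel value of the watermark color #d9d9d9


def is_light_by_sum(rgb, threshold=3 * WATERMARK_LEVEL):
    """The check used in this article: sum of R, G, B is at least 651."""
    return sum(rgb[:3]) >= threshold


def is_light_per_channel(rgb, level=WATERMARK_LEVEL):
    """A stricter check: every channel is at least 217."""
    return all(channel >= level for channel in rgb[:3])


watermark = (217, 217, 217)
pale_yellow = (255, 255, 141)  # 255 + 255 + 141 == 651

print(is_light_by_sum(watermark), is_light_per_channel(watermark))      # True True
print(is_light_by_sum(pale_yellow), is_light_per_channel(pale_yellow))  # True False
```

On pages that are mostly black text on white the two checks should rarely disagree, but on pages with colored figures the per-channel version may be the safer choice.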

The complete code is as follows:

from PIL import Image

img = Image.open('watermark_pic.png')
width, height = img.size

for i in range(width):
    for j in range(height):
        pos = (i, j)
        if sum(img.getpixel(pos)[:3]) >= 651:
            img.putpixel(pos, (255, 255, 255))

img.save('watermark_removed_pic.png')

With this foundation, removing the watermark from a PDF is straightforward: convert each PDF page into a picture, modify the RGB values of the watermark pixels, and finally save the picture.

Install the PyMuPDF library to work with PDF files:

pip install pymupdf

Open the PDF and render each page as an image:

import fitz


doc = fitz.open("Manual of Data Structures and Algorithms @ public code.pdf")

for page in doc:
    pix = page.get_pixmap()

The PDF has 480 pages, so we traverse every page and get the image pix for each one. The pix object is similar to the img object we saw above: its RGB values can be read and modified.

The page.get_pixmap() operation is one-way: it converts a PDF page to an image, but after the image's RGB values are changed, the result cannot be written back into the PDF and can only be saved as an image.

Modify the RGB values of the watermark as before. The difference is that each pixel here is a triple with no alpha channel. The code is as follows:

from itertools import product

for pos in product(range(pix.width), range(pix.height)):
    if sum(pix.pixel(pos[0], pos[1])) >= 651:
        pix.set_pixel(pos[0], pos[1], (255, 255, 255))
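
Here `itertools.product` simply replaces the two nested `for` loops used in the image example. A quick check on a tiny 2 × 3 grid (sizes chosen arbitrarily for this sketch) suggests that both forms visit the same positions in the same order:

```python
from itertools import product

width, height = 2, 3

# Positions visited by the nested loops used for the image.
nested = []
for i in range(width):
    for j in range(height):
        nested.append((i, j))

# Positions produced by itertools.product.
flat = list(product(range(width), range(height)))

print(flat)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(flat == nested)  # True
```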

The complete code is as follows:

from itertools import product
import fitz


doc = fitz.open("Manual of Data Structures and Algorithms @ public code.pdf")

page_no = 0
for page in doc:
    pix = page.get_pixmap()

    for pos in product(range(pix.width), range(pix.height)):
        if sum(pix.pixel(pos[0], pos[1])) >= 651:
            pix.set_pixel(pos[0], pos[1], (255, 255, 255))

    # The pdf_pics directory must already exist.
    pix.pil_save(f"pdf_pics/page_{page_no}.png", dpi=(30000, 30000))

    print(f'Page {page_no} watermark removal done')
    page_no += 1

This approach has drawbacks. First, the output is not in PDF format. Second, the output pictures are blurry. Both need follow-up work; ideally the PDF would be modified directly.
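
One likely cause of the blur is that page.get_pixmap() renders at 72 dpi unless told otherwise. PyMuPDF accepts a scaling matrix, for example page.get_pixmap(matrix=fitz.Matrix(zoom, zoom)), to render at a higher resolution. The sketch below only does the arithmetic for choosing zoom and does not call PyMuPDF. The helper names, the 300 dpi target and the A4 page size (595 × 842 points) are assumptions made for illustration.

```python
BASE_DPI = 72  # one PDF point is 1/72 inch


def zoom_for_dpi(target_dpi):
    """Scale factor that would go into fitz.Matrix(zoom, zoom)."""
    return target_dpi / BASE_DPI


def rendered_size(page_width_pt, page_height_pt, zoom):
    """Approximate pixel size of a page rendered with the given zoom."""
    return round(page_width_pt * zoom), round(page_height_pt * zoom)


zoom = zoom_for_dpi(300)
print(rendered_size(595, 842, 1))     # (595, 842)
print(rendered_size(595, 842, zoom))  # (2479, 3508)
```

Keep in mind that the pixel loop visits every pixel, so doubling the zoom roughly quadruples the work per page.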

I will keep sharing Python basics and useful tools.