Van came Python | Saving crawled data as docx and PDF

This is day 3 of my participation in the November post-update challenge. For details, see: the last post-update challenge of 2021

0x1. Introduction

After the previous installment, "Van came Python | A simple crawl of a certain planet", a reader sent me a private message in the background:

"The crawl results are saved as Markdown, which is not easy to read on a phone."

Outrageous... ahem. Still, I may need this myself later, so it is worth tinkering with. During some slacking-off time in the afternoon I searched a few keywords and found two conventional approaches. Let's try both, starting with the library that looks simpler ↓

pandoc

0x2. First experience with the pandoc library

It supports a huge, huge, huge number of format conversions, as this tangled diagram shows:

You don't need to study the list of supported formats; almost anything you can think of is covered. What we want here is to convert Markdown to PDF.

For installation details, see INSTALL.md. On Windows you can either download the zip package directly or install it with choco.

Then configure the environment variable so that pandoc can be run from any directory:

Unzip the package → open the folder → select pandoc.exe → hold Shift and right-click → Copy as path → This PC → right-click and open Properties → Advanced system settings → Environment Variables → under System variables (S) → find Path → Edit → New → paste the path you just copied (the folder part, without pandoc.exe):

After the configuration, open CMD and type: pandoc -v.
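The same check can be done from Python before kicking off a batch conversion. This is a minimal sketch using only the standard library; the helper name `find_tool` is made up for illustration:

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("pandoc")
    if path is None:
        print("pandoc was not found on PATH; re-check the environment variable")
    else:
        print("pandoc found at:", path)
```

The same helper works for any other command-line tool that is expected to be on PATH.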

If the configuration does not take effect, you can simply drop the unzipped files into Python's Scripts directory. Run the where python command to find the Python installation directory.

Go to the path it prints and drop all the files there:

Once that is done, here is how to use it:

It is fairly simple; everything is done on the command line:

pandoc -o <converted file> <file to convert>

Convert TXT to PDF:

A LaTeX engine needs to be specified, with the following options:

That is a bit of a hassle: a LaTeX distribution has to be installed separately, and some of them are over 4 GB. So let's convert directly to Word document format (docx) instead:

pandoc 123.txt -o test.docx

Open it to see the result:

OK. The rest is routine: traverse the folder, build the command string, and execute it with subprocess, as in the following code:

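import os
import subprocess

# output_root_dir, doc_save_dir and cp_file_utils come from the rest of the author's project


def md_to_doc(file_path):
    # Quote both paths so that paths containing spaces still work
    cmd = 'pandoc "{}" -o "{}"'
    sep_split = file_path.split(os.path.sep)
    # Switch to the directory that holds the images
    os.chdir(output_root_dir)
    # Check whether the output folder exists; create it if it does not
    cp_file_utils.is_dir_existed(os.path.join(doc_save_dir, sep_split[-2]))
    # Same name as the source file, with the extension replaced by .docx
    file_name = os.path.splitext(sep_split[-1])[0]
    doc_file_path = os.path.join(doc_save_dir, sep_split[-2], file_name + '.docx')
    subprocess.call(cmd.format(file_path, doc_file_path), shell=True)
    print("Generate file:", doc_file_path)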
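The folder traversal itself is not shown above, so here is one way it could look. This is a minimal sketch, not the author's script: the function names are made up, the flat output folder is a simplification, and the pandoc command is passed as an argument list so that no shell quoting is needed.

```python
import os
import subprocess


def find_markdown_files(root_dir):
    """Yield the path of every .md file under root_dir, in sorted order."""
    for dir_path, dir_names, file_names in os.walk(root_dir):
        dir_names.sort()  # make the traversal order deterministic
        for file_name in sorted(file_names):
            if file_name.lower().endswith(".md"):
                yield os.path.join(dir_path, file_name)


def build_pandoc_cmd(src_path, dst_path):
    """Build the pandoc command as an argument list."""
    return ["pandoc", src_path, "-o", dst_path]


def convert_all(root_dir, save_dir):
    """Convert every .md file under root_dir to a .docx file in save_dir."""
    save_dir = os.path.abspath(save_dir)
    os.makedirs(save_dir, exist_ok=True)
    for md_path in find_markdown_files(os.path.abspath(root_dir)):
        base_name = os.path.splitext(os.path.basename(md_path))[0]
        docx_path = os.path.join(save_dir, base_name + ".docx")
        # Run pandoc from the file's own folder so relative image paths resolve
        subprocess.run(build_pandoc_cmd(md_path, docx_path),
                       cwd=os.path.dirname(md_path), check=True)
        print("Generate file:", docx_path)
```

Because `check=True` is set, a failed conversion raises `subprocess.CalledProcessError` instead of passing silently. Files with the same name in different subfolders would overwrite each other in this flat layout.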

With the files generated, the next step is a script that merges all these Word documents into one. It needs the following libraries (just install them with pip):

pip install python-docx
pip install docxcompose

Next, straight to grinding out the code:

from docx import Document
from docxcompose.composer import Composer


def compose_docx(docx_list):
    # The first file serves as the master document
    master = Document(docx_list[0])
    master.add_page_break()  # Force a new page
    composer = Composer(master)
    # Append each subsequent file to the master
    for docx in docx_list[1:]:
        print("Current processing file:", docx)
        temp = Document(docx)
        temp.add_page_break()
        composer.append(temp)
    composer.save("result.docx")
    print("File merge completed...")

Run it and wait for the program to finish. The default merge order is by file name, and the files were named with timestamps when they were created, so there is no need to worry about the order. Here is the merged document:
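If the file list is gathered by hand, sorting the names explicitly makes the order obvious. This is a minimal sketch, assuming every file name starts with a fixed-width timestamp so that lexicographic order equals chronological order; `list_docx_in_order` is a hypothetical helper name:

```python
from pathlib import Path


def list_docx_in_order(folder):
    """Return the .docx paths directly inside folder, sorted by file name."""
    return sorted(str(p) for p in Path(folder).glob("*.docx"))
```

The returned list can be passed straight to a merge function that takes a list of docx paths.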

841 MB and 2,927 pages. WPS froze the moment it opened the file:

0x3. A slightly more troublesome approach

On to the second approach: render the Markdown into HTML, then convert the HTML to PDF, using these two libraries:

pip install markdown
pip install pdfkit

You also need wkhtmltopdf. As before, download the zip package and configure the environment variable:

Run the wkhtmltopdf -v command to check whether the configuration takes effect ~

Then you can go ahead and write the following test demo:

import pdfkit
from markdown import markdown


def md_to_pdf(file_path):
    # cp_file_utils is the author's own helper module; this call reads the file's text
    html = markdown(cp_file_utils.read_file_text_content(file_path), output_format='html')
    pdfkit.from_string(html, "out.pdf", options={'encoding': 'utf-8'})

Pass in the path of the md file. If the following error appears after running:

pdfkit OSError: No wkhtmltopdf executable found

then the environment variable set above has not taken effect. Open a new terminal window, or specify the path in code as follows:

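import pdfkit

# Point pdfkit at the wkhtmltopdf executable explicitly
config = pdfkit.configuration(wkhtmltopdf=r"D:\xxx\bin\wkhtmltopdf.exe")
# html is an HTML string, so use from_string (from_url expects a URL)
pdfkit.from_string(html, filename, configuration=config)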

Then the following error was reported:

It is raised when an external resource is referenced but cannot be found. Wrap the call in try-except to catch the exception:

def md_to_pdf(file_path):
    html = markdown(cp_file_utils.read_file_text_content(file_path), output_format='html')
    try:
        pdfkit.from_string(html, "out.pdf", options={'encoding': 'utf-8'})
    except IOError:
        # Ignore the exception raised for missing external resources
        pass
    finally:
        print("File generated...")
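Swallowing the exception means "File generated..." is printed even when nothing was written, for example when the wkhtmltopdf executable is missing. A slightly safer variant is to check afterwards whether an output file really exists. This is a minimal sketch; `render_and_check` is a hypothetical helper, and `render` is any zero-argument callable that performs the conversion:

```python
import os


def render_and_check(render, out_path):
    """Run render(), tolerate OSError, and report whether out_path was written."""
    if os.path.exists(out_path):
        os.remove(out_path)  # drop any stale file from an earlier run
    try:
        render()
    except OSError as e:
        # pdfkit signals wkhtmltopdf problems with IOError, an alias of OSError
        print("Rendering reported a problem:", e)
    ok = os.path.isfile(out_path) and os.path.getsize(out_path) > 0
    print("File generated..." if ok else "No usable file was produced")
    return ok
```

With pdfkit this could be called as `render_and_check(lambda: pdfkit.from_string(html, "out.pdf", options={'encoding': 'utf-8'}), "out.pdf")`.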

After running, open the generated PDF and see the result:

The result is acceptable, but the default rendering does not support annotations, tables, LaTeX, code blocks, flowcharts, sequence diagrams, or Gantt charts. Those require extensions.
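Styling is a separate matter from extensions: `markdown()` returns an HTML fragment with no `<head>`, so there is nowhere to put a charset declaration or a stylesheet. This is a minimal sketch, assuming you wrap the fragment yourself before handing it to `pdfkit.from_string`; `wrap_html` and `EXAMPLE_CSS` are hypothetical names, not part of either library:

```python
def wrap_html(body_html, css=""):
    """Wrap an HTML fragment in a complete page with a UTF-8 charset."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        "<style>\n" + css + "\n</style>\n"
        "</head>\n<body>\n" + body_html + "\n</body>\n</html>\n"
    )


# Hypothetical styling: visible table borders and a gray background for code
EXAMPLE_CSS = """
table { border-collapse: collapse; }
th, td { border: 1px solid #999; padding: 4px 8px; }
pre { background: #f5f5f5; padding: 8px; }
"""
```

The wrapped page, for example `wrap_html(html, EXAMPLE_CSS)`, can then be passed to `pdfkit.from_string` in place of the bare fragment.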

Markdown module extension

The following is an example:

# Enable the tables extension
html = markdown(text, output_format='html', extensions=['tables'])

Third-party extensions

The following is an example:

# Install the math package
pip install python-markdown-math

# Enable the math extension
text = markdown(text, output_format='html', extensions=['mdx_math'])

Export HTML with a third-party Markdown rendering tool (such as Zuoye Buluo), then convert it to PDF

The following is an example:

pdfkit.from_file('test.html', 'test.pdf', options={'encoding': 'utf-8'})

For more details, see: Python converts MarkDown to PDF (perfect HTML rendering, LaTeX, tables, etc.)

0x4. Summary

It all looks simple, but if you need deeply customized styles you will have to tinker with it over and over. Fortunately I don't; as long as it is readable, that is good enough for me. Interested readers can explore further on their own ~

That is one more technique for saving crawled data, and a better reading experience too, 23333. That's all for this article. If you have any questions, feel free to point them out in the comments. Thanks ~
