The text and images in this article come from the internet and are shared for learning and exchange only, with no commercial purpose. Copyright belongs to the original author; if you have any concerns, please contact us.

Author: Liu Zhi-qi Source: CSDN

Link to this article: blog.csdn.net/weixin\_418…

This article explains how to extract the text from a PPT file and write it into a Word document, using python-pptx and python-docx to move content between the two formats.

Like the earlier articles, this one comes from a real office automation need!

I. Requirements description

We have a PowerPoint presentation that introduces Python. The task is to extract all of the text from its slides and write it into a Word document.

II. Background knowledge

The code is actually very simple: it relies on two modules, python-pptx and python-docx, and the core logic is only 6 lines. Before writing it, though, you need to be familiar with how PPT and Word files are structured. Start with the structure of a Word document.

Leaving aside tables, pictures, and so on, a text-only Word document has a three-level structure: document, paragraph, and run (a block of text inside a paragraph). The structure of a PPT file is considerably more complex than that of Word, which is partly a result of how customizable and extensible PPT is.

Put simply, a PPT file is a presentation whose basic structure is presentation, slide, shape. Shapes fall into two groups: those that contain text and those that do not (pure pictures, for example). A shape that contains text exposes a text frame, which can be thought of as a small Word document made up of paragraphs and runs.

With this knowledge you can write code.

III. Python implementation

Import the required modules first

    from pptx import Presentation
    from docx import Document

Note that the packages are installed as python-pptx and python-docx, but they are imported as pptx and docx. The two modules have this in common:

  • The installation name is different from the import name

  • Install with python- followed by the newer file-format suffix (pptx, docx); import with the suffix alone
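The difference between the two names can be checked from Python itself. The sketch below uses only the standard library (importlib.metadata requires Python 3.8 or later); the describe function and the PACKAGES list are hypothetical names introduced for this illustration.

```python
from importlib import metadata, util

# (name used with pip, name used with import)
PACKAGES = [("python-pptx", "pptx"), ("python-docx", "docx")]


def describe(dist_name, import_name):
    """Report whether a package is installed and importable under its two names."""
    try:
        version = metadata.version(dist_name)    # looked up by the pip name
    except metadata.PackageNotFoundError:
        version = None
    importable = util.find_spec(import_name) is not None  # looked up by the import name
    return {
        "pip_name": dist_name,
        "import_name": import_name,
        "version": version,
        "importable": importable,
    }


for dist_name, import_name in PACKAGES:
    print(describe(dist_name, import_name))
```

If version comes back as None, the package still needs to be installed under its pip name, for example python-docx rather than docx.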

Next, open the PPTX file and create the Word document

    wordfile = Document()
    # Path to the PPT file
    filepath = r'xxxxxxxx'
    pptx = Presentation(filepath)

Then walk through the presentation and write its text into the Word document

    # Walk through every slide in the PPT file
    for slide in pptx.slides:
        # Walk through every shape on the slide
        for shape in slide.shapes:
            # Continue only if the shape has a text frame
            if shape.has_text_frame:
                # Get the text frame
                text_frame = shape.text_frame
                # Walk through every paragraph in the text frame
                for paragraph in text_frame.paragraphs:
                    # Write the paragraph text into the Word document
                    wordfile.add_paragraph(paragraph.text)

The loop stops at the paragraph level and writes each paragraph to Word rather than descending to individual runs, because paragraphs match the way people read. Descending to runs is generally needed only when you have to operate on specific pieces of text inside a paragraph. Finally, remember to save the Word file

    save_path = r'xxxxxxxx'
    wordfile.save(save_path)

Summary

This is a real case, lightly adapted, and it shows how Python office automation can free up our hands. Before writing an automation script, though, it is necessary to understand the underlying file structure and have a clear plan. Finally, I hope it is clear that one of the core uses of Python office automation is batch operations: free your hands and automate repetitive, tedious tasks!
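As one possible illustration of the batch idea, the sketch below applies a conversion function to every .pptx file in a folder. Everything here is an assumption made for the example: batch_convert and fake_convert are hypothetical names, and fake_convert is only a stand-in so the demo runs with the standard library alone. In real use, convert would be a function that applies the slide, shape, paragraph loop from this article to a single file.

```python
import tempfile
from pathlib import Path


def batch_convert(src_dir, dst_dir, convert):
    """Run convert(src, dst) for every .pptx file in src_dir.

    Each output keeps the input's base name with a .docx extension.
    Returns the list of output paths, in sorted order.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(src_dir.glob("*.pptx")):
        dst = dst_dir / (src.stem + ".docx")
        convert(src, dst)
        outputs.append(dst)
    return outputs


def fake_convert(src, dst):
    """Stand-in converter for the demo: it only records the source name."""
    dst.write_text("converted from " + src.name, encoding="utf-8")


# Demo in a temporary folder with two empty decks and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "decks"
    inbox.mkdir()
    for name in ("b.pptx", "a.pptx", "notes.txt"):
        (inbox / name).write_bytes(b"")
    results = batch_convert(inbox, Path(tmp) / "docs", fake_convert)
    names = [p.name for p in results]
    contents = [p.read_text(encoding="utf-8") for p in results]

print(names)
```

Passing the converter in as an argument keeps the folder handling separate from the per-file logic, so the same loop could drive other file-by-file tasks.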