Hello, I’m Xiao Zhang, long time no see

Today’s article walks through a practical office-automation case. The idea comes from a demo I built for a reader two days ago. The requirement, roughly, is to extract the contents of hundreds of tables in Word documents (all the tables share the same layout) and automatically store the extracted contents in Excel

The table in Word looks like this

There are currently a number of Word documents in this form that need to be collated. The goal is to use Python to generate an Excel table automatically, in the following form

Before getting into the case itself, take a look at the result. The script first converts the .doc files in the specified folder to .docx, then automatically generates an Excel table containing everything from the Word tables

Libraries involved

This case uses the following Python libraries

python-docx
pandas
os
pywin32

Doc to DOCX

In this case, the table contents in Word are extracted with the python-docx library.

python-docx can only handle .docx files, so before extracting the table contents we need to convert the files from .doc to .docx.

The simplest way to convert .doc to .docx is to open the file in Word and manually save it as .docx. That is fine for a single document, but tedious for hundreds.

This is where the Python library pywin32 comes in. As an extension module, pywin32 wraps a large number of Windows API functions, which makes it possible to drive application components such as Office, delete specified files, read the mouse coordinates, and so on

We use pywin32 to control the Word component of Office so that it opens and saves files automatically, converting every .doc file to .docx. There are three steps:

1. Create a Word component

from win32com import client as wc
word = wc.Dispatch('Word.Application')

2. Open the Word file

doc = word.Documents.Open(path)

3. Save and close

doc.SaveAs(save_path,12, False, "", True, "", False, False, False, False)
doc.Close()

The complete code

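import os
from win32com import client as wc

def doc_to_docx(folder):
    # Use an absolute folder path so Word does not resolve it against its own working directory
    folder = os.path.abspath(folder)
    doc_list = [os.path.join(folder, name) for name in os.listdir(folder)
                if name.lower().endswith('.doc')]
    word = wc.Dispatch('Word.Application')
    try:
        for doc_path in doc_list:
            # Swap only the extension; str.replace('doc', 'docx') would also hit 'doc' inside folder names
            save_path = os.path.splitext(doc_path)[0] + '.docx'
            doc = word.Documents.Open(doc_path)
            # 12 is wdFormatXMLDocument, i.e. the .docx format
            doc.SaveAs(save_path, 12, False, "", True, "", False, False, False, False)
            doc.Close()
            print('{} saved successfully'.format(save_path))
    finally:
        # Close Word even if one of the files fails to convert
        word.Quit()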
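Before launching Word, it can help to preview which files would be converted and what each target would be called. Deriving the target name from the extension alone, rather than replacing every 'doc' substring, keeps paths such as docs/doc_notes.doc intact. The following is a minimal sketch of that idea; the helper name plan_conversions and the sample file names are hypothetical and not part of the original script.

```python
import os

def plan_conversions(file_names, folder):
    """Map each .doc file name to a (source path, target .docx path) pair."""
    plan = []
    for name in file_names:
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".doc":
            continue  # skip files that are already .docx, and anything else
        source = os.path.join(folder, name)
        target = os.path.join(folder, stem + ".docx")
        plan.append((source, target))
    return plan

# Sample names, made up for illustration
plan = plan_conversions(["a.doc", "b.docx", "doc_notes.doc", "c.txt"], "docs")
for source, target in plan:
    print(source, "->", target)
```

In a real run, the first argument would be os.listdir(folder), and each pair could be handed to Documents.Open and SaveAs as in the code above.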

Extracting the contents of a single table with python-docx

Before the batch run, we first need to handle the contents of a single table. Once a single Word file is handled, adding a loop takes care of the rest

python-docx extracts the table contents in Word mainly through the Table, rows and cells objects

A Table object represents one table; rows is the sequence of rows in the table and can be iterated over, and cells is the sequence of cells in a row, likewise iterable

To do this, you need to know the following basics

  • docx.Document() takes a file path and returns a Document object;

  • document.tables returns a list of the tables in the document;

  • table.rows returns the rows of the table;

  • row.cells returns a list of the cells contained in the row;

  • cell.text returns the text of the cell

With that in place, the approach becomes clear. The text of a Word table can be collected with two nested for loops: the outer loop walks over the row objects of the table, and the inner loop visits each cell of the row, reading its text through cell.text

Let’s see if this idea works in code

import docx

# doc_path is the path to one converted .docx file
document = docx.Document(doc_path)
for table in document.tables:
    for row_index, row in enumerate(table.rows):
        for col_index, cell in enumerate(row.cells):
            print('pos index is ({},{})'.format(row_index, col_index))
            print('cell text is {}'.format(cell.text))

Running it shows that the extracted content is repeated.

For example, the cells from (1,1) to (1,5) in the table below are merged. python-docx does not collapse such a merged cell into a single entry; it reports one cell for every column the merged cell spans. So during the for iteration, (1,1)->(1,5) comes back as five cells, each returning the same text

To deal with this duplicated text, we need a deduplication step. The field labels (name, gender, age, …) are collected as col_keys, and the filled-in values (…, bachelor, etc.) as col_values. While extracting, keep a running index over the cells that survive deduplication: even positions go to col_keys, odd positions to col_values;
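The even/odd split can be tried on a plain list of cell texts first. This is a minimal sketch, assuming the cells arrive as a flat list of strings in reading order; the function name split_keys_values and the sample data are hypothetical.

```python
def split_keys_values(cell_texts):
    """Drop consecutive repeats, then send even positions to keys and odd ones to values."""
    keys, values = [], []
    previous = None   # text of the last cell that was kept
    kept = 0          # how many cells survived deduplication so far
    for text in cell_texts:
        if text == previous:
            continue  # same text as the cell before it: treat as a merged-cell repeat
        if kept % 2 == 0:
            keys.append(text)
        else:
            values.append(text)
        previous = text
        kept += 1
    return keys, values

# Made-up cells: "Zhang" spans two columns, "Bachelor" spans three
cells = ["Name", "Zhang", "Zhang", "Gender", "Male",
         "Education", "Bachelor", "Bachelor", "Bachelor"]
keys, values = split_keys_values(cells)
print(keys)    # ['Name', 'Gender', 'Education']
print(values)  # ['Zhang', 'Male', 'Bachelor']
```

One caveat of comparing only with the previous cell: if a value happens to equal the text of the cell right before it, it is dropped as a repeat and every later key and value shifts by one position.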

The code is refactored as follows:

import docx

document = docx.Document(doc_path)
col_keys = []    # field names
col_values = []  # field values
index_num = 0
fore_str = ''    # text of the previous cell, used to skip merged-cell repeats
for table in document.tables:
    for row_index, row in enumerate(table.rows):
        for col_index, cell in enumerate(row.cells):
            if fore_str != cell.text:
                if index_num % 2 == 0:
                    col_keys.append(cell.text)
                else:
                    col_values.append(cell.text)
                fore_str = cell.text
                index_num += 1
print(f'col keys is {col_keys}')
print(f'col values is {col_values}')

The final extraction effect is as follows

Batch extraction from Word, saved to a CSV file

Once a single Word file can be processed, a loop over all the files extracts every table and pandas writes the collected contents to a CSV file.

import docx
import pandas as pd

def GetData_frompath(doc_path):
    document = docx.Document(doc_path)
    col_keys = []    # field names
    col_values = []  # field values
    index_num = 0
    fore_str = ''    # text of the previous cell, used to skip merged-cell repeats
    for table in document.tables:
        for row_index, row in enumerate(table.rows):
            for col_index, cell in enumerate(row.cells):
                if fore_str != cell.text:
                    if index_num % 2 == 0:
                        col_keys.append(cell.text)
                    else:
                        col_values.append(cell.text)
                    fore_str = cell.text
                    index_num += 1
    return col_keys, col_values

# wordlist_path (list of .docx paths) and word_paths (output folder) are assumed to be defined beforehand
pd_data = []
for index, single_path in enumerate(wordlist_path):
    col_names, col_values = GetData_frompath(single_path)
    if index == 0:
        pd_data.append(col_names)  # header row, taken from the first document only
    pd_data.append(col_values)
df = pd.DataFrame(pd_data)
# header=False keeps pandas from writing its own 0, 1, 2, ... labels above the header row
df.to_csv(word_paths + '/result.csv', encoding='utf_8_sig', index=False, header=False)
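The row assembly can also be checked in isolation. The sketch below takes the (keys, values) pairs that a function like GetData_frompath returns and builds the list of rows, writing the header once and raising an error if a later document has different field names. It swaps pandas for the standard library csv module so that it runs without extra dependencies; build_rows and the sample records are hypothetical.

```python
import csv
import io

def build_rows(records):
    """records: iterable of (col_keys, col_values) pairs, one pair per document."""
    rows = []
    header = None
    for keys, values in records:
        if header is None:
            header = list(keys)
            rows.append(header)  # header row comes from the first document
        elif list(keys) != header:
            raise ValueError("document has different field names: {}".format(keys))
        rows.append(list(values))
    return rows

# Made-up records standing in for two extracted documents
records = [
    (["Name", "Age"], ["Zhang", "28"]),
    (["Name", "Age"], ["Li", "31"]),
]
rows = build_rows(records)

# Write to an in-memory buffer instead of a file, just to inspect the result
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

The header check is an extra safeguard that the original loop does not have; since all tables in this case share one layout, it should never trigger, but it makes a stray document easy to spot.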

Formatting the contact and ID number columns

When we open the generated CSV file, we find that the contact information and ID number columns are displayed as numbers, which is not what we want. To show them in full, the values need to be converted to text before they are stored

The solution is to locate those fields and prepend a tab character '\t' to each

col_values[7] = '\t'+col_values[7]
col_values[8] = '\t'+col_values[8]
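The same trick can be wrapped in a small helper so the positions are not hard-coded in two separate lines. This is a minimal sketch; as_text_fields and the sample row are hypothetical, and the numbers are made up.

```python
import csv
import io

def as_text_fields(values, positions):
    """Return a copy of values with a tab prefixed to the fields at the given positions."""
    protected = list(values)
    for pos in positions:
        protected[pos] = "\t" + protected[pos]
    return protected

# Made-up row: name, contact number, ID number
row = ["Zhang", "13800000000", "110101199001011234"]
protected = as_text_fields(row, [1, 2])

# Round trip through CSV: the tab is stored as part of the field
buffer = io.StringIO()
csv.writer(buffer).writerow(protected)
read_back = next(csv.reader(io.StringIO(buffer.getvalue())))
print(read_back)
```

Note that the tab stays in the stored data, so any other program reading the CSV needs to strip it, for example with lstrip('\t').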

Source code

To get the source code used in this case, follow the WeChat official account Xiaozhang Python and reply with the keyword 210328.

Summary

This case uses only part of what python-docx offers, mainly the basic operations on tables in Word. Readers doing clerical work may well run into similar problems day to day, which is why it is shared here; I hope it helps

Well, that’s all for this article, thanks for reading, and we’ll see you next time!