Hello, I’m Xiao Zhang, long time no see
Today’s article walks through a practical office-automation case. The idea comes from a demo I made for a reader two days ago. The requirement is to extract the contents of several hundred tables from Word documents (all the tables share the same layout) and automatically store the extracted content in Excel.
The table in Word looks like this:
There are a number of Word documents with the table above that need to be collected. The goal is to use Python to automatically generate an Excel table in the following form:
Before getting into the case itself, take a look at the conversion result. The script first converts the doc files in the specified folder into docx files, and then automatically generates an Excel table that contains all the content from the Word files.
Libraries involved
This case uses the following Python libraries:
python-docx
pandas
os
pywin32
Converting doc to docx
In this case, the table content in Word is extracted with the python-docx library.
python-docx can only handle .docx files, so before extracting the table contents we need to convert the files from .doc to .docx.
The easiest way to convert a .doc file is to open it in Word and manually save it as a .docx file. That is fine for a single document, but tedious for hundreds of them.
Here we use the Python library pywin32 to solve this problem. As an extension module, pywin32 wraps a large number of Windows API functions, covering tasks such as driving Office application components, deleting specified files, and reading the mouse coordinates.
We use pywin32 to control Word so that it opens each file and saves it again, converting every .doc file into a .docx file. There are three steps:
1. Create a Word component
from win32com import client as wc
word = wc.Dispatch('Word.Application')
2. Open the Word file
doc = word.Documents.Open(path)
3. Save and close
doc.SaveAs(save_path,12, False, "", True, "", False, False, False, False)
doc.Close()
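The three steps can also be wrapped in one function. The following is only a sketch: the helper name convert_doc_to_docx and the constant name WD_FORMAT_XML_DOCUMENT are made up for illustration, and the value 12 is the same file-format code passed to SaveAs above (wdFormatXMLDocument in Word's object model). It assumes a Windows machine with Word and pywin32 installed.

```python
# Value of Word's wdFormatXMLDocument save format, i.e. the .docx format.
WD_FORMAT_XML_DOCUMENT = 12


def convert_doc_to_docx(src_path, dst_path):
    """Open src_path in Word and save a .docx copy at dst_path (hypothetical helper)."""
    # Imported lazily so this module can still be imported where pywin32 is missing.
    from win32com import client as wc

    word = wc.Dispatch('Word.Application')
    try:
        # Absolute paths are safest, because Word does not run in the script's folder.
        doc = word.Documents.Open(src_path)
        try:
            doc.SaveAs(dst_path, WD_FORMAT_XML_DOCUMENT)
        finally:
            doc.Close()
    finally:
        # Always shut Word down, even if opening or saving failed.
        word.Quit()
```

The try/finally blocks are a design choice rather than a requirement: they keep a failed conversion from leaving a hidden Word process running in the background.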
The complete code
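import os
from win32com import client as wc

# path is the folder that holds the .doc files; use an absolute path so Word can find them
path_list = os.listdir(path)
doc_list = [os.path.join(path, name) for name in path_list if name.lower().endswith('.doc')]
word = wc.Dispatch('Word.Application')
print(doc_list)
for doc_path in doc_list:
    print(doc_path)
    # Change only the extension; replace('doc', 'docx') would also alter folder names containing 'doc'
    save_path = os.path.splitext(doc_path)[0] + '.docx'
    doc = word.Documents.Open(doc_path)
    # 12 is wdFormatXMLDocument, the .docx format
    doc.SaveAs(save_path, 12, False, "", True, "", False, False, False, False)
    doc.Close()
    print('{} saved successfully'.format(save_path))
word.Quit()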
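Two details of the path handling are easy to get wrong: picking out only the .doc files, and building the output name. The sketch below uses only the standard library. The helper names is_legacy_doc and docx_save_path and the sample path are hypothetical, chosen just to show why swapping the extension is safer than a plain string replace.

```python
import os


def is_legacy_doc(filename):
    """True for .doc files, skipping .docx files and Word's '~$' lock files."""
    name = filename.lower()
    return name.endswith('.doc') and not name.startswith('~$')


def docx_save_path(doc_path):
    """Swap only the file extension, leaving the folder names untouched."""
    root, _ext = os.path.splitext(doc_path)
    return root + '.docx'


# Hypothetical path whose folder name also contains 'doc'.
sample = 'C:/docs/resume.doc'
print(sample.replace('doc', 'docx'))  # C:/docxs/resume.docx (folder name changed too)
print(docx_save_path(sample))         # C:/docs/resume.docx
```

The filter is meant to be applied to bare file names, such as the entries returned by os.listdir.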
Extracting the contents of a single table with python-docx
Before the batch operation, we first need to handle the contents of a single table. Once a single Word file works, the remaining files only need a loop.
python-docx extracts table content from Word mainly through the Table, rows, and cells objects.
Table represents a table, rows is the list of rows in a table, and cells is the list of cells in a row; both lists can be iterated over.
To do this, you need to know the following basic functions:
- docx.Document reads the file at the given path and returns a Document object
- Document.tables returns a list of the tables in the Word file
- Table.rows returns the list of rows in the table
- Row.cells returns the list of cells contained in the row
- Cell.text returns the text of the cell
With that in mind, the approach becomes clear. The text of a Word table can be read with two nested for loops inside a loop over the tables: the first for loop walks through all row objects in the table, the second walks through the cells of each row, and cell.text gives the text of each cell.
Let’s see whether this idea works in code:
import docx

# doc_path is the path of one converted .docx file
document = docx.Document(doc_path)
for table in document.tables:
    for row_index, row in enumerate(table.rows):
        for col_index, cell in enumerate(row.cells):
            print('pos index is ({},{})'.format(row_index, col_index))
            print('cell text is {}'.format(cell.text))
Running this shows that the extracted content contains repeated text.
For example, the cells from (1,1) to (1,5) in the table below are merged. python-docx does not report such a merged cell once; it reports every grid position the merged cell covers. As a result, the for loop returns five cells for (1,1)->(1,5), and each of them returns the same text.
To deal with the repeated text, we add a deduplication step. The field labels (name, gender, age, …) are collected as col_keys, and the entries filled in next to them (…, bachelor, and so on) are collected as col_values. While extracting we keep a running index: entries at even positions go into col_keys and entries at odd positions go into col_values.
The code is refactored as follows:
import docx

document = docx.Document(doc_path)
col_keys = []    # field names
col_values = []  # field values
index_num = 0
fore_str = ''    # text of the previous cell, used to skip repeats
for table in document.tables:
    for row_index, row in enumerate(table.rows):
        for col_index, cell in enumerate(row.cells):
            if fore_str != cell.text:
                if index_num % 2 == 0:
                    col_keys.append(cell.text)
                else:
                    col_values.append(cell.text)
                fore_str = cell.text
                index_num += 1
print(f'col keys is {col_keys}')
print(f'col values is {col_values}')
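The deduplication logic can be tried out on a plain list of strings, without a Word file. In the sketch below, the function name split_keys_values and the sample cell texts are hypothetical; the logic mirrors the even/odd index idea described above.

```python
def split_keys_values(cell_texts):
    """Drop consecutive repeats, then send even positions to keys and odd ones to values."""
    col_keys, col_values = [], []
    previous = None
    index_num = 0
    for text in cell_texts:
        if text == previous:
            continue  # repeated text coming from a merged cell
        if index_num % 2 == 0:
            col_keys.append(text)
        else:
            col_values.append(text)
        previous = text
        index_num += 1
    return col_keys, col_values


# Hypothetical cell texts; 'Beijing' repeats because its cell spans three columns.
cells = ['Name', 'Zhang San', 'Gender', 'Male', 'Address', 'Beijing', 'Beijing', 'Beijing']
keys, values = split_keys_values(cells)
print(keys)    # ['Name', 'Gender', 'Address']
print(values)  # ['Zhang San', 'Male', 'Beijing']
```

One caveat of comparing only with the previous cell: two neighbouring cells that legitimately hold the same text would also be collapsed, which shifts every later key and value by one position.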
The final extraction effect is as follows
Batch extraction from Word files, saved to a CSV file
Once a single Word file can be processed, you can loop over all the Word files, extract their table contents, and save everything to a CSV file with pandas.
import docx
import pandas as pd

def GetData_frompath(doc_path):
    document = docx.Document(doc_path)
    col_keys = []    # field names
    col_values = []  # field values
    index_num = 0
    fore_str = ''    # text of the previous cell, used to skip repeats
    for table in document.tables:
        for row_index, row in enumerate(table.rows):
            for col_index, cell in enumerate(row.cells):
                if fore_str != cell.text:
                    if index_num % 2 == 0:
                        col_keys.append(cell.text)
                    else:
                        col_values.append(cell.text)
                    fore_str = cell.text
                    index_num += 1
    return col_keys, col_values

# wordlist_path is the list of .docx file paths; word_paths is the output folder
pd_data = []
for index, single_path in enumerate(wordlist_path):
    col_names, col_values = GetData_frompath(single_path)
    if index == 0:
        pd_data.append(col_names)  # the field names become the first row
    pd_data.append(col_values)

df = pd.DataFrame(pd_data)
# header=False keeps pandas from writing its own 0,1,2,... column labels above the field names
df.to_csv(word_paths + '/result.csv', encoding='utf_8_sig', index=False, header=False)
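As an alternative to storing the field names as the first data row, they can be handed to pandas as column labels. This is a sketch with made-up names and rows, not the code from the case; it writes to an in-memory buffer so that it can be run without touching any files.

```python
import io

import pandas as pd

# Hypothetical extraction results: one shared header and one value row per Word file.
col_names = ['Name', 'Gender', 'Age']
rows = [
    ['Zhang San', 'Male', '22'],
    ['Li Si', 'Female', '25'],
]

# Passing the header through columns= keeps it out of the data rows.
df = pd.DataFrame(rows, columns=col_names)

# Write to an in-memory buffer here; pass a file path instead to create a CSV file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, the encoding='utf_8_sig' argument from the case code can be kept; it adds a byte order mark that helps Excel recognise the file as UTF-8.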
Format of the certificate number and ID card number
When the generated CSV file is opened, the contact information and ID number columns are treated as numbers, which is not what we want. To display them in full, the values need to be stored as text.
The fix is to locate those elements and prepend a tab character '\t' to each one before saving:
col_values[7] = '\t'+col_values[7]
col_values[8] = '\t'+col_values[8]
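The same fix can be written as a small helper that takes the positions to protect instead of hard-coding them. The function name protect_as_text and the sample row are hypothetical, and the digit strings are placeholders rather than real numbers.

```python
def protect_as_text(values, indexes):
    """Return a copy of values with a tab prefixed to the entries at the given positions."""
    protected = list(values)
    for i in indexes:
        # Skip positions that do not exist and values that already carry the prefix.
        if i < len(protected) and not protected[i].startswith('\t'):
            protected[i] = '\t' + protected[i]
    return protected


# Hypothetical row; positions 1 and 2 hold long digit strings.
row = ['Zhang San', '13800000000', '110101199001010000']
safe_row = protect_as_text(row, indexes=(1, 2))
print(safe_row)
```

Returning a copy leaves the original list untouched, so calling the helper twice on the same row does not add a second tab.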
Source code
To get the source code used in this case, follow the WeChat official account Xiaozhang Python and reply with the keyword 210328.
Summary
This case uses only part of the python-docx library, mainly the basic operations on tables in Word. Readers who do clerical work may run into similar problems day to day, so I am sharing it here in the hope that it helps.
Well, that’s all for this article, thanks for reading, and we’ll see you next time!