In this and the next chapter, we’ll look at two examples of file types: Excel files and PDF, and give some general instructions for other file types.
Excel is much more difficult to process than CSV, JSON, and XML files described in the previous chapter. The following uses UNICEF’s 2014 report as an example to explain how to process Excel data.
Related articles:
Ten minutes for a quick introduction to Python
Python data processing (I) : Processing data in JSON, XML and CSV formats
Install the Python package
To parse Excel files, you need a third-party package called XLRD. We used PIP to install third-party packages. On the command line, type the following installation command:
pip install xlrd
Copy the code
If command not found is displayed, install PIP first. IO /en/stable/ I… .
Parsing Excel files
Sometimes the easiest way to extract data from an Excel worksheet is to find a better way to get it. Direct analysis sometimes does not solve the problem. Before parsing, see if you can find data in other formats, such as CSV, JSON, XML, etc., and if you can’t find data in Excel.
There are three main libraries for processing Excel files.
-
XLRD reads Excel files.
-
XLWT writes to an Excel file and formats it.
-
Xlutils A set of advanced Excel tools (XLRD and XLWT must be installed first).
You will need to install each of these libraries separately to use them. But this chapter will only use XLRD.
The following steps explain how to parse Excel files.
Enter the XLRD library first, then open the workbook and save it in the book variable.
import xlrd
book = xlrd.open_workbook('./resource/data.xlsx')
Copy the code
Unlike CSV, Excel workbooks can have multiple tabs (tabs) or worksheets (sheets). To retrieve the data, we need to find the worksheet that contains the target data.
If there are several worksheets, you can guess the index number, but not if there are many worksheets. So you should know book.sheet_by_name(somename) command, where somename is the name of the worksheet you want to access.
Let’s take a look at the names of the worksheets:
import xlrd
book = xlrd.open_workbook('./resource/data.xlsx')
for sheet in book.sheets():
print(sheet.name)
Copy the code
Book.sheets () lists all sheets, sheet.name Prints out the name of the sheet. Output:
Data Notes
Table 9
Copy the code
The work Table we are looking for is Table 9. So we add this name to the script:
import xlrd
book = xlrd.open_workbook('./resource/data.xlsx')
sheet = book.sheet_by_name('Table 9')
print(sheet)
Copy the code
The run outputs a value like this:
<xlrd.sheet.Sheet object at 0x106af8898>
Copy the code
To see what methods a sheet has, print(dir(sheet)). Find an nrows method from the printed result, sheet.nrows returns the number of rows in the sheet. We will use NRows to iterate over the contents of each row.
import xlrd
book = xlrd.open_workbook('./resource/data.xlsx')
sheet = book.sheet_by_name('Table 9')
for i in range(sheet.nrows):
print(sheet.row_values(i))
Copy the code
Running the program yields the following output:
Once you’ve taken the data from the table, it’s time to figure out how to format the data to extract useful information. There are many formats for extracting information. Here we use one of them:
{
u'Afghanistan': {
'child_labor': {
'female': [9.6, ' '].'male': [11.0, ' '].'total': [10.3, ' ']},'child_marriage': {
'married_by_15': [15.0, ' '].'married_by_18': [40.4, ' ']
}
},
u'Albania': {
'child_labor': {
'female': [9.4, u' '].'male': [14.4, u' '].'total': [12.0, u' ']},'child_marriage': {
'married_by_15': [0.2, ' '].'married_by_18': [9.6, ' ']}},... }Copy the code
How do YOU determine the starting line of useful data
Once you can read Excel data, you need to extract useful information from it. It is important to know how to extract key data from complex data.
Method 1: Use software to open Excel for intuitive judgment
First of all, the easiest way is to use the software to open the Excel file visually, as shown in the picture below:
The format we defined above is country key, so the country should be found first. Look at the Excel table and display the country data starting at row 15. The data of Child Labour and Child marriage are from column E to column N.
Method two: use the program to test many times
If you don’t want to use the first method, or if you don’t have software on your computer to open the file, try the second method: write code many times.
This method uses the counter principle. Print the first 10 lines to see if there is any data you want. If there are no more lines from 11 to 20, check each section until you determine the exact number of lines.
The code is as follows:
import xlrd
book = xlrd.open_workbook('./resource/data.xlsx')
sheet = book.sheet_by_name('Table 9')
count = 0
for i in range(sheet.nrows):
if count < 10:
row = sheet.row_values(i)
print(i, row)
count += 1
Copy the code
First print and check the first 10 lines, check the console output did not find the desired country data, continue to adjust the test:
import xlrd
book = xlrd.open_workbook('./resource/data.xlsx')
sheet = book.sheet_by_name('Table 9')
count = 0
for i in range(10, sheet.nrows):
if count < 10:
row = sheet.row_values(i)
print(i, row)
count += 1
Copy the code
We already know that the first 10 lines don’t have the data we want, so range is just changed to range(10, sheet.nrows) and prints from line 10, leaving the rest unchanged. Run the program again and get the following output:
You can see that the country name appears from line 14, and that’s the data we’re looking for.
Three, assembly data
After finding the desired data in rows and columns, you can write code to extract and assemble the data in the format defined previously.
import xlrd
import pprint
book = xlrd.open_workbook('./resource/data.xlsx')
sheet = book.sheet_by_name('Table 9')
Define a dictionary for storing data
data = {}
for i in range(14, sheet.nrows):
row = sheet.row_values(i)
Extract the country name
country = row[1]
Assemble data according to the given format
data[country] = {
'child_labor': {
'total': [row[4], row[5]],
'male': [row[6], row[7]],
'female': [row[8], row[9]],},'child_marriage': {
'married_by_15': [row[10], row[11]],
'married_by_18': [row[12], row[13]].}}# The last country on the road was Zimbabwe. Break out of the loop
if country == 'Zimbabwe':
break
# Print data
pprint.pprint(data)
Copy the code
The pprint format is more aesthetically pleasing when printing complex objects.
Four,
- Work with three Excel libraries: XLRD, XLWT, and Xlutils. Decide which libraries to use as needed.
- After parsing Excel, there are two ways to determine the location of the data you want: open the graphical interface for direct observation and filter the data step by step through the program. If you don’t know what commands an object has, can you print
dir(obj)
Where obj is the object that you want to view the relevant command. - Figure out what format you want to output in advance, and it will be easier to assemble your data with a format.
- The pprint format is more aesthetically pleasing when printing complex objects.
That’s the complete tutorial on parsing Excel data in Python. Stay tuned for the next section on working with PDF files and solving problems with Python.