In this and the next chapter, we’ll look at two examples of file types: Excel files and PDF, and give some general instructions for other file types.

Excel is much more difficult to process than CSV, JSON, and XML files described in the previous chapter. The following uses UNICEF’s 2014 report as an example to explain how to process Excel data.

Related articles:

Ten minutes for a quick introduction to Python

Python data processing (I) : Processing data in JSON, XML and CSV formats

Install the Python package

To parse Excel files, you need a third-party package called XLRD. We used PIP to install third-party packages. On the command line, type the following installation command:

pip install xlrd
Copy the code

If command not found is displayed, install PIP first. IO /en/stable/ I… .

Parsing Excel files

Sometimes the easiest way to extract data from an Excel worksheet is to find a better way to get it. Direct analysis sometimes does not solve the problem. Before parsing, see if you can find data in other formats, such as CSV, JSON, XML, etc., and if you can’t find data in Excel.

There are three main libraries for processing Excel files.

XLRD reads Excel files.
XLWT writes to an Excel file and formats it.
Xlutils A set of advanced Excel tools (XLRD and XLWT must be installed first).

You will need to install each of these libraries separately to use them. But this chapter will only use XLRD.

The following steps explain how to parse Excel files.

Enter the XLRD library first, then open the workbook and save it in the book variable.

import xlrd
book = xlrd.open_workbook('./resource/data.xlsx')
Copy the code

Unlike CSV, Excel workbooks can have multiple tabs (tabs) or worksheets (sheets). To retrieve the data, we need to find the worksheet that contains the target data.

If there are several worksheets, you can guess the index number, but not if there are many worksheets. So you should know book.sheet_by_name(somename) command, where somename is the name of the worksheet you want to access.

Let’s take a look at the names of the worksheets:

import xlrd

book = xlrd.open_workbook('./resource/data.xlsx')

for sheet in book.sheets():
    print(sheet.name)
Copy the code

Book.sheets () lists all sheets, sheet.name Prints out the name of the sheet. Output:

Data Notes
Table 9
Copy the code

The work Table we are looking for is Table 9. So we add this name to the script:

import xlrd

book = xlrd.open_workbook('./resource/data.xlsx')

sheet = book.sheet_by_name('Table 9')
print(sheet)
Copy the code

The run outputs a value like this:

<xlrd.sheet.Sheet object at 0x106af8898>
Copy the code

To see what methods a sheet has, print(dir(sheet)). Find an nrows method from the printed result, sheet.nrows returns the number of rows in the sheet. We will use NRows to iterate over the contents of each row.

import xlrd

book = xlrd.open_workbook('./resource/data.xlsx')

sheet = book.sheet_by_name('Table 9')

for i in range(sheet.nrows):
    print(sheet.row_values(i))
Copy the code

Running the program yields the following output:

Once you’ve taken the data from the table, it’s time to figure out how to format the data to extract useful information. There are many formats for extracting information. Here we use one of them:

{
  u'Afghanistan': {
		'child_labor': {
			'female': [9.6, ' '].'male': [11.0, ' '].'total': [10.3, ' ']},'child_marriage': {
	     'married_by_15': [15.0, ' '].'married_by_18': [40.4, ' ']
		} 
	}, 
  u'Albania': {
   	'child_labor': {
     	'female': [9.4, u' '].'male': [14.4, u' '].'total': [12.0, u' ']},'child_marriage': {
    	'married_by_15': [0.2, ' '].'married_by_18': [9.6, ' ']}},... }Copy the code

How do YOU determine the starting line of useful data

Once you can read Excel data, you need to extract useful information from it. It is important to know how to extract key data from complex data.

Method 1: Use software to open Excel for intuitive judgment

First of all, the easiest way is to use the software to open the Excel file visually, as shown in the picture below:

The format we defined above is country key, so the country should be found first. Look at the Excel table and display the country data starting at row 15. The data of Child Labour and Child marriage are from column E to column N.

Method two: use the program to test many times

If you don’t want to use the first method, or if you don’t have software on your computer to open the file, try the second method: write code many times.

This method uses the counter principle. Print the first 10 lines to see if there is any data you want. If there are no more lines from 11 to 20, check each section until you determine the exact number of lines.

The code is as follows:

import xlrd

book = xlrd.open_workbook('./resource/data.xlsx')

sheet = book.sheet_by_name('Table 9')

count = 0
for i in range(sheet.nrows):
    if count < 10:
        row = sheet.row_values(i)
        print(i, row)
    count += 1
Copy the code

First print and check the first 10 lines, check the console output did not find the desired country data, continue to adjust the test:

import xlrd

book = xlrd.open_workbook('./resource/data.xlsx')

sheet = book.sheet_by_name('Table 9')

count = 0
for i in range(10, sheet.nrows):
    if count < 10:
        row = sheet.row_values(i)
        print(i, row)
    count += 1
Copy the code

We already know that the first 10 lines don’t have the data we want, so range is just changed to range(10, sheet.nrows) and prints from line 10, leaving the rest unchanged. Run the program again and get the following output:

You can see that the country name appears from line 14, and that’s the data we’re looking for.

Three, assembly data

After finding the desired data in rows and columns, you can write code to extract and assemble the data in the format defined previously.

import xlrd
import pprint

book = xlrd.open_workbook('./resource/data.xlsx')

sheet = book.sheet_by_name('Table 9')

Define a dictionary for storing data
data = {}
for i in range(14, sheet.nrows):
    row = sheet.row_values(i)
    Extract the country name
    country = row[1]
    Assemble data according to the given format
    data[country] = {
        'child_labor': {
            'total': [row[4], row[5]],
            'male': [row[6], row[7]],
            'female': [row[8], row[9]],},'child_marriage': {
            'married_by_15': [row[10], row[11]],
            'married_by_18': [row[12], row[13]].}}# The last country on the road was Zimbabwe. Break out of the loop
    if country == 'Zimbabwe':
        break

# Print data
pprint.pprint(data)
Copy the code

The pprint format is more aesthetically pleasing when printing complex objects.

Four,

Work with three Excel libraries: XLRD, XLWT, and Xlutils. Decide which libraries to use as needed.
After parsing Excel, there are two ways to determine the location of the data you want: open the graphical interface for direct observation and filter the data step by step through the program. If you don’t know what commands an object has, can you printdir(obj)Where obj is the object that you want to view the relevant command.
Figure out what format you want to output in advance, and it will be easier to assemble your data with a format.
The pprint format is more aesthetically pleasing when printing complex objects.

That’s the complete tutorial on parsing Excel data in Python. Stay tuned for the next section on working with PDF files and solving problems with Python.

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Python data Processing (2) : Processing Excel data

Install the Python package

Parsing Excel files

How do YOU determine the starting line of useful data

Method 1: Use software to open Excel for intuitive judgment

Method two: use the program to test many times

Three, assembly data

Four,

Python data Processing (2) : Processing Excel data

Install the Python package

Parsing Excel files

How do YOU determine the starting line of useful data

Method 1: Use software to open Excel for intuitive judgment

Method two: use the program to test many times

Three, assembly data

Four,

Related Posts

Nio Tomcat source code

ES queries the document by ID

Java Application performance tuning is easy with this visual tool…