Pandas is used to read Excel data and convert it to a DataFrame. It is very difficult for Pandas to read data in an unstructured Excel. For example, it is necessary to read data in an unstructured Excel tree

The test Excel used in this article is as follows

The file is available at the end of this article

Specified column read

Normally, when we use the read_excel function to read Excel data, it starts from column A by default. However, for some Excel data, it usually does not start from column A. In this case, we need the parameter USecols to avoid processing

For example, if we read the Excel data above directly using read_excel(src_file), we would get the following result

We get a lot of unnamed columns and a lot of column data that we don’t need at all

At this point we can use usecols to specify which column data to read

from pathlib import Path
src_file = Path.cwd() /  'shipping_tables.xlsx'

df = pd.read_excel(src_file, header=1, usecols='B:F')
Copy the code

You can see that the generated DataFrame contains only the data we need, deliberately excluding notes columns and date fields

Usecols can take a range of Excel columns, such as B:F, and only read those columns. The header argument requires an integer that defines the header column, and its index starts at 0, so we pass in 1, the second row in Excel

We can also define a column as a list of numbers

df = pd.read_excel(src_file, header=1, usecols=[1.2.3.4.5])
Copy the code

You can also select the desired column data by column name

df = pd.read_excel(
    src_file,
    header=1,
    usecols=['item_type'.'order id'.'order date'.'state'.'priority'])
Copy the code

This is useful when the order of the columns changes but the column names remain the same

Finally, usecols can accept a callable function

def column_check(x) :
    if 'unnamed' in x.lower():
        return False
    if 'priority' in x.lower():
        return False
    if 'order' in x.lower():
        return True
    return True

df = pd.read_excel(src_file, header=1, usecols=column_check)
Copy the code

This function resolves each column by name and must return True or False for each column

Lambda expressions can also be used

cols_to_use = ['item_type'.'order id'.'order date'.'state'.'priority']
df = pd.read_excel(src_file,
                   header=1,
                   usecols=lambda x: x.lower() in cols_to_use)
Copy the code

Scope and form

In some cases, the data in Excel can be even more uncertain. In our Excel data, we have a table named ship_cost that we want to read. How do we get that

In this case, we can use OpenPyXL directly to parse the Excel file and convert the data to pandas DataFrame

Here is how to read Excel files using OpenPyXL (after installation) :

from openpyxl import load_workbook
import pandas as pd
from pathlib import Path
src_file = src_file = Path.cwd() / 'shipping_tables.xlsx'

wb = load_workbook(filename = src_file)
Copy the code

Look at all the sheets, get a sheet, get Excel range data

wb.sheetnames
sheet = wb['shipping_rates']
lookup_table = sheet.tables['ship_cost']
lookup_table.ref
Copy the code

Now that we know the data range to load, we need to convert the range to Pandas DataFrame

Get the data range
data = sheet[lookup_table.ref]
rows_list = []

# loop to get data
for row in data:
    cols = []
    for col in row:
        cols.append(col.value)
    rows_list.append(cols)


df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])
Copy the code

So we get clean table data

Ok, that’s all for today’s two tips. See you next time!

Need complete code and test Excel data, little in view, wechat access!