Pandas is used to read Excel data and convert it to a DataFrame. It is very difficult for Pandas to read data in an unstructured Excel. For example, it is necessary to read data in an unstructured Excel tree
The test Excel used in this article is as follows
The file is available at the end of this article
Specified column read
Normally, when we use the read_excel function to read Excel data, it starts from column A by default. However, for some Excel data, it usually does not start from column A. In this case, we need the parameter USecols to avoid processing
For example, if we read the Excel data above directly using read_excel(src_file), we would get the following result
We get a lot of unnamed columns and a lot of column data that we don’t need at all
At this point we can use usecols to specify which column data to read
from pathlib import Path
src_file = Path.cwd() / 'shipping_tables.xlsx'
df = pd.read_excel(src_file, header=1, usecols='B:F')
Copy the code
You can see that the generated DataFrame contains only the data we need, deliberately excluding notes columns and date fields
Usecols can take a range of Excel columns, such as B:F, and only read those columns. The header argument requires an integer that defines the header column, and its index starts at 0, so we pass in 1, the second row in Excel
We can also define a column as a list of numbers
df = pd.read_excel(src_file, header=1, usecols=[1.2.3.4.5])
Copy the code
You can also select the desired column data by column name
df = pd.read_excel(
src_file,
header=1,
usecols=['item_type'.'order id'.'order date'.'state'.'priority'])
Copy the code
This is useful when the order of the columns changes but the column names remain the same
Finally, usecols can accept a callable function
def column_check(x) :
if 'unnamed' in x.lower():
return False
if 'priority' in x.lower():
return False
if 'order' in x.lower():
return True
return True
df = pd.read_excel(src_file, header=1, usecols=column_check)
Copy the code
This function resolves each column by name and must return True or False for each column
Lambda expressions can also be used
cols_to_use = ['item_type'.'order id'.'order date'.'state'.'priority']
df = pd.read_excel(src_file,
header=1,
usecols=lambda x: x.lower() in cols_to_use)
Copy the code
Scope and form
In some cases, the data in Excel can be even more uncertain. In our Excel data, we have a table named ship_cost that we want to read. How do we get that
In this case, we can use OpenPyXL directly to parse the Excel file and convert the data to pandas DataFrame
Here is how to read Excel files using OpenPyXL (after installation) :
from openpyxl import load_workbook
import pandas as pd
from pathlib import Path
src_file = src_file = Path.cwd() / 'shipping_tables.xlsx'
wb = load_workbook(filename = src_file)
Copy the code
Look at all the sheets, get a sheet, get Excel range data
wb.sheetnames
sheet = wb['shipping_rates']
lookup_table = sheet.tables['ship_cost']
lookup_table.ref
Copy the code
Now that we know the data range to load, we need to convert the range to Pandas DataFrame
Get the data range
data = sheet[lookup_table.ref]
rows_list = []
# loop to get data
for row in data:
cols = []
for col in row:
cols.append(col.value)
rows_list.append(cols)
df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])
Copy the code
So we get clean table data
Ok, that’s all for today’s two tips. See you next time!
Need complete code and test Excel data, little in view, wechat access!