This is the 11th day of my participation in the First Challenge 2022

1. Preface

Following up on the previous article, where we installed and configured SQL Server and successfully tested connecting to the database and reading/writing data with Python.

Today we start on the actual requirement: there are a lot of Excel files whose data needs to be batch-processed and stored into different database tables.

2. Start using your brain

2.1 Breaking down and clarifying the requirements

1) What needs to be modified in the Excel data?

  • There is a column, DocketDate, stored in Excel as a short-date day number, which needs to be converted to a normal year/month/day format;

e.g. 44567 –> 2022/1/6

  • Some rows are duplicated on SOID and need to be deduplicated, keeping the most recent row according to DocketDate;

  • There is a column of date-time strings that needs to be converted to a plain date format.

e.g. 06/Jan/2022 12:27 –> 2022-1-6

Mainly involved: date format processing and duplicate-data handling

2) Does each Excel file correspond to a different table? Is the table name the same as the Excel attachment name?

  • Some Excel files correspond to the same table, some are separate
  • The table name is not the same as the Excel attachment name, but there is a correspondence

e.g. attachments test1 and test2 correspond to table testa, and attachment test3 to table testb

Mainly involved: merging data from multiple files

2.2 Installing third-party Packages

pip3 install sqlalchemy pymssql pandas xlrd xlwt
  • sqlalchemy: maps the table structure of a relational database onto Python objects, so database content can be handled by working with those objects;
  • pymssql: the Python driver used to connect to SQL Server; once connected it can also be used to read and write data;
  • pandas: provides many built-in methods for processing data;
  • xlrd / xlwt: read and write Excel files; pandas calls them under the hood when reading and writing Excel.

Import the packages:

import pandas as pd
from datetime import date, timedelta, datetime
import time
import os
from sqlalchemy import create_engine
import pymssql

2.3 Reading Excel Data

Reading the data is simple: just call pandas' read_excel function. If the file has special characteristics, such as a particular encoding or sheet layout, you can pass extra parameters to customize the read.

# Read excel data
def get_excel_data(filepath):
    data = pd.read_excel(filepath)
    return data
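If a file does need special handling, read_excel accepts extra parameters. A minimal sketch, where the sheet index and the dtype mapping are only illustrative assumptions:

# Hedged sketch: reading with optional parameters (sheet name and dtype mapping are illustrative)
def get_excel_data_custom(filepath):
    data = pd.read_excel(
        filepath,
        sheet_name=0,            # first sheet; pass a name such as 'Sheet1' if needed
        dtype={'SOID #': str}    # assumed column: keep order numbers as text
    )
    return data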

2.4 Special data processing

1) Converting the day number to a short date

In Excel itself this conversion is easy: select the data you want to convert, then choose Short Date in the number format dropdown on the Home tab.

At first I did not know the conversion rule and searched for a long time without finding a similar question or explanation. It clearly was not a Unix timestamp, although the values always seemed related to the date. Eventually I realized the number is a count of days, and once the start date is known, the rest of the conversions fall out by simple date arithmetic.

First we check for null values, then set the start date the day count is measured from, use the datetime module's timedelta to turn the day number into a time difference, and add it to the start date to get the date it represents.

# Convert an Excel day number to a short date
def days_to_date(days):
    # Handle nan values
    if pd.isna(days):
        return
    # 44567 -> 2022/1/6
    # Excel's day numbers are counted from 1899-12-30
    start = date(1899, 12, 30)
    # Convert days to a timedelta, which can be added directly to a date
    delta = timedelta(days)
    # Start date + time difference yields the corresponding short date
    offset = start + delta
    return offset
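A quick sanity check with the sample value from the requirement:

print(days_to_date(44567))  # 2022-01-06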

The tricky part is figuring out what start date the day count is measured from, but it is actually easy: Excel can already convert a day number to a short date for us, so we have one equation with a single unknown x (the start date) and can simply solve for it.

from datetime import date, timedelta

date_days = 44567
# Convert the day number to a date-type time interval
delta = timedelta(date_days)
# Known resulting date
result = date(2022, 1, 6)
# Calculate the unknown start date
x = result - delta
print(x)

# Output: 1899-12-30

2) Converting the English month in a date to numbers

At first I planned to use a regular expression to pull out the year, month and day and then map the English month name to a number. Later I found that datetime's parsing can recognize the English month in the date directly.

The original data looks like 06/Jan/2022 12:27 (numeric day / English month / numeric year, then hour:minute). Using the date formatting symbol reference table, each part maps to a strptime format code.

# Convert the original date string to a common date format
def date_to_common(time):
    # Handle nan values
    if pd.isna(time):
        return
    # 06/Jan/2022 12:27 -> 2022-01-06
    # print(time, ':', type(time))
    # Parse the string into a datetime object
    time_format = datetime.strptime(time, '%d/%b/%Y %H:%M')
    # Format it back out in the target date format
    common_date = datetime.strftime(time_format, '%Y-%m-%d')
    return common_date
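A quick check with the sample value (the output is zero-padded because of the %Y-%m-%d format):

print(date_to_common('06/Jan/2022 12:27'))  # 2022-01-06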



(Date formatting symbol reference table; the codes used here are %d = day of month, %b = abbreviated English month, %Y = four-digit year, %H = hour, %M = minute. Source: CSDN.)

3) Deduplicating by order number SOID

Besides deduplicating on the specified column, we also need to keep the most recent row by date.

pandas' sort_values function sorts all rows in ascending order by the date column. Then drop_duplicates deduplicates on the SOID column, with keep set to 'last' so that the last (most recent) row of each group of duplicates is retained.

The code is as follows:

# For duplicate SOIDs, keep only the most recent row by date
def delete_repeat(data):
    # Sort by 'Docket Rec.Date & Time'; the default ascending order means the last duplicate is the most recent
    data.sort_values(by=['Docket Rec.Date & Time'], inplace=True)
    # Drop duplicate rows by SOID, keeping the last one
    data.drop_duplicates(subset=['SOID #'], keep='last', inplace=True)
    return data
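The complete calling code in section 2.6 refers to a process_data function that ties these helpers together but is not shown there. Here is a minimal sketch, assuming 'Store In Date Time' holds the Excel day numbers and 'Docket Rec.Date & Time' holds the English-month date strings; adjust the column names to your actual data:

# Hedged sketch of process_data: apply the special-format fixes, then deduplicate.
# The column-to-format mapping below is an assumption, not taken from the original article.
def process_data(data):
    # Excel day number -> short date
    if 'Store In Date Time' in data.columns:
        data['Store In Date Time'] = data['Store In Date Time'].apply(days_to_date)
    # English-month date string -> yyyy-mm-dd
    if 'Docket Rec.Date & Time' in data.columns:
        data['Docket Rec.Date & Time'] = data['Docket Rec.Date & Time'].apply(date_to_common)
    # Keep only the most recent row per SOID
    data = delete_repeat(data)
    return data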

2.5 Other Requirements

Multiple Excel files for a single database table

You can write a dictionary that maps each database table to the names of its Excel files, then merge each group and store it into the corresponding table (or process the data first and merge afterwards).

  • Merge Excel tables of the same type
# Merge Excel files that belong to the same table; elist is the list of file names
def merge_excel(elist, files_path):
    data_list = [get_excel_data(files_path + i) for i in elist]
    data = pd.concat(data_list)
    return data

Pass in the list of file names belonging to one table (elist) and the absolute/relative path of the folder holding the data (files_path). Joining the folder path with each file name gives the path of each Excel file, which is passed to get_excel_data to read the data.

A list comprehension reads each file into a DataFrame, and pd.concat merges them into one.

  • Storing the data into SQL Server
# Initialize the database connection engine
# create_engine("database type+driver://username:password@IP:port/database", other parameters)
engine = create_engine("mssql+pymssql://sa:123456@localhost/study?charset=GBK")

# Store data
def data_to_sql(data, table_name, columns):
    # Keep only the specified columns before storing
    data1 = data[columns]

    # First parameter: table name
    # Second parameter: database connection engine
    # index=False: do not store the DataFrame index
    # if_exists='append': append data if the table already exists
    t1 = time.time()  # timestamp in seconds
    print('Data insertion start time: {0}'.format(t1))
    data1.to_sql(table_name, engine, index=False, if_exists='append')
    t2 = time.time()  # timestamp in seconds
    print('Data insertion end time: {0}'.format(t2))
    print('Inserted %d rows successfully,' % len(data1), 'elapsed time: %.5f seconds.' % (t2 - t1))

When connecting to SQL Server with sqlalchemy + pymssql, remember to specify the database encoding; the database created here uses SQL Server's default GBK encoding. For SQL Server installation and usage, see the earlier article on downloading and installing SQL Server and SSMS on Windows and reading/writing data from Python.
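To confirm the insert worked, the same engine can be used to read the data back. A minimal sketch (testa is just the example table name used in the next section):

# Hedged check: count the rows that landed in the example table
check = pd.read_sql('SELECT COUNT(*) AS cnt FROM testa', engine)
print(check)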

2.6 Complete call code

# Batch process all Excel data
# Data files are stored in a specified directory
files_path = './data/'
bf_path = './process/'

# Get all files in the data directory
# files = os.listdir(files_path)
# files

# Table name -> list of Excel file names
data_dict = {
    'testa': ['test1.xls', 'test2.xls'],
    'testb': ['test3.xls'],
    'testc': ['test4.xls'],
}

# Only the specified columns of each attachment are stored
columns_a = ['S/No', 'SOID #', 'Current MileStone', 'Store In Date Time']
columns_b = ['Received Part Serial No', 'Received Product Category', 'Received Part Desc']
columns_c = ['From Loc', 'Orig Dispoition Code']

columns = [columns_a, columns_b, columns_c]
flag = 0  # column selection flag

# Traverse the dictionary, merge the related Excel files, then process the data and store it in SQL
for k, v in data_dict.items():
    table_name = k
    data = merge_excel(v, files_path)
    # 1. Process the data
    if 'SOID #' not in data.columns:
        # No special columns to process: just drop duplicate rows before storing
        data.drop_duplicates(inplace=True)
    else:
        # Special processing of the data
        data = process_data(data)
    # 2. Store the data
    # Keep a local copy as a backup
    data.to_excel(bf_path + table_name + '.xls')
    # Store into the database
    data_to_sql(data, table_name, columns[flag])
    flag += 1

This article has been liked 20 times, and all of its related data and code will be open-sourced in the near future.

3. The last word

Requirement submissions are welcome, but they must include: a clear requirement description (listed point by point) + sample data (sent as a package).

Zip
---- data file
---- requirement description file
---- other supplementary files

You can send the requirement, together with its description and sample data as a compressed package, to my email [email protected].

I'm at home for the New Year lately and have plenty of time, so feel free to reach out and exchange ideas ~

Persistence and hard work bring results.

Ideas are complex, implementation is fun, as long as you don’t give up, there will be fame.

— Old Watch doggerel

Likes, views, comments, and shares — all four kinds of support are appreciated; original writing is not easy. OK, see you next time. I love cats and love technology, and love Sisi even more — your old cousin Da Mian ଘ(˙꒳˙)ଓ Di Di