Original link:tecdat.cn/?p=17748

Original source:Tuo End number according to the tribe public number

 

 

On my data science learning journey, I often work with time series data sets from my daily work and make predictions based on them.

I will go through the following steps:

Exploratory Data Analysis (EDA)

  • Problem Definition (What are we trying to solve)
  • Variable identification (what data do we have)
  • Univariate analysis (understanding each field in the data set)
  • Multivariate analysis (understanding the interaction between different domains and objectives)
  • Missing value handling
  • Outlier processing
  • The variable transformation

Predictive modeling

  • LSTM
  • XGBoost

Problem definition

We provide the following information for the store in two different tables:

  • Store: The ID of each store
  • Sales: Turnover on a given date (our target variable)
  • Customers: The number of customers on a particular date
  • StateHoliday: holiday
  • -blair: SchoolHoliday
  • StoreType: 4 different stores: A, B, C, AND D
  • CompetitionDistance: distance (in meters) to the nearest competitor’s store
  • CompetitionOpenSince [month/year] : Provides approximate year and month of the most recent competitor opening
  • Promotion: Whether or not it is on sale that day
  • Promo2: Promo2 is a continuous and continuous promotion by some stores: 0 = the store is not participating, 1 = the store is participating
  • PromoInterval: Describes the sequential interval in which a promotion starts and specifies the month in which the promotion restarts.

Using all this information, we forecast sales for the next six weeks.

 

Let's import EDA libraries: Import python as np # import python as pd # CSV file I/O import (for example, pd.read_csv) import matplotlib.pyplot as PLT import seaborn as SNS from datetime import datetime Plt.style. use("ggplot") # plot # import training and test files: train_df = pd.read_csv(".. /Data/train.csv") test_df = pd.read_csv(".. Print (" in the training set, we have ", train_df.shape[0], "observation values and ", train_df.shape[1], column/variable ") Print (" In test set, we have ", test_df.shape[0], "and ", test_df.shape[1]," column/variable. Print (" In store set, we have ", store_df.shape[0], "and ", store_df.shape[1]," column/variable. )Copy the code

In the training set, we had 1,017209 observations and 9 columns/variables. In the test set, we had 41088 observations and 8 columns/variables. In the store set, we have 1115 observations and 10 columns/variables.

First let’s clean up the training data set.

 

Train_df.head ().append(train_df.tail())Copy the code

 

train_df.isnull().all()
Out[5]:

Store            False
DayOfWeek        False
Date             False
Sales            False
Customers        False
Open             False
Promo            False
StateHoliday     False
SchoolHoliday    False
dtype: bool
Copy the code

Let’s start with the first variable -> sales volume

Opened_sales = (train_df[train_df.open == 1) Count 422307.000000 mean 6951.782199 STD 3101.768685 min 133.000000 25% 4853.000000 50% 6367.000000 75% 8355.000000 Max 41551.000000 Name: Sales, DType: float64 <matplotlib.axes._subplots.AxesSubplot at 0x7f7c38FA6588 >Copy the code

 

Look at the customer variable

In [9]:

train_df.Customers.describe()
Out[9]:

count    1.017209e+06
mean     6.331459e+02
std      4.644117e+02
min      0.000000e+00
25%      4.050000e+02
50%      6.090000e+02
75%      8.370000e+02
max      7.388000e+03
Name: Customers, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x7f7c3565d240>
Copy the code

 
Copy the code
train_df[(train_df.Customers > 6000)]
Copy the code

 

Let's look at the vacation variable.

 
Copy the code
train_df.StateHoliday.value_counts()
Copy the code
 
Copy the code
0    855087
0    131072
a     20260
b      6690
c      4100
Name: StateHoliday, dtype: int64
Copy the code

 

train_df.StateHoliday_cat.count()
Copy the code

 

1017209
Copy the code

 

train_df.tail()
Copy the code

 
Copy the code
Train_df.isnull ().all() # check for missing Out[18]: Store False DayOfWeek False Date False Sales False Customers False Open False Promo False SchoolHoliday False StateHoliday_cat False dtype: boolCopy the code

Let’s move on to store analysis

 

store_df.head().append(store_df.tail())
Copy the code

 

# Missing data: Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth 0.269058 0.000000 0.000000 0.000000 31.748879 CompetitionOpenSinceYear 31.748879 Promo2 0.000000 Promo2SinceWeek 48.789238 Promo2SinceYear 48.789238 PromoInterval 48.789238 DType: Float64 In [21]:Copy the code

Let’s start with the missing data. The first is CompetitionDistance


store_df.CompetitionDistance.plot.box() 
Copy the code

Let me look at the outliers, so we can choose between the mean and the median to populate NaN

 

Lack of data because stores don't compete. Therefore, I recommend padding the missing values with zeros.Copy the code

store_df["CompetitionOpenSinceMonth"].fillna(0, inplace = True)
Copy the code

Let’s take a look at the promotion.

 

store_df.groupby(by = "Promo2", axis = 0).count() 
Copy the code

 

If there is no promotion, replace NaN in “Promotion” with zero

We merge the store data with the training set data, and then continue our analysis.

First, let’s compare stores by sales volume, customers, etc.

 

F, axplots = plt.plots (2, 3, figsize = (20,10)) plt.plot (2, 3, figsize = (20,10)) plt.plot (2, 3, figsize = (20,10))Copy the code

 

As can be seen from the figure, StoreType A has the most stores, sales and customers. However, StoreType D has the highest average spending per customer. StoreType B, with just 17 stores, has the most average customers.

 

We look at trends year by year.

 

SNS. Factorplot (data = train_store_df, # We can see seasonality but not trend. The sales remain the same per year < seaborn. Axisgrid. FacetGrid at 0 x7f7c350e0c50 >Copy the code



 

Let’s look at the correlation diagram.

  "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Promo2

<matplotlib.axes._subplots.AxesSubplot at 0x7f7c33d79c18>
Copy the code

 

 

 

We can get the correlation:

  • Customer and Sales (0.82)
  • Promotion and sales (0,82)
  • Average customer sales vs. sales (0,28)
  • Store category vs. average customer sales (0,44)

My analysis:

  • Store category A has the most sales and customers.
  • Store category B had the lowest average sales per customer. Therefore, I think customers only come for small goods.
  • Store category D has the largest number of shopping carts.
  • The sale is only available on weekdays.
  • Customers tend to buy more on Mondays (sales) and Sundays (no sales).
  • I don’t see any annual trends. Seasonal patterns only.

Most welcome insight

1. Use LSTM and PyTorch for time series prediction in Python

2. Long and short-term memory model LSTM is used in Python for time series prediction analysis

3. Time series (ARIMA, exponential smoothing) analysis using R language

4. R language multivariate Copula – Garch – model time series prediction

5. R language Copulas and financial time series cases

6. Use R language random wave model SV to process random fluctuations in time series

7. Tar threshold autoregressive model for R language time series

8. R language K-Shape time series clustering method for stock price time series clustering

Python3 uses ARIMA model for time series prediction