Original link:tecdat.cn/?p=17748
Original source:Tuo End number according to the tribe public number
On my data science learning journey, I often work with time series data sets from my daily work and make predictions based on them.
I will go through the following steps:
Exploratory Data Analysis (EDA)
- Problem Definition (What are we trying to solve)
- Variable identification (what data do we have)
- Univariate analysis (understanding each field in the data set)
- Multivariate analysis (understanding the interaction between different domains and objectives)
- Missing value handling
- Outlier processing
- The variable transformation
Predictive modeling
- LSTM
- XGBoost
Problem definition
We provide the following information for the store in two different tables:
- Store: The ID of each store
- Sales: Turnover on a given date (our target variable)
- Customers: The number of customers on a particular date
- StateHoliday: holiday
- -blair: SchoolHoliday
- StoreType: 4 different stores: A, B, C, AND D
- CompetitionDistance: distance (in meters) to the nearest competitor’s store
- CompetitionOpenSince [month/year] : Provides approximate year and month of the most recent competitor opening
- Promotion: Whether or not it is on sale that day
- Promo2: Promo2 is a continuous and continuous promotion by some stores: 0 = the store is not participating, 1 = the store is participating
- PromoInterval: Describes the sequential interval in which a promotion starts and specifies the month in which the promotion restarts.
Using all this information, we forecast sales for the next six weeks.
Let's import EDA libraries: Import python as np # import python as pd # CSV file I/O import (for example, pd.read_csv) import matplotlib.pyplot as PLT import seaborn as SNS from datetime import datetime Plt.style. use("ggplot") # plot # import training and test files: train_df = pd.read_csv(".. /Data/train.csv") test_df = pd.read_csv(".. Print (" in the training set, we have ", train_df.shape[0], "observation values and ", train_df.shape[1], column/variable ") Print (" In test set, we have ", test_df.shape[0], "and ", test_df.shape[1]," column/variable. Print (" In store set, we have ", store_df.shape[0], "and ", store_df.shape[1]," column/variable. )Copy the code
In the training set, we had 1,017209 observations and 9 columns/variables. In the test set, we had 41088 observations and 8 columns/variables. In the store set, we have 1115 observations and 10 columns/variables.
First let’s clean up the training data set.
Train_df.head ().append(train_df.tail())Copy the code
train_df.isnull().all()
Out[5]:
Store False
DayOfWeek False
Date False
Sales False
Customers False
Open False
Promo False
StateHoliday False
SchoolHoliday False
dtype: bool
Copy the code
Let’s start with the first variable -> sales volume
Opened_sales = (train_df[train_df.open == 1) Count 422307.000000 mean 6951.782199 STD 3101.768685 min 133.000000 25% 4853.000000 50% 6367.000000 75% 8355.000000 Max 41551.000000 Name: Sales, DType: float64 <matplotlib.axes._subplots.AxesSubplot at 0x7f7c38FA6588 >Copy the code
Look at the customer variable
In [9]:
train_df.Customers.describe()
Out[9]:
count 1.017209e+06
mean 6.331459e+02
std 4.644117e+02
min 0.000000e+00
25% 4.050000e+02
50% 6.090000e+02
75% 8.370000e+02
max 7.388000e+03
Name: Customers, dtype: float64
<matplotlib.axes._subplots.AxesSubplot at 0x7f7c3565d240>
Copy the code
Copy the code
train_df[(train_df.Customers > 6000)]
Copy the code
Let's look at the vacation variable.
Copy the code
train_df.StateHoliday.value_counts()
Copy the code
Copy the code
0 855087
0 131072
a 20260
b 6690
c 4100
Name: StateHoliday, dtype: int64
Copy the code
train_df.StateHoliday_cat.count()
Copy the code
1017209
Copy the code
train_df.tail()
Copy the code
Copy the code
Train_df.isnull ().all() # check for missing Out[18]: Store False DayOfWeek False Date False Sales False Customers False Open False Promo False SchoolHoliday False StateHoliday_cat False dtype: boolCopy the code
Let’s move on to store analysis
store_df.head().append(store_df.tail())
Copy the code
# Missing data: Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth 0.269058 0.000000 0.000000 0.000000 31.748879 CompetitionOpenSinceYear 31.748879 Promo2 0.000000 Promo2SinceWeek 48.789238 Promo2SinceYear 48.789238 PromoInterval 48.789238 DType: Float64 In [21]:Copy the code
Let’s start with the missing data. The first is CompetitionDistance
store_df.CompetitionDistance.plot.box()
Copy the code
Let me look at the outliers, so we can choose between the mean and the median to populate NaN
Lack of data because stores don't compete. Therefore, I recommend padding the missing values with zeros.Copy the code
store_df["CompetitionOpenSinceMonth"].fillna(0, inplace = True)
Copy the code
Let’s take a look at the promotion.
store_df.groupby(by = "Promo2", axis = 0).count()
Copy the code
If there is no promotion, replace NaN in “Promotion” with zero
We merge the store data with the training set data, and then continue our analysis.
First, let’s compare stores by sales volume, customers, etc.
F, axplots = plt.plots (2, 3, figsize = (20,10)) plt.plot (2, 3, figsize = (20,10)) plt.plot (2, 3, figsize = (20,10))Copy the code
As can be seen from the figure, StoreType A has the most stores, sales and customers. However, StoreType D has the highest average spending per customer. StoreType B, with just 17 stores, has the most average customers.
We look at trends year by year.
SNS. Factorplot (data = train_store_df, # We can see seasonality but not trend. The sales remain the same per year < seaborn. Axisgrid. FacetGrid at 0 x7f7c350e0c50 >Copy the code
Let’s look at the correlation diagram.
"CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Promo2
<matplotlib.axes._subplots.AxesSubplot at 0x7f7c33d79c18>
Copy the code
We can get the correlation:
- Customer and Sales (0.82)
- Promotion and sales (0,82)
- Average customer sales vs. sales (0,28)
- Store category vs. average customer sales (0,44)
My analysis:
- Store category A has the most sales and customers.
- Store category B had the lowest average sales per customer. Therefore, I think customers only come for small goods.
- Store category D has the largest number of shopping carts.
- The sale is only available on weekdays.
- Customers tend to buy more on Mondays (sales) and Sundays (no sales).
- I don’t see any annual trends. Seasonal patterns only.
Most welcome insight
1. Use LSTM and PyTorch for time series prediction in Python
2. Long and short-term memory model LSTM is used in Python for time series prediction analysis
3. Time series (ARIMA, exponential smoothing) analysis using R language
4. R language multivariate Copula – Garch – model time series prediction
5. R language Copulas and financial time series cases
6. Use R language random wave model SV to process random fluctuations in time series
7. Tar threshold autoregressive model for R language time series
8. R language K-Shape time series clustering method for stock price time series clustering
Python3 uses ARIMA model for time series prediction