Original link:tecdat.cn/?p=17748

Original source:Tuo End number according to the tribe public number

On my data science learning journey, I often work with time series data sets from my daily work and make predictions based on them.

I will go through the following steps:

Exploratory Data Analysis (EDA)

Problem Definition (What are we trying to solve)
Variable identification (what data do we have)
Univariate analysis (understanding each field in the data set)
Multivariate analysis (understanding the interaction between different domains and objectives)
Missing value handling
Outlier processing
The variable transformation

Predictive modeling

LSTM
XGBoost

Problem definition

We provide the following information for the store in two different tables:

Store: The ID of each store
Sales: Turnover on a given date (our target variable)
Customers: The number of customers on a particular date
StateHoliday: holiday
-blair: SchoolHoliday
StoreType: 4 different stores: A, B, C, AND D
CompetitionDistance: distance (in meters) to the nearest competitor’s store
CompetitionOpenSince [month/year] : Provides approximate year and month of the most recent competitor opening
Promotion: Whether or not it is on sale that day
Promo2: Promo2 is a continuous and continuous promotion by some stores: 0 = the store is not participating, 1 = the store is participating
PromoInterval: Describes the sequential interval in which a promotion starts and specifies the month in which the promotion restarts.

Using all this information, we forecast sales for the next six weeks.

Let's import EDA libraries: Import python as np # import python as pd # CSV file I/O import (for example, pd.read_csv) import matplotlib.pyplot as PLT import seaborn as SNS from datetime import datetime Plt.style. use("ggplot") # plot # import training and test files: train_df = pd.read_csv(".. /Data/train.csv") test_df = pd.read_csv(".. Print (" in the training set, we have ", train_df.shape[0], "observation values and ", train_df.shape[1], column/variable ") Print (" In test set, we have ", test_df.shape[0], "and ", test_df.shape[1]," column/variable. Print (" In store set, we have ", store_df.shape[0], "and ", store_df.shape[1]," column/variable. )Copy the code

In the training set, we had 1,017209 observations and 9 columns/variables. In the test set, we had 41088 observations and 8 columns/variables. In the store set, we have 1115 observations and 10 columns/variables.

First let’s clean up the training data set.

Train_df.head ().append(train_df.tail())Copy the code

train_df.isnull().all()
Out[5]:

Store            False
DayOfWeek        False
Date             False
Sales            False
Customers        False
Open             False
Promo            False
StateHoliday     False
SchoolHoliday    False
dtype: bool
Copy the code

Let’s start with the first variable -> sales volume

Opened_sales = (train_df[train_df.open == 1) Count 422307.000000 mean 6951.782199 STD 3101.768685 min 133.000000 25% 4853.000000 50% 6367.000000 75% 8355.000000 Max 41551.000000 Name: Sales, DType: float64 <matplotlib.axes._subplots.AxesSubplot at 0x7f7c38FA6588 >Copy the code

Look at the customer variable

In [9]:

train_df.Customers.describe()
Out[9]:

count    1.017209e+06
mean     6.331459e+02
std      4.644117e+02
min      0.000000e+00
25%      4.050000e+02
50%      6.090000e+02
75%      8.370000e+02
max      7.388000e+03
Name: Customers, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x7f7c3565d240>
Copy the code

 
Copy the code

train_df[(train_df.Customers > 6000)]
Copy the code

Let's look at the vacation variable.

 
Copy the code

train_df.StateHoliday.value_counts()
Copy the code

 
Copy the code

0    855087
0    131072
a     20260
b      6690
c      4100
Name: StateHoliday, dtype: int64
Copy the code

train_df.StateHoliday_cat.count()
Copy the code

1017209
Copy the code

train_df.tail()
Copy the code

 
Copy the code

Train_df.isnull ().all() # check for missing Out[18]: Store False DayOfWeek False Date False Sales False Customers False Open False Promo False SchoolHoliday False StateHoliday_cat False dtype: boolCopy the code

Let’s move on to store analysis

store_df.head().append(store_df.tail())
Copy the code

# Missing data: Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth 0.269058 0.000000 0.000000 0.000000 31.748879 CompetitionOpenSinceYear 31.748879 Promo2 0.000000 Promo2SinceWeek 48.789238 Promo2SinceYear 48.789238 PromoInterval 48.789238 DType: Float64 In [21]:Copy the code

Let’s start with the missing data. The first is CompetitionDistance


store_df.CompetitionDistance.plot.box() 
Copy the code

Let me look at the outliers, so we can choose between the mean and the median to populate NaN

Lack of data because stores don't compete. Therefore, I recommend padding the missing values with zeros.Copy the code


store_df["CompetitionOpenSinceMonth"].fillna(0, inplace = True)
Copy the code

Let’s take a look at the promotion.

store_df.groupby(by = "Promo2", axis = 0).count() 
Copy the code

If there is no promotion, replace NaN in “Promotion” with zero

We merge the store data with the training set data, and then continue our analysis.

First, let’s compare stores by sales volume, customers, etc.

F, axplots = plt.plots (2, 3, figsize = (20,10)) plt.plot (2, 3, figsize = (20,10)) plt.plot (2, 3, figsize = (20,10))Copy the code

As can be seen from the figure, StoreType A has the most stores, sales and customers. However, StoreType D has the highest average spending per customer. StoreType B, with just 17 stores, has the most average customers.

We look at trends year by year.

SNS. Factorplot (data = train_store_df, # We can see seasonality but not trend. The sales remain the same per year < seaborn. Axisgrid. FacetGrid at 0 x7f7c350e0c50 >Copy the code

Let’s look at the correlation diagram.

  "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Promo2

<matplotlib.axes._subplots.AxesSubplot at 0x7f7c33d79c18>
Copy the code

We can get the correlation:

Customer and Sales (0.82)
Promotion and sales (0,82)
Average customer sales vs. sales (0,28)
Store category vs. average customer sales (0,44)

My analysis:

Store category A has the most sales and customers.
Store category B had the lowest average sales per customer. Therefore, I think customers only come for small goods.
Store category D has the largest number of shopping carts.
The sale is only available on weekdays.
Customers tend to buy more on Mondays (sales) and Sundays (no sales).
I don’t see any annual trends. Seasonal patterns only.

Most welcome insight

1. Use LSTM and PyTorch for time series prediction in Python

2. Long and short-term memory model LSTM is used in Python for time series prediction analysis

3. Time series (ARIMA, exponential smoothing) analysis using R language

4. R language multivariate Copula – Garch – model time series prediction

5. R language Copulas and financial time series cases

6. Use R language random wave model SV to process random fluctuations in time series

7. Tar threshold autoregressive model for R language time series

8. R language K-Shape time series clustering method for stock price time series clustering

Python3 uses ARIMA model for time series prediction

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Python performs LSTM and XgBoost sales time series modeling prediction analysis on store data

Original link:tecdat.cn/?p=17748

Original source:Tuo End number according to the tribe public number

Exploratory Data Analysis (EDA)

Predictive modeling

Problem definition

My analysis:

Python performs LSTM and XgBoost sales time series modeling prediction analysis on store data

Original link:tecdat.cn/?p=17748

Original source:Tuo End number according to the tribe public number

Exploratory Data Analysis (EDA)

Predictive modeling

Problem definition

My analysis:

Related Posts

Prometheus AlertManager Alarm Convergence Building Monitoring System X (Advanced)

Git general operation and parsing based on fork

MAC OS Go upgrade 1.12 to 1.14