Original link:tecdat.cn/?p=9024

Original source:Tuo End number according to the tribe public number

 

 

Time series are modeled with GAM

I have prepared a file containing four time series of electricity consumption for analysis. Data manipulation will be done by the data.table package.

Read the mentioned smart meter data into data.table.

DT <- as.data.table(read_feather("DT_4_ind"))
Copy the code

GAM regression model was used. Convert the characters of the workday to integers and re-encode the workday: 1 using the function in the Recode package. On Monday… 7 Sunday.

DT[, week_num := as.integer(car::recode(week,
    "'Monday'='1'; 'Tuesday'='2'; 'Wednesday'='3'; 'Thursday'='4'; 'Friday'='5'; 'Saturday'='6'; 'Sunday'='7'")))Copy the code

Store the information in date variables to simplify your work.

n_type <- unique(DT[, type])
n_date <- unique(DT[, date])
n_weekdays <- unique(DT[, week])
period <- 48
Copy the code

Let’s take a look at some data on electricity consumption and analyze it.

data_r <- DT[(type == n_type[1] & date %in% n_date[57:70])]
 
ggplot(data_r, aes(date_time, value)) +
  geom_line() +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid.minor = element_line(colour = "grey90"),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold")) +
  labs(x = "Date", y = "Load (kW)")
Copy the code

Two main seasonalities can be seen in the plotted time series: daily and weekly. We have 48 measurements in a day and 7 days in a week, so this will be the independent variable that we use to model the dependent variable, the power load.

Training our first GAM. The independent variables are modeled by the smoothing function S, and the cubic spline regression is used for daily seasonality, and the p-spline is used for weekly seasonality.

gam_1 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
               s(Weekly, bs = "ps", k = 7),
             data = matrix_gam,
             family = gaussian)
Copy the code

The first is visualization.

layout(matrix(1:2, nrow = 1))
plot(gam_1, shade = TRUE)
Copy the code

Here we can see the effect of variables on power load. On the left, daytime load peaks around 3 p.m. In the picture on the right, we can see that the load decreases over the weekend.

Let’s use the summary function to diagnose the first model.

## ## Family: gaussian ## Link function: identity ## ## Formula: ## Load ~ s(Daily, bs = "cr", k = period) + s(Weekly, bs = "ps", ## k = 7) ## ## Parametric coefficients: # # Estimate Std. Error t value (Pr > | | t) # # (Intercept) 2731.67 18.88 144.7 < 2 * * * e - 16 # # - # # Signif. Codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '1 ## ## Approximate significance of smooth terms: ## edf ref. df p-value ## s(Daily) 10.159 12.688 119.8 <2e-16 *** ## s(Weekly) 5.311 5.758 130.3 <2e-16 *** ## - ## Signif. codes: 0 '* * *' 0.001 '* *' 0.01 '*' 0.05 ' ' ' '1 # # # # R - 0.1 sq. (adj) = 0.772 Deviance explained 77.7% = # # GCV e+05 = 2.4554 Scale est. = 2.3953e+05 n = 672Copy the code

EDF: Estimated degrees of freedom – can be interpreted as smoothing for a given variable (higher EDF values indicate more complex splines). P-value: The statistical significance of a given variable over a dependent variable, tested by the F-test (the lower the better). Adjusted R squared (the higher the better). We can see that the r-sq. (adj.) value is a little low.

Let’s plot the fitting values:

We need to include the interaction of the two independent variables in the model.

The first interaction type uses a smoothing function for both variables.

gam_2 <- gam(Load ~ s(Daily, Weekly),
          
 
summary(gam_2)$r.sq
Copy the code
# # 0.9352108 [1]Copy the code

The R squared value is much better.

summary(gam_2)$s.table
Copy the code
## edf ref. df F p-value ## s(Daily,Weekly) 28.7008 28.99423 334.4754 0Copy the code

That seems to be good, too. P is zero, which means the independent variable is important. Fitting value diagram:

Now, let’s try the tensor product interaction described above. This can be done with function, or you can define basic functions.

# # 0.9268452 [1]Copy the code

Similar to the previous model gam_2.

summary(gam_3)$s.table
Copy the code
## edf ref. df F p-value ## te(Daily,WeeklyCopy the code

Very similar results. Let’s look at the fitting values:

There is only a slight difference compared to the GAM_2 model and it looks like TE fits better.

# # 0.9727604 [1]Copy the code
summary(gam_4)$sp.criterion
Copy the code
# # GCV. Cp # # 34839.46Copy the code
summary(gam_4)$s.table
Copy the code
## edf ref. df F p-value ## te(Daily,Weekly) 119.4117 149.6528 160.2065 0Copy the code

We can see here that R squared goes up. Let’s plot the fitting values:

This seems much better than the GAM_3 model.

# # 0.965618 [1]Copy the code
summary(gam_4_fx)$s.table
Copy the code
## edf ref. df F p-value ## te(Daily,Weekly) 335 335 57.25389 5.289648e-199Copy the code

We can see that R squared is lower than the model GAM_4 because we overfit the model. Prove that the GCV program (lambda and EDF estimates) works properly.

So let’s try the TI approach in a case (model).

# # 0.9717469 [1]Copy the code
summary(gam_5)$sp.criterion
Copy the code
# # GCV. Cp # # 35772.35Copy the code
summary(gam_5)$s.table
Copy the code
## edf ref. df F p-value ## s(Daily) 22.583649 27.964970 444.19962 0 ## s(Weekly) 5.914531 5.995934 1014.72482 0 ## Ti (Daily,Weekly) 85.310314 110.828814 41.22288 0Copy the code

And then t2.

# # 0.9738273 [1]Copy the code
summary(gam_6)$sp.criterion
Copy the code
# # GCV. Cp # # 32230.68Copy the code
summary(gam_6)$s.table
Copy the code
## edf ref. df F p-value ## t2(Daily,Weekly) 98.12005 120.2345 86.70754 0Copy the code

I also output the GCV score values for the last three models, which are also good criteria for selecting the best model in a set of fitting models. It can be seen that GCV value is the lowest for GAM_6 corresponding to T2.

Another model selection criterion widely used in statistics is AIC (Akaike Information Criterion). Let’s look at three models:

AIC(gam_4, gam_5, gam_6)
Copy the code
##             df      AIC
## gam_4 121.4117 8912.611
## gam_5 115.8085 8932.746
## gam_6 100.1200 8868.628
Copy the code

The lowest value is in the GAM_6 model. Let’s look at the fit values again.

We can see that the model fit values gam_4 and GAM_6 are very similar. You can compare the two models using the package’s more visual and model diagnostic capabilities.

The first one is function gam.check, which draws four graphs: QQ graph of residuals, linear predictive variables and residuals, histogram of residuals, and relationship graph of fitting values and dependent variables. Let’s diagnose models GAM_4 and GAM_6.

gam.check(gam_4)
Copy the code

## ## Method: GCV Optimizer: magic ## Smoothing parameter selection converged after 7 iterations. ## The RMS GCV score gradiant at convergence was 0.2833304. ## The Hessian was positive definite. ## The estimated Model rank was 336 (maximum possible: 336) ## Model rank = 336 / 336 ## ## Basis dimension (k) checking results. Low p-value (k-index<1) may ## indicate that K is too low, especially if edf is close to k'. ## ## k 'edf k-index p-value ## te(Daily,Weekly) 335.00 119.41 1.22 1Copy the code
gam.check(gam_6)
Copy the code

## ## Method: GCV Optimizer: magic ## Smoothing parameter selection converged after 9 iterations. ## The RMS GCV score gradiant at convergence was ## The estimated Model rank was 336 (maximum possible: 336) ## Model rank = 336 / 336 ## ## Basis dimension (k) checking results. Low p-value (k-index<1) may ## indicate that K is too low, especially if edf is close to k'. ## ## k 'edf k-index p-value ## t2(Daily,Weekly) 335.00 98.12 1.18 1Copy the code

Again, we can see that the models are very similar, with only some differences in the histogram.

 

layout(matrix(1:2, nrow = 1))
plot(gam_4, rug = FALSE, se = FALSE, n2 = 80, main = "gam n.4 with te()")
plot(gam_6, rug = FALSE, se = FALSE, n2 = 80, main = "gam n.6 with t2()")
Copy the code

The model GAM_6 has more of a “wavy” profile. Therefore, this means that it has a higher degree of fit for the dependent variable and a lower smoothing factor.

 

vis.gam(gam_6, n.grid = 50, theta = 35, phi = 32, zlab = "",
        ticktype = "detailed", color = "topo", main = "t2(D, W)")
Copy the code

We can see that the highest peak is near 30 (3pm) for the Daily variable and 1 (Monday) for the Weekly variable.

 

vis.gam(gam_6, main = "t2(D, W)", plot.type = "contour",
        color = "terrain", contour.col = "black", lwd = 2)
Copy the code

Again, you can see that the peak power load is at 3:00 p.m. on Monday, very similar until Thursday, and then the load decreases over the weekend.

 


Most welcome insight

1. Use LSTM and PyTorch for time series prediction in Python

2. Long and short-term memory model LSTM is used in Python for time series prediction analysis

3. Time series (ARIMA, exponential smoothing) analysis using R language

4. R language multivariate Copula – Garch – model time series prediction

5. R language Copulas and financial time series cases

6. Use R language random wave model SV to process random fluctuations in time series

7. Tar threshold autoregressive model for R language time series

8. R language K-Shape time series clustering method for stock price time series clustering

Python3 uses ARIMA model for time series prediction