Original link:tecdat.cn/?p=20953
Original source:Tuo End number according to the tribe public number
The preface
This paper demonstrates the application of distributed lag linear and nonlinear models (DLMs and DLNMs) in time series analysis. Gasparrini et al. [2010] and Gasparrini[2011] described the development of DLMs and DLNMs and the implementation of time series data. The examples described in this article cover most of the standard applications of DLNM methods for time series data and explore the DLNM packages used to specify, summarize, and draw such models. Although these examples have specific applications to air pollution and the effects of temperature on health, they can easily be generalized to different topics and form the basis for analyzing these data sets or other time series data sources.
data
The example explores the relationship between air pollution and temperature and mortality using a time series data set, including daily observations from 1987-2000. After loading in R session, let’s take a look at the first three observations:
Date Time Year Month Doy dow Death CVD Resp Temp DPTP 1 1987-01-01 1 1987 1 1 Thursday 130 65 13-0.2777778 31.500 2 1987-01-02 2 1987 1 2 Friday 150 73 14 0.5555556 29.875 3 1987-01-03 3 1987 1 3 Saturday 101 43 11 0.5555556 27.375 rhum Pm10 O3 1 95.50 26.95607 4.376079 2 88.25 NA 4.929803 3 89.50 32.83869 3.751079Copy the code
The dataset consists of daily observation sequences from 1987-2000.
Example 1: A simple DLM
In the first example, I specified a simple DLM that evaluated the effect of PM10 on mortality while adjusting for the effect of temperature. I first build two cross basis matrices for the two predicted values and then include them in the model formula of the regression function. The effect of PM10 is assumed to be linear in the dimension of the predictors, so, from this point of view, we can define it as a simple DLM, even though the regression model also estimates the distributed lag function of temperature, which is a nonlinear term. First, I run Crossbasis () to build two intersecting basis matrices, storing them in two objects. The names of the two objects must be different in order to predict the association between them separately. The code is as follows:
cb(pm10, lag=15, argvar=list(fun="lin",
arglag=list(fun="poly",degree=4
Copy the code
In programs with time series data, the first argument x is used to specify a sequence of vectors. In this case, we assume that the effect of PM10 is linear (fun= “Lin”), while simulating the relationship with temperature through a natural cubic spline with 5 degrees of freedom (Fun = “NS”, default choice). The internal nodes (if not provided) are placed by ns () at the default isometric quantile, while the boundary nodes are at the temperature range. For the cardinality of the lag space, I use a polynomial function of degree 4 (set times =4) to specify the lag effect of PM10 for up to 15 days (minimum lag defaults to 0). The hysteresis effect of temperature is defined by two hysteresis layers (0 and 1-3), assuming that the effect within each layer is constant. The argument breaks=1 defines the lower boundary of the second interval. The method function summary () of this class provides an overview of intersecting bases (and correlation bases in two dimensions) :
CROSSBASIS FUNCTIONS
observations: 5114
range: -3.049835 to 356.1768
lag period: 0 15
total df: 5
BASIS FOR VAR:
fun: lin
intercept: FALSE
BASIS FOR LAG:
fun: poly
degree: 4
scale: 15
intercept: TRUE
Copy the code
These two cross-base objects can now be included in the model formula of the regression model. In this case, I fit the time series model, assuming a Poisson distribution, a smooth function of time, 7 df/ year (to correct for seasonal and long time trends) and the day of the week as factors:
glm(death ~ cb1.pm + cb1.temp + ns(time, 7*14) + dow,
family=quasipoisson()
Copy the code
The estimated association of PM10 predicted by the above model with a particular level of mortality can be summarized by the function Crosspred () and stored in objects with the same class:
Pred (cb1.pm, model1, at=0:20, bylag=0.2, cumul=TRUE)Copy the code
The function includes base1.pm and Model1 objects used to estimate the parameters as the first two parameters, and at = 0:20 means that the prediction must be computed for every integer value from 0 to 20 µgr/m3. By setting BY LAG = 0.2, the prediction is calculated in increments of 0.2 along the lag space. When the results are drawn, this finer mesh produces a smoother lag curve. The cumul parameter (default FALSE) indicates that the incremental cumulative association along the lag must also be included. The center is not defined with the cen argument, so the reference value is set to 0 by default (this happens with Lin ()). These predictions are now stored in pred1.pm and can be plotted in a specific way. Such as:
> plot(pred1, "slices", main=" correlation with PM10 increased by 10 ") > plot(pred1,ylab=" cumulative RR", main=" correlation with PM10 increased by 10 ")Copy the code
This function contains pred1.pm objects with stored results, with parameter “slices” defining the diagrams we want to draw for specific values of Predictor and LAG in the related dimensions. When var=10, I show a lag response relationship for a specific value of PM10, i.e. 10µgr/m3. The association is defined using a reference value of 0µgr/m3 to provide a predictive specific association for the increase of 10 units. I also chose a different color for the first image. A specific value of PM10, i.e. 10 µgr/m3. This association is defined using a reference value of 0 µgr/m3, thus providing a predictive variable-specific association for an additional 10 units. I also chose a different color for the first image. The cumul parameter indicates whether incremental cumulative associations previously stored in pred1.pm must be drawn. Figure 1A-1B shows the results. The confidence interval is set to the default value “area” for the ci parameter. In the left panel, the other parameters are passed to the plotting function Polygon () via ci.arg, and shaded lines are drawn as confidence intervals. These graphs are interpreted in two ways: the lag curve represents the increase in risk for each future day after PM10 increased by 10µgr/m3 on a given date (forward interpretation), or the contribution of the same PM10 for each past day to the increase in risk on a given date (backward interpretation). The graphs in Figures 1A-1B show that the initial increase in PM10 risk is reversed with a long lag time. The overall cumulative effect of PM10 increasing by 10 units in 15-day lag time (i.e. adding all contributions to the maximum lag time) and its 95% confidence interval can be extracted from the objects allRRfit, allRRhigh and allRRlow contained in pred1.pm, type:
> pred1
10
0.9997563
> cbind(pred1.p
[1] 0.9916871 1.0078911
Copy the code
Example 2: Seasonal analysis
The purpose of the second example is to show that the data were limited to a specific season. This analysis is unique in that it is assumed that the data is composed of multiple isometric ordered series of multiple seasons of different years, rather than a single continuous series. In this case, I evaluated the effects of ozone concentration and temperature on 5-day and 10-day lag mortality, respectively, using the same steps already seen in Section 3. First, I create a seasonal time series data set, just take the summer interval, and save it in the data box ChicagonMaps:
Sseas <- subset(NMMAPS, month %in% 6:9)
Copy the code
Again, I first create the cross basis matrix:
Cb (o3, lag=5, argvar=list(fun=" THR ", THR =40.3), arglag=list(fun="integer"), group=year)Copy the code
Parameter groups indicate variables that define multiple sequences. Each sequence must be continuous, complete, and ordered. Here, I assume that the effect of O3 is zero until it reaches 40.3 µgr/m3, then linear, and apply high threshold parameterization (Fun = “THR”). For temperature, I use a double threshold and assume that the effect is linear below 15°C and above 25°C, and zero in between. The threshold is selected using the argument thr.value (abbreviated THR), while the unspecified argument side sets the default value of the first cross-reference to “H” and the default value of the second cross-reference to “D” (given) provides two thresholds). Regarding the lag dimension, I have specified an unconstrained function for O3 that applies an integer for each lag (fun = “INTEGER”) of up to 5 days (by default, the minimum lag is equal to 0). For temperature, I have defined three ranges of lag 0-1, 2-5, and 6-10. Regression models include natural splines for days and days of the year in order to describe annual seasonal effects and long-term trends, respectively. In particular, the latter analysis has much less freedom than previous analyses, since it only needs to capture smooth annual trends. Otherwise, estimate and forecast in the same way. The code is:
glm(death ~ cb2.o3 + cb2.temp + ns(doy, 4) + ns(time,3) + dow,
family=quasipoisson())
Copy the code
The values to be predicted must be specified here: here I define integers from 0 to 65 µgr/m3 (roughly the ozone distribution range), plus a threshold and a value of 50.3µgr/m3, corresponding to an increase of 10 units above the threshold. Vectors will sort automatically. The reference-response curve modeled by THR () is automatically selected, and the center parameter may not be defined. I plotted the predictor-specific lag response with a 10-unit increase in O3, but with a confidence interval of 80%, and plotted the overall cumulative exposure response.
> plot(pred2.O3," slices", main=" latency response exceeds threshold 10 units "(80 confidence interval)) > plot(pred2.O3,"overall",xlab=" ozone ", ci="l", Main ="5 lagged aggregate cumulative associations ")Copy the code
In the first statement, the parameter CI = “bars” indicates that, unlike the default “areas” in Figure 1A-1B, confidence intervals are represented by a bar graph. In addition, the parameter ci.level = 0.80 indicates that an 80% confidence interval is plotted. Finally, I chose the point with a particular symbol based on the parameter type and PCH. In the second statement, the parameter type = “overall” indicates that the overall cumulative correlation must be drawn, the confidence interval is a line, ylim defines the range of the Y-axis, and LWD indicates the thickness of the line. Similar to the previous example, the display of confidence intervals is refined by a list of arguments specified by ci.arg, in which case it is passed to the low-level function lines (). Similar to the previous example, we can extract from PreD2.O3 the estimated overall cumulative effect of ozone concentration 10 units above the threshold (50.3−40.3µ GR /m3), with a 95% confidence interval:
> pred2. O3 $allRRfit 50.3 1.047313 (" 50.3 ")Copy the code
> cbind(allRRlow, allRRhigh)["50.3",]
[1] 1.004775 1.091652
Copy the code
The same graphs and calculations can be applied to the thermal and cold effects of temperature. For example, we can describe an increased risk of exceeding a low threshold or a high threshold of 1°C. You can repeat the preceding steps to perform this analysis.
Example 3: TWO-DIMENSIONAL DLNM
In the previous example, the effects of air pollution (PM10 and O3 respectively) are assumed to be completely linear or above the threshold. This assumption helps to explain and represent the relationship: dimensions of the predictors are never considered, and it is easy to plot specific or aggregate cumulative associations for a 10-unit increase. In contrast, when considering the nonlinear correlation of temperature, we need to adopt a two-dimensional perspective to represent the correlation of nonlinear variation along the space of predictive variables and hysteresis. In this example, I have specified a more complex DLNM that uses smooth nonlinear functions of two dimensions to estimate correlations. Although the complexity of the relationship is higher, we will see that the steps required to specify and fit the model and predict the results are exactly the same as the simple model we saw earlier, simply by selecting a different plot. Users can follow the same steps to investigate the temperature effects in the previous example and extend the graphs for PM10 and O3. In this case, I ran DLNM to study the effect of temperature and PM10 on mortality to 30 and 1 lag respectively. First, I defined the cross basis matrix. In particular, the cross basis for temperature is specified by natural and unnatural spline curves, using the functions NS () and BS () from the package spline curves. The code is as follows:
> varknots <- equalknots(temp,fun="bs",df=5,degree=2)
> lagknots <- logknots(30, 3)
Copy the code
The selection basis function of the prediction space is a linear function of the PM10 effect and a quadratic B-spline of temperature 5 degrees of freedom (FUN = “BS”) selected by the function Equalknots (). By default, nodes are placed at equidistance values in the predictor space. Regarding the lag space, I assume that the simple lag 0-1 of PM10 is parameterized (i.e., up to a single layer of lag 1, the minimum lag defaults to 0, keeping the default df=1), and I define another cubic spline, this time with a natural constraint on the temperature lag dimension (fun= “ns” default). Using the function logknots (), the nodes of the lag spline curve are placed at equal spacing values in the lag logarithmic scale. Estimates, projections and mapping of the relationship between temperature and mortality were carried out by:
> plot(pred3.temp, xlab=" temperature "lphi=30, main=" temperature effect ") Plot.title =title(" contour plot ",xlab=" temperature ",ylab=" lag ")Copy the code
Note that the predicted value here is centered on 21°C, and this point represents a reference to explain the estimated effect. This step is required here because the relationship is modeled using nonlinear functions with no obvious reference values. Use the argument by = 1 only in Crosspred () to select values that define all integer values in the range of predictive variables. The first drawing expression generates a 3D drawing, as shown in Figure 3A, where non-default viewing options are obtained with the argument Theta-phi-lphi. The second plotting expression specifies the contour plot in Figure 3b, where the title and axis labels are selected by the parameters plot.title and key.title. The graphs in Figs. 3a-3b provide a comprehensive summary of the two-dimensional expose-hysteresis – response association, but their ability to provide correlation information at predicted or specific values of hysteresis is limited. In addition, since the uncertainties of estimated associations are not reported in 3d and contour plots, they are also limited to inference purposes. More detailed analysis is provided by plotting effect surface “slices” of specific predicted and lagged values. The code is:
> plot(pred3.temp, "slices", var=-20, main=" latency reaction curves at different temperatures, Reference 21C") > for(I in 1:3) lines(pred3.temp, "slices", var=c(0,27,33)[I] > legend("topright",paste(" temperature =")Copy the code
Figure 4A-4b shows the results. FIG. 4A illustrates the hysteretic response curves for mild and extreme hot and cold temperatures (refer to 21℃) specific to -20℃, 0℃, 27℃, and 33℃. Figure 4 b
The expose-response relationship with hysteresis 0 and 5 (left column) and the hysteresis – response relationship with temperature -20℃ and 33℃ (right column) are described. The parameters var and LAG define the temperature and lag values for the “slice” to be cut on the effect surface in Figure 3A-3b. The argument ci = “n” in the first expression indicates that confidence intervals cannot be plotted. In multi-panel Figure 4b, the list parameter ci.arg is used to plot the confidence interval, which is used as a shadow line to increase gray contrast, which is more obvious here. Preliminary interpretation suggests that low temperatures have a longer risk of death than high temperatures, but not immediately, showing a “protective” effect at a lag of 0. This kind of analysis is difficult to implement with simpler models and can lose important details of associations.
Example 4: Dimension reduction DLNM
In the final example, I showed how to use the function CrossReduce () to reduce the fit of the two-dimensional DLNM to a summary represented by a wiki of parameters. First, I specify a new cross-basis matrix, run the model and make predictions in the usual way
The specified temperature cross base consists of a dual threshold function and a natural cubic spline, with cutoff points of 10°C and 25°C as the dimension of the predictor, respectively, and equally spaced node values on the logarithmic scale as the hysteresis, as shown in the previous example. Three specific abstracts can be reduced, i.e. total accumulation, lag specific, and predictive variable specific associations. The first two represent an expose-response relationship, while the third represents a lagged response relationship. The code is as follows:
credu(cb4, model4)
Copy the code
The reduction of correlation was calculated in two Spaces respectively when the lag was 5℃ and 33℃. The three objects of the “CrossReduce” class contain the modified reduction parameters of the wiki in the relevant space, which can be compared with the original model:
> length(coef(pred4))
[1] 10
> length(coef(redall)) ; length(coef(redlag))
[1] 2
[1] 2
> length(coef(redvar))
Copy the code
As expected, the number of parameters has been reduced to 2 for the space of predictive variables (consistent with two-threshold parameterization) and 5 for the lag space (consistent with the dimensions of the natural cubic spline curve base). However, the prediction of the original fit and the simplified fit are the same, as shown in FIG. 5A:
> plot(pred4, "overall", xlab=" temperature ", ylab="RR", ylim=c(0.8,1.6), main=" overall cumulative correlation ") > lines(redall, Ci = "lines", col = 4, lty = 2) > legend (" top ", c (" original ", "dimension"), col = c (2, 4), lty = 1:2, ins = 0.1)Copy the code
This process can also be illustrated by reconstructing the relationship between the original wiki and the predicted given correction parameter. As an example, I used Onebasis () to reproduce the natural cubic spline curve for lagging space and predicted the results:
The spline base is calculated based on an integer value corresponding to the hysteresis 0:30. The nodes have the same value as the original crossover base and are not centered. The intercept is the default value of the hysteresis base. The predicted value of 33℃ was calculated with the modified parameter. The same fit of the original, simplified and reconstructed predicted values is shown in FIG. 5b, which is derived from the following formula:
> the plot (pred4, "slices", var = 33, the main = "33 ° C" associated with a particular on predicting variables) > legend (" top ", C (" original ", "dimension reduction", "refactoring"),Copy the code
Most welcome insight
1. Use LSTM and PyTorch for time series prediction in Python
2. Long and short-term memory model LSTM is used in Python for time series prediction analysis
3. Time series (ARIMA, exponential smoothing) analysis using R language
4. R language multivariate Copula – Garch – model time series prediction
5. R language Copulas and financial time series cases
6. Use R language random wave model SV to process random fluctuations in time series
7. Tar threshold autoregressive model for R language time series
8. R language K-Shape time series clustering method for stock price time series clustering
Python3 uses ARIMA model for time series prediction