Original link: tecdat.cn/?p=11386

Original source: Tuoduan Data Tribe WeChat public account

 

In this article, I will start with a basic linear model and then try to find a more suitable linear model from there.

Data preprocessing

Since the air quality data set contains some missing values, we will delete them before starting to fit the model, and select 70% of the samples for training and use the rest for testing:

data(airquality)
ozone <- subset(na.omit(airquality), 
        select = c("Ozone", "Solar.R", "Wind", "Temp"))
set.seed(123)
N.train <- ceiling(0.7 * nrow(ozone))
N.test <- nrow(ozone) - N.train
trainset <- sample(seq_len(nrow(ozone)), N.train)
testset <- setdiff(seq_len(nrow(ozone)), trainset)

Ordinary least squares model

As a benchmark model, we will use an ordinary least squares (OLS) model. Before defining the model, we define a function for plotting linear models:

rsquared <- function(test.preds, test.labels) {
    return(round(cor(test.preds, test.labels)^2, 3))
}
plot.linear.model <- function(model, test.preds = NULL, test.labels = NULL, 
                            test.only = FALSE) {
    library(ggplot2)
    # fitted values and residuals on the training data
    train.labels <- model$model[, 1]
    train.preds <- fitted(model)
    plot.df <- data.frame("Prediction" = train.preds, 
                        "Outcome" = train.labels, "DataSet" = "training")
    plot.res.df <- data.frame("x" = train.labels, "y" = train.preds,
                    "x1" = train.labels, "y2" = train.labels, "DataSet" = "training")
    r.squared <- NULL
    if (!is.null(test.preds) && !is.null(test.labels)) {
        # store predicted points: 
        test.df <- data.frame("Prediction" = test.preds, 
                            "Outcome" = test.labels, "DataSet" = "test")
        # store residuals for predictions on the test data
        test.residuals <- test.labels - test.preds
        test.res.df <- data.frame("x" = test.labels, "y" = test.preds,
                        "x1" = test.labels, "y2" = test.preds + test.residuals,
                         "DataSet" = "test")
        # append to existing data (or keep only the test data if requested)
        plot.df <- if (test.only) test.df else rbind(plot.df, test.df)
        plot.res.df <- if (test.only) test.res.df else rbind(plot.res.df, test.res.df)
        # annotate model with R^2 value on the test data
        r.squared <- rsquared(test.preds, test.labels)
    }
    # scatter plot of outcomes vs predictions, with residuals drawn as segments
    p <- ggplot() +
        geom_point(data = plot.df, aes(x = Outcome, y = Prediction, colour = DataSet)) +
        geom_segment(data = plot.res.df, aes(x = x, y = y, xend = x1, yend = y2,
                    colour = DataSet), alpha = 0.3) +
        geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
        ggtitle(paste0("R^2 (test): ", r.squared))
    return(p)
}

Now we build the OLS model with lm and examine the confidence intervals of the coefficient estimates:


model <- lm(Ozone ~ Solar.R + Temp + Wind, data = ozone[trainset, ])
confint(model)
##                     2.5 %       97.5 %
## (Intercept) -1.106450e+02 -20.88636548
## Solar.R      7.153968e-03   0.09902534
## Temp         1.054497e+00   2.07190804
## Wind        -3.992315e+00  -1.24576713

We see that the model is quite uncertain about the intercept estimate. Let's see whether the model nevertheless performs well:
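A minimal sketch of how the test predictions and this plot can be produced with the helper defined above:

test.preds <- predict(model, newdata = ozone[testset, ])
plot.linear.model(model, test.preds, ozone$Ozone[testset])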

Looking at the fit of the model, there are two main observations:

  • High ozone levels are underestimated
  • Negative ozone levels are predicted

Let’s examine these two issues in more detail.

High ozone levels are underestimated

As can be seen from the plot, the linear model fits the outcomes well while the ozone level is in the range [0, 100]. However, when the actually observed ozone concentration is above 100, the model underestimates the value considerably.

We should ask whether these high ozone levels could simply be the result of measurement errors. Given typical ozone levels, the measurements seem plausible: the highest concentration in the data is 168 PPB (parts per billion), and typical peak concentrations in U.S. cities range from 150 to 510 PPB. So we should indeed care about these outliers; underestimating high ozone levels would be particularly harmful because they indicate a health risk. Let's investigate the data to determine why the model struggles with these observations.
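One way to inspect the residual distribution is a histogram of the test-set residuals; a minimal sketch:

# residuals of the OLS model on the test data
test.residuals <- ozone$Ozone[testset] - test.preds
hist(test.residuals, breaks = 20)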

 

The histogram shows that there is indeed a problem in the right tail of the residual distribution. Since the residuals are not really normally distributed, a linear model is not ideal; in fact, the residuals rather seem to follow some form of Poisson distribution. To find out why the least-squares model fits the outliers so poorly, let's take another look at the data.

summary(ozone[ozone$Ozone > 100, ])  # observations with high ozone levels (threshold of 100 assumed)
##      Ozone          Solar.R           Wind            Temp      
##  1st Qu.:115.8   1st Qu.:223.5   1st Qu.:3.550   1st Qu.:81.75  
##  Median :120.0   Median :231.5   Median :4.050   Median :86.50  
##  Mean   :128.0   Mean   :236.2   Mean   :4.583   Mean   :86.17  
##  3rd Qu.:131.8   3rd Qu.:250.8   3rd Qu.:5.300   3rd Qu.:89.75  
##  Max.   :168.0   Max.   :269.0   Max.   :8.000   Max.   :94.00  
summary(ozone)
##      Ozone          Solar.R           Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  

From the distributions of the two groups of observations, we do not see a significant difference between the high-ozone observations and the other samples. However, the model prediction plot above reveals the culprit: most of the data points are centered around the [0, 50] ozone range. To fit these observations well, the model has a large negative intercept of -65.77, which is why it underestimates ozone levels for the larger observed values, which are underrepresented in the training data.

The model predicts negative ozone levels

The model typically predicts negative ozone levels when the observed ozone concentration is close to 0. Of course, this cannot be correct, because concentrations cannot drop below 0. Again, we investigate the data to find out why the model still makes these predictions.

To do this, we select all observations whose ozone level lies in the lowest 5th percentile and investigate their feature values:


# compare observations with low ozone with the whole data set
idx <- which(ozone$Ozone <= quantile(ozone$Ozone, 0.05))  # lowest 5% (exact threshold assumed)
summary(ozone[idx, ])
##      Ozone         Solar.R           Wind            Temp     
##  1st Qu.:4.5   1st Qu.:20.50   1st Qu.: 9.85   1st Qu.:59.5  
##  Median :6.5   Median :36.50   Median :12.30   Median :61.0  
##  Mean   :5.5   Mean   :37.83   Mean   :13.75   Mean   :64.5  
##  3rd Qu.:7.0   3rd Qu.:48.75   3rd Qu.:17.38   3rd Qu.:67.0  
##  Max.   :8.0   Max.   :78.00   Max.   :20.10   Max.   :80.0  
summary(ozone)
##      Ozone          Solar.R           Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  

We found that at low ozone levels, the average solar radiation was much lower and the average wind speed was much higher. To understand why we have negative predictions, let’s now look at the model coefficients:

coefficients(model)
##  (Intercept)      Solar.R         Temp         Wind 
## -65.76603538   0.05308965   1.56320267  -2.61904128 

Therefore, for low ozone levels, the positive contributions of Solar.R and Temp cannot compensate for the large negative intercept and the negative Wind term, because for these observations the values of Solar.R are low while the values of Wind are high.

Dealing with negative ozone level predictions

Let us first deal with the problem of predicting negative ozone levels.

Truncated least squares model

An easy way to deal with negative predictions is to replace them with the smallest possible value. That way, if we hand the model to a customer, he will not immediately suspect that something is wrong with it. We can do this with the following function:
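A minimal sketch of such a helper, assuming a cutoff of 0 (only the name predict.nonnegative appears in the original code):

predict.nonnegative <- function(model, features) {
    preds <- predict(model, newdata = features)
    # replace negative predictions with the smallest possible concentration, 0
    preds[preds < 0] <- 0
    return(preds)
}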

Now let's verify how this improves our predictions on the test data. Remember that the R² of the original model was 0.604.

nonnegative.preds <- predict.nonnegative(model, ozone[testset,])
# plot only the test predictions to verify the results
plot.linear.model(model, nonnegative.preds, ozone$Ozone[testset], test.only = TRUE)

As we can see, this hack mitigates the problem and increases R² to 0.646. However, correcting negative values in this way does not change the fact that our model is misspecified, because the fitting procedure never took into account that negative values are impossible.

Poisson regression

To prevent negative estimates, we can use a generalized linear model (GLM) that assumes a Poisson distribution rather than a normal distribution:
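A minimal sketch of such a Poisson GLM, assuming the same three predictors as before:

pois.model <- glm(Ozone ~ Solar.R + Temp + Wind, data = ozone[trainset, ],
                family = "poisson")
# predict on the response scale to obtain ozone levels rather than log rates
pois.preds <- predict(pois.model, newdata = ozone[testset, ], type = "response")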


plot.linear.model(pois.model, pois.preds, ozone$Ozone[testset])

The R² value of 0.616 indicates that Poisson regression is slightly better than ordinary least squares (0.604). However, its performance is not better than that of the model whose negative predictions were truncated to 0 (0.646). This may be because the variance of the ozone levels is much larger than the Poisson model assumes:

# mean and variance should be the same for Poisson distribution
mean(ozone$Ozone)
## [1] 42.0991
var(ozone$Ozone)
## [1] 1107.29

Logarithmic transformation

Another way to deal with negative predictions is to take the logarithm of the outcome:
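A minimal sketch, assuming the predictions are transformed back to the original scale with exp:

log.model <- lm(log(Ozone) ~ Solar.R + Temp + Wind, data = ozone[trainset, ])
# transform the predictions back to the original scale
log.preds <- exp(predict(log.model, newdata = ozone[testset, ]))
test.labels <- ozone$Ozone[testset]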


print(rsquared(log.preds, test.labels))
## [1] 0.616

Note that although the result here is identical to that of Poisson regression, the two approaches are not the same in general.

Dealing with the underestimation of high ozone levels

Ideally, we would have more measurements at high ozone levels. However, since we cannot collect more data, we need to work with what we have. One way to deal with the underestimation of high ozone levels is to adjust the loss function.

Weighted regression

Using weighted regression, we can influence the impact of the outliers' residuals. To do this, we calculate the z-scores of the ozone levels and use their exponential as weights in the model, which increases the influence of the outliers.
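A minimal sketch of this weighting scheme (the variable name weights.train is an assumption):

# z-scores of the training outcomes; exp(z) gives high ozone levels large weights
train.labels <- ozone$Ozone[trainset]
z <- (train.labels - mean(train.labels)) / sd(train.labels)
weights.train <- exp(z)
weight.model <- lm(Ozone ~ Solar.R + Temp + Wind, data = ozone[trainset, ],
                weights = weights.train)
weight.preds <- predict(weight.model, newdata = ozone[testset, ])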


plot.linear.model(weight.model, weight.preds, ozone$Ozone[testset])

 

This model is clearly more appropriate than the ordinary least squares model because it handles the outliers better.

Sampling

Let's sample from the training data so that observations with low ozone levels are no longer over-represented. This is similar to weighted regression; however, instead of giving observations with low ozone levels a smaller weight, we give them a weight of 0.


print(paste0("N (trainset before): ".length(trainset)))
Copy the code
## [1] "N (trainset before): 78"

print(paste0("N (trainset after): ".length(trainset.sampled)))
Copy the code
## [1] "N (trainset after): 48"

Now, let’s build a new model based on the sampled data:
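A minimal sketch, assuming a plain OLS fit on the sampled training indices (the sampling itself replaces the weighting):

sampled.model <- lm(Ozone ~ Solar.R + Temp + Wind, data = ozone[trainset.sampled, ])
sampled.preds <- predict(sampled.model, newdata = ozone[testset, ])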


rsquared(sampled.preds, test.labels)
## [1] 0.612

As we have seen, the model based on sampled data does not perform better than the weighted model.

Combining both approaches

Having seen that Poisson regression is useful for preventing negative estimates and that weighting is a successful strategy for improving the prediction of outliers, we should try to combine the two, which leads to weighted Poisson regression.

Weighted Poisson regression
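A minimal sketch, assuming the same predictors and the exponentiated z-score weights (weights.train) from the weighted regression sketch above:

w.pois.model <- glm(Ozone ~ Solar.R + Temp + Wind, data = ozone[trainset, ],
                family = "poisson", weights = weights.train)
w.pois.preds <- predict(w.pois.model, newdata = ozone[testset, ], type = "response")
p.w.pois <- plot.linear.model(w.pois.model, w.pois.preds, ozone$Ozone[testset])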


p.w.pois

 

As we can see, this model combines the advantage of Poisson regression (non-negative predictions) with that of using weights (outliers are no longer underestimated). Indeed, this model has the highest R² so far (0.652 versus 0.646 for the truncated linear model). Let's investigate the model coefficients:

coefficients(w.pois.model)
##  (Intercept)      Solar.R         Temp         Wind 
##  2.069357230  0.002226422  0.029252172 -0.104778731 

The model is still dominated by the intercept, but it is now positive. Thus, even if all other features were 0, the model's prediction would still be positive.

But what about the Poisson regression assumption that the mean equals the variance?

print(paste0(c("Var: "."Mean: "), c(round(var(ozone$Ozone), 2),
        round(mean(ozone$Ozone), 2))))
## [1] "Var: 1107.29" "Mean: 42.1"

This assumption of the model is clearly not satisfied: since the variance is much larger than the mean, we are dealing with overdispersion.

Weighted negative binomial model

Therefore, we should try a model that is better suited for overdispersion, such as the negative binomial model:
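A minimal sketch using glm.nb from the MASS package, again assuming the same predictors and weights:

library(MASS)
model.nb <- glm.nb(Ozone ~ Solar.R + Temp + Wind, data = ozone[trainset, ],
                weights = weights.train)
preds.nb <- predict(model.nb, newdata = ozone[testset, ], type = "response")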


plot.linear.model(model.nb, preds.nb, test.labels)

 

The weighted negative binomial model is therefore not better than the weighted Poisson model in terms of test-set performance. However, for inference this model should be preferable because its assumptions are not violated. Looking at the two models, it is clear that their p-values differ considerably:

coef(summary(w.pois.model))
##                 Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept)  2.069357230 0.2536660583   8.157801 3.411790e-16
## Solar.R      0.002226422 0.0003373846   6.599061 4.137701e-11
## Temp         0.029252172 0.0027619436  10.591155 3.275269e-26
## Wind        -0.104778731 0.0064637151 -16.210295 4.265135e-59
coef(summary(model.nb))
##                 Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept)  1.241627650 0.640878750  1.937383 5.269853e-02
## Solar.R      0.002202194 0.000778691  2.828071 4.682941e-03
## Temp         0.037756464 0.007139521  5.288375 1.234078e-07
## Wind        -0.088389583 0.016333237 -5.411639 6.245051e-08

While the Poisson model claims that all coefficients are highly significant, the negative binomial model shows that the intercept is not significant. The confidence bands of the negative binomial model can be obtained as follows:


CI.int <- 0.95
# inverse link function and link-scale predictions with standard errors
ilink <- family(model.nb)$linkinv
preds.nb.ci <- predict(model.nb, newdata = ozone[testset, ], type = "link", se.fit = TRUE)
ci.factor <- qnorm(1 - (1 - CI.int) / 2)
# calculate CIs:
df <- data.frame(ozone[testset,], "PredictedOzone" = ilink(preds.nb.ci$fit), "Lower" = ilink(preds.nb.ci$fit - ci.factor * preds.nb.ci$se.fit),
                "Upper" = ilink(preds.nb.ci$fit + ci.factor * preds.nb.ci$se.fit))

Using the constructed data frame, which contains the feature values of the test set together with the predictions and their confidence bands, we can plot how the estimates fluctuate with the independent variables:

# plot estimates vs individual features in the test set (plot details are an approximation)
library(ggplot2)
library(gridExtra)
plots <- lapply(c("Wind", "Temp", "Solar.R"), function(feature)
    ggplot(df, aes(x = .data[[feature]], y = PredictedOzone)) + geom_point() +
        geom_ribbon(aes(ymin = Lower, ymax = Upper), alpha = 0.3))
grid.arrange(plots[[1]], plots[[2]], plots[[3]])

These plots tell us two things:

  • There is a clear linear relationship with Wind and Temp: the estimated ozone level decreases as Wind increases and increases as Temp increases.
  • The model is most confident at low ozone levels, but less confident at high ones

Data set expansion

With the model optimized, we now return to the initial data set. Remember that we removed all observations with missing values at the beginning of the analysis? Well, that is not ideal, because we threw away valuable information that could have been used to obtain a better model.

Investigating the missing values

Let’s first investigate the missing values:


# return to the full data set, including the observations with missing values
ozone <- subset(airquality, select = c("Ozone", "Solar.R", "Wind", "Temp"))
na.idx <- which(!complete.cases(ozone))
# ratio of missing values
ratio.missing <- length(na.idx) / nrow(ozone)
print(paste0(round(ratio.missing * 100, 3), "%"))
# # [1] "27.451%"Copy the code
# which features are missing most often?
nbr.missing <- apply(ozone, 2, function(x) length(which(is.na(x))))
print(nbr.missing)
##   Ozone Solar.R    Wind    Temp 
##      37       7       0       0
# multiple features missing in one observation?
nbr.missing <- apply(ozone, 1, function(x) length(which(is.na(x))))
table(nbr.missing)
## nbr.missing
##   0   1   2 
## 111  40   2

The investigation reveals that a considerable number of observations were previously excluded because of missing values. More specifically, the only features with missing values are Ozone (37 times) and Solar.R (7 times). Usually only one feature was missing per observation (40 times), and rarely two (2 times).

Adjusting the training and test indices

To ensure that the test is performed on the same observations as before, we have to map the previous indices to the full air quality data set:


trainset <- c(trainset, na.idx)
testset <- setdiff(seq_len(nrow(ozone)), trainset)

Imputing the missing values

To obtain estimates of the missing values, we can use imputation. The idea is to use the known features to build predictive models that estimate the missing features.
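A minimal sketch using aregImpute from the Hmisc package; the number of imputations and the use of impute.transcan to obtain a single completed data set are assumptions:

library(Hmisc)
set.seed(123)
# multiple imputation of the missing Ozone and Solar.R values
imputation <- aregImpute(~ Ozone + Solar.R + Wind + Temp, data = ozone, n.impute = 10)
# fill in the gaps using the values from the first imputation
imputed.data <- as.data.frame(impute.transcan(imputation, imputation = 1, data = ozone,
                            list.out = TRUE, pr = FALSE, check = FALSE))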


summary(as.numeric(imputed.data$Ozone))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   16.00   30.00   41.66   59.00  168.00 

Note that aregImpute performs multiple imputation using different bootstrap samples; the number of imputations can be set with the n.impute parameter. Since we want to use the imputations from all runs rather than a single run, we define the model using the fit.mult.impute function:

# compute new weights

plot.linear.model(fmi, fmi.preds, ozone$Ozone[testset])

 

Let's use just a single imputation so that we can specify the weights again:


rsquared(w.pois.preds.imputed, imputed.data$Ozone[testset])
## [1] 0.431

In this case, the weighted Poisson model based on the imputed data does not perform better than the model that simply excludes the missing data. This suggests that imputing the missing values introduces more noise into the data than usable signal. A possible explanation is that the samples with missing values have a different distribution than the samples for which all measurements are available.

Summary

We started with an OLS regression model (R² = 0.604) and tried to find a more suitable linear model. The first idea was to truncate the model's predictions at 0 (R² = 0.646). To predict outliers more accurately, we trained a weighted linear regression model (R² = 0.621). Next, to predict only positive values, we trained a weighted Poisson regression model (R² = 0.652). To deal with the overdispersion problem of the Poisson model, we built a weighted negative binomial model. Although this model does not perform as well as the weighted Poisson model (R² = 0.638), it may be preferable for inference.

Thereafter, we attempted to further improve the model by imputing the missing values with the Hmisc package. Although the resulting models were better than the original OLS model, they did not achieve higher performance than before (R² = 0.627).

So, what is the best model? In terms of the correctness of the model assumptions, it is the weighted negative binomial model. In terms of the coefficient of determination, R², it is the weighted Poisson regression model. Therefore, for the purpose of predicting ozone levels, I would choose the weighted Poisson regression model.

You may ask: was all this work worth it? Indeed, at the 5% level there is a significant difference between the predictions of the initial model and those of the weighted Poisson model:
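Such a comparison can be made, for example, with a paired Wilcoxon signed rank test on the two vectors of test predictions (test.preds for the OLS model, w.pois.preds for the weighted Poisson model):

wilcox.test(test.preds, w.pois.preds, paired = TRUE)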

## 
##  Wilcoxon signed rank test
## 
## data:  test.preds and w.pois.preds
## V = 57, p-value = 1.605e-05
## alternative hypothesis: true location shift is not equal to 0

When we compare them, the differences between the models become apparent:

In summary, we moved from a model that predicts negative values and underestimates high ozone levels (the OLS model, shown on the left) to a model without these obvious shortcomings (the weighted Poisson model, on the right).