Original link:tecdat.cn/?p=23038

Original source: Tuoduan Data Tribe WeChat official account

Introduction

Suppose we need to design a sample survey and that a complete sampling frame is available, that is, a list of the units of the target population together with identifying and auxiliary information. If the sample design is stratified, we need to decide how to form the strata in the population in order to get the maximum advantage from the available auxiliary information.

In other words, we must decide how to combine the values of the auxiliary variables (from now on, the X variables) into a new variable, called the stratification variable.

To do this, we must take the target variables of the survey (the Y variables) into account: if, for the purpose of stratification, we choose the X variables that are most correlated with the Y variables, the efficiency of the samples drawn from the resulting stratified frame increases greatly.

Each combination of the values of the X variables determines a particular stratification of the target population, that is, a possible solution to the problem of "best" stratification. Here, by optimal stratification we mean the stratification that guarantees the minimum sample cost sufficient to satisfy the precision constraints set on the estimates of the target variables Y (constraints expressed as the maximum allowable coefficient of variation in the different domains of interest).

When the cost of data collection is uniform across strata, the total cost is proportional to the total sample size. In general, the number of possible alternative stratifications for a given population can be very large, depending on the number of variables and on the number of their values, and in these cases it is not possible to enumerate them all in order to identify the best one. A very convenient solution is to adopt an evolutionary approach, namely the application of a genetic algorithm, which may converge to a near-optimal solution after a limited number of iterations.

Steps

The optimization of the sampling design begins with providing a sampling frame, defining the target estimates of the survey, and setting precision limits on them. Then, based on an analysis of the correlation between the two sets of variables (stratification and target), the stratification variables must be selected among those available in the frame. When the selected stratification variables are a mix of categorical and continuous variables, the continuous ones should be categorized (for example, using a k-means clustering algorithm) so that all of them can be handled homogeneously. Conversely, if the stratification variables are all of continuous type, the "continuous" method can be used to perform the optimization step directly. It is also possible to perform both optimizations, compare the results and choose the more convenient one.

Before running the optimization with the genetic algorithm, it is advisable to run a different, much faster optimization based on the k-means algorithm, which serves two purposes:

  • provide a hint about a suitable number of final strata;
  • obtain a "good" initial solution to be passed as a "suggestion" to the genetic algorithm, in order to accelerate its convergence to the final solution.

In the optimization step it is also possible to indicate a set of sampling units ("take-all" strata) that must be selected in any case. After the optimization, the quality of the solution can be evaluated through a simulation that selects a large number of samples from the frame and computes the sampling variance and bias of all the target variables. It is also possible to "adjust" the sample size of the optimized plan to the available budget: if a larger sample size is allowed, the sampling rate in each stratum is increased proportionally until the new total sample size is reached; if the sample size has to be reduced, the opposite is done.

Finally, the sample is selected.

In the following sections, we illustrate each step starting from a real sampling frame.

Preparation of the inputs required by the optimization step

The frame

For simplicity, let’s consider a subset of the data set.


head(mun)

To limit processing time, we selected only the first three regions and the variables of interest for our example. Each row of this dataset contains information on a municipality, identified by its municipality number and name, and belonging to one of the three selected regions.

Suppose we want to plan a sample survey whose target estimates Ys are the total wooded area and the total area with buildings in each of the three regions (the domains of interest). We assume that the total area and the total population of each municipality are always up to date. Let us look at the correlation matrix.

cor(mun[,c(4:8)])

We can see that the correlation between the wooded area and the area with buildings, and the correlation between the area with buildings and the total population, are both high (0.77 and 0.87 respectively), so we decide to choose the area with buildings and the total population as the stratification variables X in our frame.

First, we decide to treat the stratification variables as categorical, so we have to categorize them. A suitable way to do this is to apply the k-means clustering algorithm.
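The code for this categorization step was not preserved in the post. As a minimal sketch, assuming the two auxiliary variables are stored in hypothetical columns named pop and build_area, the binning can be done with base R's kmeans():

# Sketch: turn the two continuous auxiliary variables into categorical
# stratification variables with k-means (10 classes each, an arbitrary choice).
# 'pop' and 'build_area' are placeholder column names.
set.seed(123)
mun$X1 <- kmeans(mun$pop,        centers = 10, nstart = 25)$cluster
mun$X2 <- kmeans(mun$build_area, centers = 10, nstart = 25)$cluster
table(mun$X1, mun$X2)   # size of the atomic strata obtained by crossing X1 and X2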

We can now define the frame data frame in the required format, organizing the data in the structure needed for further processing.

frame1 <- buildFrameDF(df = mun, domainvalue = "REG")   # the id, X and Y arguments were omitted in the original

head(frame1)
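Since the other arguments of this call were lost, here is a hedged reconstruction, assuming the buildFrameDF() interface of the SamplingStrata package (which the function fragments in this post appear to come from) and placeholder column names:

library(SamplingStrata)
# Placeholder column names: adapt them to the actual variables of 'mun'
frame1 <- buildFrameDF(df = mun,
                       id = "municipality_number",                   # unit identifier
                       X = c("X1", "X2"),                            # categorized stratification variables
                       Y = c("area_with_buildings", "wooded_area"),  # target variables
                       domainvalue = "REG")                          # domain of interest
head(frame1)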

The strata data frame

This data frame is not strictly required as an input, because it can be generated automatically from the frame data frame. However, we need it here to analyse the initial stratification of the frame and the sample size that would be required without any optimization.

strata1 <- buildStrataDF(frame1)

Each row of this data frame contains information on a given stratum (obtained by cross-classifying the units by the values of the X variables), namely:

  • the stratum identifier (named "strato");
  • the values of the m auxiliary variables (named X1 to Xm) corresponding to those in the frame;
  • the total number of population units in the stratum (named "N");
  • a flag (named "cens") indicating whether the stratum is to be censused (= 1) or sampled (= 0);
  • a cost variable, representing the cost of interviewing one unit in the stratum;
  • the mean and standard deviation of each target variable Y (named "Mi" and "Si" respectively);
  • the value of the domain of interest to which the stratum belongs ("DOM1").

Precision constraints

The error data frame contains the precision constraints set on the target estimates. This means defining a maximum coefficient of variation for each target variable and for each domain value. Each row of this data frame refers to the precision constraints of one particular subdomain of interest, identified by the domainvalue value. In our case, we define the constraints as follows.

ndom <- length(unique(mun$REG))

cv   # the error data frame; one way to build it is sketched below

This example sets the precision constraints (maximum allowed CV equal to 10%) for the variables Y1 and Y2, and these constraints are the same for the 3 different subdomains (values of DOM1) at the domain level. Of course, the precision constraints could be differentiated by region. It is important that the values of "domainvalue" are the same as those in the frame data frame and correspond to the values of the variable "DOM1" in the strata data frame.
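The construction of the cv data frame was omitted in the original post. Based on the description above (a 10% maximum CV for both Y1 and Y2 in each of the three domains), and assuming the usual SamplingStrata layout (DOM, CV1, ..., CVm, domainvalue), it could be built roughly like this:

ndom <- length(unique(mun$REG))                  # 3 domains
cv <- data.frame(DOM = rep("DOM1", ndom),
                 CV1 = rep(0.10, ndom),          # max CV allowed for Y1
                 CV2 = rep(0.10, ndom),          # max CV allowed for Y2
                 domainvalue = 1:ndom)           # assuming the domains are coded 1, 2, 3
cv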

Now we use the function bethel (Bethel (1989)) to determine the total sample size, and its allocation among the strata, under the current stratification.

allocation <- bethel(strata1, cv)
sum(allocation)

This (570 units) is the total sample size required to satisfy the precision constraints under the current stratification, i.e. before any optimization.

Optimization

A single function performs the optimization step; depending on the chosen method, it internally calls one of three different optimization functions:

  1. optimization on atomic strata (method = "atomic": the stratification variables are categorical);
  2. optimization on continuous stratification variables (method = "continuous");
  3. optimization of spatial stratification (method = "spatial": the stratification variables are continuous or categorical).

Here we report the most important parameters related to the three methods:

parameter – description
frame – the data frame containing the information on the sampling frame
cens – the data frame containing the units that must be selected in any case
n – the number of final strata to obtain from the optimization
err – the data frame containing the precision constraints
min – the minimum number of units that must be allocated in each stratum
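These generic names presumably correspond to arguments of the SamplingStrata wrapper optimStrata(); under that assumption, frame maps to framesamp, cens to framecens, n to nStrata, err to errors and min to minnumstr. The signature can be checked directly:

# Assumption: the optimization wrapper is SamplingStrata::optimStrata().
# frame -> framesamp, cens -> framecens, n -> nStrata, err -> errors, min -> minnumstr
library(SamplingStrata)
args(optimStrata)   # print the signature to check the available arguments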

The ‘atomic’ method

We run the optimization step a first time (since the stratification variables are categorical, the atomic method is required).

solution1 <- optimStrata(method = "atomic",
                         framesamp = frame1,   # frame and precision constraints built above
                         errors = cv,
                         pops = 10)

The run produces the solutions of three different optimization problems, one per domain. The convergence from the initial solution to the final solution is illustrated in the figure: the iterations performed, from 1 to the maximum, are reported on the x-axis, while the sample size required to satisfy the precision constraints is reported on the y-axis. The upper (red) line represents the mean sample size at each iteration, while the lower (black) line represents the best solution found up to that iteration.

We can calculate the expected CVs of this solution by executing the following function.
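The corresponding code was lost; assuming the SamplingStrata helper expected_CV(), the check presumably looks like this:

# Expected coefficients of variation of the optimized solution (a sketch)
expected_CV(solution1$aggr_strata)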

The total sample size that satisfies the precision constraints is much lower than the one obtained by simply applying the Bethel algorithm to the initial stratification, but it is still not satisfactory.

To explore other solutions, we may want to treat each unit in the sampling frame as an atomic stratum and let the optimization step aggregate them on the basis of the values of the Y variables. Since we must in any case specify at least one X variable, we can use a simple progressive number for this purpose.


frame2 <- buildFrameDF(df = mun)   # the other arguments were omitted in the original
head(frame2)
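A hedged reconstruction of this step, with a simple progressive number used as the only X variable (the id and Y column names are placeholders, as before):

mun$progr <- 1:nrow(mun)     # a simple increasing number used as the X variable
frame2 <- buildFrameDF(df = mun,
                       id = "municipality_number",                   # placeholder
                       X = "progr",
                       Y = c("area_with_buildings", "wooded_area"),  # placeholders
                       domainvalue = "REG")
head(frame2)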

We can afford this approach because the number of units in the frame is small.

Even so, processing the resulting 1,823 atomic strata can be slow.

To speed up convergence towards the optimal solution, an initial solution can be supplied as a "suggestion". This initial solution is obtained by clustering the atomic strata on the basis of the mean values of all the target variables Y. The number of clusters requiring the minimum sample size to satisfy the precision constraints is retained as the best one; moreover, the best number of clusters is determined inside each domain. The maximum number of aggregated strata to obtain can also be indicated.

strata2 <- buildStrataDF(frame2, progress = FALSE)
kmean2 <- KmeansSolution(strata = strata2,
                         errors = cv,
                         maxclusters = 10)

The overall solution is obtained by concatenating the optimal clusters found in each domain. The result is a data frame with two columns: the first indicates the cluster (suggested stratum), the second the domain. On this basis we can compute the most convenient final number of strata for each domain.

nstrata2 <- tapply(kmean2$suggestions,
                   kmean2$domainvalue,
                   FUN = function(x) length(unique(x)))

We can now provide both the initial solution and the number of strata nstrata2 as input to the optimization step.

solution2 <- optimStrata(method = "atomic",
                         nStrata = nstrata2)   # other arguments omitted in the original
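How the initial solution is passed was not preserved; assuming optimStrata() accepts it through a suggestions argument (as in SamplingStrata), the full call would look roughly like this:

solution2 <- optimStrata(method      = "atomic",
                         framesamp   = frame2,
                         errors      = cv,
                         nStrata     = nstrata2,
                         suggestions = kmean2,   # initial solution from KmeansSolution()
                         pops        = 10)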

Note that the solution obtained in this run is significantly better, in terms of sample size, than the previous one.

The ‘continuous’ method

The last thing left to do is to test the ‘continuous’ method.

First, we must redefine the frame data frame as follows.

frame3 <- buildFrameDF(df = mun,
                       id = "municipality_number",
                       X = c("population", "building_area"),
                       Y = c("area_with_buildings", "wooded_area"),
                       domainvalue = "REG")
head(frame3)

Also in this case, we want to generate an initial solution.

kmean3 <- KmeansSolution2(frame = frame3,
                          errors = cv,
                          maxclusters = 10)

nstrata3 <- tapply(kmean3$suggestions,
                   kmean3$domainvalue,
                   FUN = function(x) length(unique(x)))

Notice that this time we call a different function, which requires the frame data frame directly rather than the strata data frame. In addition, an intermediate step is needed to prepare the suggestions for the optimization.

We can now run the optimization step with the ‘continuous’ method.
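The code of this run was not preserved either. A sketch, assuming that the "intermediate step" mentioned above is SamplingStrata's prepareSuggestion() and that the optimization is again done with optimStrata():

# Prepare the k-means suggestion for the continuous method (assumed helper)
sugg3 <- prepareSuggestion(kmean = kmean3, frame = frame3, nstrat = nstrata3)

# Optimization with the 'continuous' method
solution3 <- optimStrata(method      = "continuous",
                         framesamp   = frame3,
                         errors      = cv,
                         nStrata     = nstrata3,
                         suggestions = sugg3)
framenew    <- solution3$framenew      # frame updated with the optimized strata
aggr_strata <- solution3$aggr_strata   # information on the optimized strata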

The total sample size required by this solution is the lowest obtained so far, so we decide to adopt it.

Analysis of the results

Structure of the strata

The results of the run are contained in a list ("solution") with three elements:

  1. indices: a vector of indices indicating, for each atomic stratum (atomic method) or for each unit in the frame (continuous method), the aggregated stratum it belongs to;
  2. framenew: the initial frame, updated with the indication of the optimized stratum each unit belongs to;
  3. aggr_strata: a data frame containing the information on the optimized strata.

When the stratification variables are continuous and the continuous (or spatial) method is used, detailed information on the optimized stratification structure can be obtained.

summaryStrata(framenew,
              aggr_strata,
              progress = FALSE)

For each optimized stratum, the total number of population units and the sampling rate are reported, together with the ranges of the stratification variables, which describe the characteristics of each stratum.

If the number of stratification variables is limited, as in our case, the stratification can be visualized by selecting a pair of variables and one domain at a time.

plotStrata2d(framenew, aggr_strata,
             domain = 1,
             vars = c("X1", "X2"),
             labels = c("Total population", "Building area"))

 

Evaluation by simulation

To be confident about the quality of the solution found, we run a simulation that selects a desired number of samples from the frame to which the optimized stratification has been applied. The user can indicate the number of samples to be drawn.

eval3 <- evalSolution(frame = framenew,
                      outstrata = aggr_strata,
                      nsampl = 200)

For each drawn sample, the estimates of the Y variables are computed. Their means and standard deviations are also computed, in order to obtain the coefficient of variation and the relative bias of each variable in each domain.

eval3$coeff_var

eval3$rel_bias

 

We can also analyse the sampling distribution of the estimates of each target variable in a selected domain.


hist(eval3$est$Y1[eval3$est$dom == 1],
     xlab = "Y1", main = "Sampling distribution of the Y1 estimates in domain 1")
abline(v = mean(eval3$est$Y1[eval3$est$dom == 1]), col = "blue")
abline(v = mean(frame3$Y1[frame3$domainvalue == 1]), col = "red")

Final sample size adjustment

After the optimization step, the final sample size is the result of the allocation of units to the final strata. This allocation is such that the precision constraints are satisfied. In practice, three situations may occur:

  • the resulting sample size is acceptable;
  • the resulting sample size is too large, i.e. over the available budget;
  • the resulting sample size is smaller than the available budget allows, so the number of units can be increased.

In the first case no action is needed. In the second case the number of units has to be reduced, applying the same reduction rate in each stratum. In the third case we can increase the sample size, applying the same increase rate in each stratum. This increase/decrease process is iterative, because when the same rate is applied we may find that in some strata there are not enough units to add or to remove; in the end, the desired final sample size is approached. A simplified sketch of this proportional adjustment is given below.
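Conceptually, the proportional adjustment works roughly as in the following simplified sketch (an illustration only, not the actual implementation of the adjustment function):

# Simplified illustration of the proportional size adjustment
adjust_allocation <- function(n_h, N_h, target, minnum = 2) {
  ratio <- target / sum(n_h)            # the same rate is applied to every stratum
  new_n <- round(n_h * ratio)           # proportional increase / decrease
  new_n <- pmax(new_n, minnum)          # respect the minimum number of units per stratum
  new_n <- pmin(new_n, N_h)             # never allocate more units than the stratum contains
  new_n
}

# toy example: scale an allocation of 106 units down towards a target of 75
n_h <- c(40, 30, 20, 10, 6)             # current allocation by stratum
N_h <- c(500, 300, 100, 40, 8)          # stratum population sizes
adjust_allocation(n_h, N_h, target = 75)

Because of the rounding and of the per-stratum minimum, the result (here 74 units) does not necessarily hit the target exactly, which is the point made below about the difference between the desired and the adjusted sample size.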

Let's assume that the final sample size obtained (106 units) is over budget. We can reduce it by executing the following code.

reducedStrata <- adjustSize(size = 75, strata = aggr_strata)

Conversely, we may want to increase the sample size because the budget allows it.
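The corresponding code was omitted in the original; presumably it is the same call with a larger target size (the value 150 below is purely illustrative):

adjustedStrata <- adjustSize(size = 150, strata = aggr_strata)   # 150 is an illustrative target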

The difference between the desired sample size and the actual adjusted sample size depends on the number of strata in the optimized solution. The adjustment applies, in each stratum, the relative difference between the current and the desired sample size: this produces a real-valued allocation that has to be rounded, while respecting the minimum number of units per stratum (2 by default). The greater the number of strata, the greater the impact on the final adjusted sample size.

Once the sample size has been increased (or decreased), we can check what the new expected CVs are. With the second adjustment, i.e. the increase of the total sample size, we obtain the following.

expected_CV(adjustedStrata)

That’s a big reduction in expected CV.

Sample selection

Once the optimal stratification has been obtained, the sample can be selected from the version of the frame updated with the optimized strata.

samp <- selectSample(frame = framenew,
                     outstrata = aggr_strata)

A simple random sample is selected within each stratum.

A variant is systematic sampling. The only difference is in the way units are selected within each stratum, which is done by the following steps (see the small illustration after this list):

  • the selection interval is determined as the reciprocal of the sampling rate in the stratum, and a starting point is determined by randomly selecting a value inside this interval;
  • the selection proceeds by taking the unit in the position given by this starting value as the first one, and then selecting all the units whose positions are obtained by repeatedly adding the selection interval.
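As a small numeric illustration of the two steps above (plain R, independent of the selection function used below):

# Toy illustration of systematic selection within a single stratum
N_h <- 120                                            # stratum size
n_h <- 6                                              # units to select
step  <- N_h / n_h                                    # selection interval = 1 / sampling rate
start <- runif(1, min = 0, max = step)                # random starting point in the interval
selected <- ceiling(start + step * (0:(n_h - 1)))     # positions of the selected units
selected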

This selection method is useful when it is combined with a particular ordering of the frame, where the ordering variables can be treated as additional stratification variables. For example, we may consider that the industrial area (Airind) of a municipality is important.

samp_sys <- selectSampleSystematic(frame = framenew,
                                   outstrata = aggr_strata,
                                   sortvariable = "airind")   # 'airind' must be a column of the frame

Sampling with "take-all" strata

Together with the strata to be sampled, "take-all" strata may also be provided as input to the optimization step. These strata are not optimized like the others, but they contribute to the determination of the best stratification, because their units are surveyed with certainty and therefore a smaller number of sampled units is needed to satisfy the precision constraints.

In order to perform the optimization and the subsequent steps correctly, the whole input has to be preprocessed. The first step is to split the units to be censused from the units to be sampled, so as to build two different frames. For example, suppose we want to be sure that all the municipalities with a total population greater than 10,000 are included in the sample. We therefore split the sampling frame as follows.

#---- units to be censused
ind <- which(frame3$X1 > 10000)
framecens <- frame3[ind, ]
#---- units to be sampled
framesamp <- frame3[-ind, ]
nrow(framecens); nrow(framesamp)

In this way we have defined all the units with a population greater than 10,000 as units to be censused. At the end of the process, the sample will contain all of them.

We now run the optimization step, including the indication of the units to be censused.

set.seed(1234)
solution4 <- optimStrata(method = "continuous",
                         framesamp = framesamp,
                         framecens = framecens,
                         errors = cv)   # remaining arguments as in the previous runs

