Sampling is the process of drawing sample points from a given probability distribution. Sampling has important applications in machine learning: it can simplify a complex distribution into discrete sample points; resampling can be used to adjust a sample set so that later model learning works better; it can be used in stochastic simulation to approximately solve or perform inference on complex models. In addition, sampling helps people quickly and intuitively understand the structure and characteristics of data in data visualization.
1. The role of sampling
Sampling is, in essence, the simulation of random phenomena: according to a given probability distribution, a corresponding random event is simulated. Sampling allows people to understand random events, and how they occur, more intuitively.
Sampling can also be regarded as a nonparametric model: it approximates the population distribution with a limited number of sample points and characterizes the uncertainty in the population distribution. Sampling is also a kind of information dimensionality reduction, which can simplify a problem.
Resampling the current data can make full use of the existing data set to mine more information, as in the bootstrap and jackknife methods. In addition, resampling techniques can be used to deliberately change the distribution of the samples while preserving specific information (so that the target information is not lost), so as to suit subsequent model training and learning.
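As a quick illustration of the bootstrap idea, here is a minimal Python sketch (the data set and the statistic are hypothetical, chosen only for the example) that estimates the standard error of the sample mean by resampling with replacement:

```python
import numpy as np

# Minimal bootstrap sketch: resample the data with replacement many
# times and use the spread of the resampled means as an estimate of
# the standard error of the sample mean.
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)  # hypothetical data set

boot_means = [
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2_000)
]
print(np.std(boot_means))  # bootstrap estimate of the standard error
```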
Many models have complex structures and latent variables, so the corresponding solution formulas are very complicated, with no explicit analytical solution, which makes exact solution or inference difficult. Sampling methods can be used for stochastic simulation to approximately solve or perform inference on such complex models. The problem is generally converted into computing the integral or expectation of some function under a specific distribution, or into finding the posterior distribution of some random variables or parameters given the data.
In practice, we generally extract a subset from the population to approximate the population distribution; this subset is called the “training set”. The purpose of model training is then to minimize the loss function on the training set. After training is completed, another data set is required to evaluate the model, known as the “test set”.
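A minimal sketch of such a random split (the 80/20 ratio and the arrays are assumed purely for illustration):

```python
import numpy as np

# Randomly split a data set into training and test subsets.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))       # hypothetical features
y = rng.integers(0, 2, size=1_000)    # hypothetical 0/1 labels

idx = rng.permutation(len(X))         # shuffle the sample indices
cut = int(0.8 * len(X))               # assumed 80/20 split
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
y_train, y_test = y[idx[:cut]], y[idx[cut:]]
```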
Some more advanced uses of sampling include resampling the data multiple times to estimate the bias and variance of a statistic, and changing the distribution of the samples while keeping the target information unchanged so as to suit model training and learning (a classic application is addressing the problem of sample imbalance).
2. Common sampling algorithms
- Inverse transform sampling
In cases where a distribution is difficult to sample from directly, a function transformation can be used. If the random variables x and u are related by the transformation u = ϕ(x), their probability density functions satisfy:
p(u) |ϕ′(x)| = p(x)
Therefore, if it is difficult to sample x from the target distribution p(x), you can construct a transformation u = ϕ(x) such that it is easier to sample u from the transformed distribution p(u), and then obtain x indirectly by sampling u and applying the inverse function. For a higher-dimensional random variable, ϕ′(x) corresponds to the Jacobian determinant.
Moreover, if the transformation ϕ(·) is the cumulative distribution function of x, we obtain what is called Inverse Transform Sampling. Assume that the probability density function of the target distribution is p(x); then its cumulative distribution function is:

F(x) = ∫₋∞ˣ p(t) dt
The process of inverse transform sampling:

1) Generate a random number uᵢ from the uniform distribution U(0,1);

2) Compute the inverse function xᵢ = F⁻¹(uᵢ) to obtain x indirectly.
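As a concrete illustration, here is a minimal sketch of inverse transform sampling for the exponential distribution (chosen only because its CDF F(x) = 1 − e^(−λx) has the closed-form inverse F⁻¹(u) = −ln(1 − u)/λ):

```python
import numpy as np

# Inverse transform sampling for Exp(lam): draw u ~ U(0,1), then map
# it through the closed-form inverse CDF to get a sample of x.
rng = np.random.default_rng(0)

def sample_exponential(lam, n):
    u = rng.uniform(0.0, 1.0, size=n)   # step 1: u_i ~ U(0, 1)
    return -np.log(1.0 - u) / lam       # step 2: x_i = F^{-1}(u_i)

xs = sample_exponential(lam=2.0, n=100_000)
print(xs.mean())  # should be close to 1/lam = 0.5
```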
However, the inverse of the cumulative distribution function of the target distribution cannot always be solved analytically (or may be hard to compute). In such cases, the inverse transform sampling method is not applicable, and Rejection Sampling or Importance Sampling can be considered instead.
- Rejection sampling
Rejection sampling, also known as accept-reject sampling, works as follows: for the target distribution p(x), choose a reference distribution q(x) that is easy to sample from, together with a constant M, such that for any x:

p(x) ≤ M · q(x)
The sampling process is as follows:
1) Draw a sample xᵢ at random from the reference distribution q(x);

2) Generate a random number uᵢ from the uniform distribution U(0,1);

3) If uᵢ ≤ p(xᵢ) / (M · q(xᵢ)), the sample xᵢ is accepted; otherwise it is rejected. Steps 1-3 are repeated until the number of accepted samples meets the requirement.
In fact, the key to rejection sampling is to choose an appropriate envelope function M · q(x) for the target distribution p(x): the more tightly the envelope fits, the higher the acceptance rate and the more efficient the sampling. A normal distribution is a common choice of envelope.
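Below is a minimal sketch of the three steps above; the Beta(2, 5) target, the uniform proposal, and the bound M = 2.5 are all choices made just for this example (the Beta(2, 5) density peaks at about 2.46, so M = 2.5 is a valid envelope over a U(0, 1) proposal):

```python
import numpy as np

# Rejection sampling sketch: target p = Beta(2, 5), proposal q = U(0, 1),
# envelope constant M chosen so that p(x) <= M * q(x) on [0, 1].
rng = np.random.default_rng(0)

def p(x):
    # Beta(2, 5) density; the normalizing constant 1/B(2, 5) equals 30
    return 30.0 * x * (1 - x) ** 4

M = 2.5  # valid bound: max of p on [0, 1] is about 2.46

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.uniform(0, 1)        # step 1: draw x from q
        u = rng.uniform(0, 1)        # step 2: draw u from U(0, 1)
        if u <= p(x) / (M * 1.0):    # step 3: accept with prob p(x)/(M q(x))
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(10_000)
print(xs.mean())  # should be close to the Beta(2, 5) mean 2/7 ≈ 0.286
```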
- Importance sampling
In addition, much of the time the ultimate purpose of sampling is not to generate the samples themselves but to carry out some downstream task, such as estimating the value of a variable, usually in the form of an expectation. Importance sampling is used to compute the integral (expectation) of a function f(x) under the target distribution p(x), i.e.

E[f] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx ≈ (1/N) Σᵢ f(xᵢ) p(xᵢ)/q(xᵢ),

where the samples xᵢ are drawn from the reference distribution q(x) and p(xᵢ)/q(xᵢ) is the importance weight.
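A minimal sketch of this estimator, assuming a standard normal target p, a wider normal N(0, 2²) proposal q, and f(x) = x² purely for illustration:

```python
import numpy as np

# Importance sampling sketch: estimate E_p[x^2] under p = N(0, 1)
# using samples from the proposal q = N(0, 2^2), reweighted by p/q.
rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

n = 100_000
x = rng.normal(0.0, 2.0, size=n)                   # x_i ~ q
w = normal_pdf(x, 0, 1) / normal_pdf(x, 0, 2)      # importance weights p/q
estimate = np.mean(w * x**2)                       # (1/N) sum f(x_i) w(x_i)
print(estimate)  # should be close to E_p[x^2] = 1
```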
3. Markov chain Monte Carlo sampling
In high-dimensional spaces it is difficult to find a suitable reference distribution for rejection sampling or importance sampling, and the sampling efficiency becomes very low. In that case, the Markov Chain Monte Carlo (MCMC) sampling method can be considered.
The MCMC sampling method involves two “MC”s: Monte Carlo and Markov chain. Monte Carlo refers to numerical approximation based on sampling, while the Markov chain is what performs the sampling. The basic idea of MCMC is as follows: given a target distribution to sample from, construct a Markov chain whose stationary distribution is that target distribution; then, starting from any initial state, keep making state transitions along the Markov chain. The resulting sequence of states converges to the target distribution and thus yields a series of samples from it.
MCMC can use different Markov chains, and different chains correspond to different sampling methods. The two common ones are the Metropolis-Hastings sampling method and the Gibbs sampling method.
- Metropolis-Hastings sampling method
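A minimal sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal; the unnormalized two-component mixture target is chosen only for the example. With a symmetric proposal the acceptance ratio reduces to p̃(x′)/p̃(x):

```python
import numpy as np

# Metropolis-Hastings sketch: random-walk proposal, unnormalized target.
rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized target: a two-component Gaussian mixture
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_samples, step=1.0, x0=0.0):
    x = x0
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.normal(0.0, step)              # propose x' ~ N(x, step^2)
        accept_prob = min(1.0, p_tilde(x_new) / p_tilde(x))
        if rng.uniform() < accept_prob:                # accept, else keep old state
            x = x_new
        samples.append(x)
    return np.array(samples)

chain = metropolis_hastings(50_000)
print(chain[10_000:].mean())  # roughly 0 by the symmetry of the mixture
```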
- Gibbs sampling method
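A minimal Gibbs-sampling sketch for a bivariate standard normal with correlation ρ, where each full conditional is a one-dimensional normal, x₁ | x₂ ~ N(ρx₂, 1 − ρ²) and symmetrically for x₂ | x₁ (the value ρ = 0.8 is assumed for the example):

```python
import numpy as np

# Gibbs sampling sketch: alternately sample each coordinate from its
# full conditional distribution given the current value of the other.
rng = np.random.default_rng(0)
rho = 0.8

def gibbs(n_samples, x1=0.0, x2=0.0):
    samples = []
    sd = np.sqrt(1.0 - rho ** 2)
    for _ in range(n_samples):
        x1 = rng.normal(rho * x2, sd)  # sample x1 from p(x1 | x2)
        x2 = rng.normal(rho * x1, sd)  # sample x2 from p(x2 | x1)
        samples.append((x1, x2))
    return np.array(samples)

chain = gibbs(50_000)
print(np.corrcoef(chain[5_000:].T)[0, 1])  # should be close to rho = 0.8
```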
4. Sampling of imbalanced samples
We encounter many imbalanced data sets in real-world modeling, such as CTR models, marketing models, and anti-fraud models, where the bad samples (or good samples) make up only a few percent. Although some machine learning algorithms, such as XGBoost, can help with the imbalance problem, in most cases we still need to sample the data according to the actual business situation, mainly in two ways:
Over-sampling: repeatedly sample at random, with replacement, from the minority class so that the class distribution of the final sample set is not too imbalanced;
Under-sampling: randomly select only some of the samples from the majority class so that the final sample set is not imbalanced;
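A minimal sketch of both strategies with plain NumPy, assuming a feature matrix X and 0/1 labels y with class 1 in the minority (the array names and the binary labels are assumptions for the example):

```python
import numpy as np

# Random over-/under-sampling sketch for a binary classification set.
rng = np.random.default_rng(0)

def oversample(X, y):
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    # draw minority indices with replacement up to the majority count
    extra = rng.choice(minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

def undersample(X, y):
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    # keep only as many majority samples as there are minority samples
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, kept])
    rng.shuffle(idx)
    return X[idx], y[idx]
```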
Refer to the video: www.bilibili.com/video/BV1fh…