The importance of data is undeniable, but how do you make it valuable?

For a full-stack veteran coder, development and R&D management constantly raise problems of prediction, decision, inference, classification, detection, and ranking. Faced with the challenge "Does your code still have bugs?", a rational answer is: we have executed a number of test cases, and the probability that a bug remains is only a fraction of a percent; in other words, we are 99.9 percent confident that there are no bugs in the current program. This is Bayesian thinking at work, the Bayesian method in everyday use. Whether we see it or not, it is there, glowing.

How about predicting whether the current software has bugs? Again, we start with Bayes’ theorem.

A shallow dive into Bayes' theorem

For veteran coders, the probabilistic statement of Bayes' theorem is clear and easy to follow. Recall from probability theory that the joint probability is commutative:

P(A and B) = P(B and A)

The joint probability is expanded with conditional probability:

P(A and B) = P(A) P(B|A)
P(B and A) = P(B) P(A|B)

Thus:

P(A) P(B|A) = P(B) P(A|B)

With a simple transformation, we get:

P(B|A) = P(A|B) P(B) / P(A)

And you're done. That's the magic of Bayes' theorem. Here:

  • P(B) is the prior probability — the probability of the hypothesis before the new data is seen;
  • P(B|A) is the posterior probability — the probability of the hypothesis computed after observing the new data;
  • P(A|B) is the likelihood — the probability of the data under the hypothesis;
  • P(A) is the normalizing constant — the probability of the data under any hypothesis.

You can also add a little spice: when computing P(A), use the addition theorem:

P(A) = P(A and B) + P(A and B_) = P(A|B) P(B) + P(A|B_) P(B_)

where B_ denotes the complement of B, i.e. the event that B does not occur. This approach is used in the article "The Idea of Bayesian Inference," which presents the results of Bayesian inference for estimating the relationship between tests and bugs.
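As a worked sketch of the formulas above (all numbers are hypothetical): let B be "the code has a bug" and A be "all tests pass". The addition theorem supplies the denominator P(A), and Bayes' theorem then updates our belief in a bug after the tests pass:

```python
# Hypothetical numbers for illustration only.
p_bug = 0.2                 # P(B): prior probability the code has a bug
p_pass_given_bug = 0.5      # P(A|B): tests pass even though a bug exists
p_pass_given_no_bug = 0.99  # P(A|B_): tests pass and there is no bug

# Addition theorem: P(A) = P(A|B) P(B) + P(A|B_) P(B_)
p_pass = p_pass_given_bug * p_bug + p_pass_given_no_bug * (1 - p_bug)

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
p_bug_given_pass = p_pass_given_bug * p_bug / p_pass
print(round(p_bug_given_pass, 4))  # the prior 0.2 drops to about 0.11
```

Passing tests cuts the bug probability roughly in half here; stronger tests would drive P(A|B) down and the posterior with it.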

The Bayesian method

The Bayesian approach is a very general reasoning framework: update your original belief about something with objective new information to arrive at a new, improved belief. Introducing prior uncertainty tolerates error in the initial inference; when updated evidence arrives, the initial inference is not abandoned but adjusted to fit the current evidence better.

But P(A|B) and P(B|A) are easily confused. Teacher Chen offers a key point for telling them apart: distinguish the "law" from the "phenomenon" — treat A as the law and B as the phenomenon, so Bayes' formula reads:

P(law | phenomenon) = P(phenomenon | law) P(law) / P(phenomenon)

Teacher Chen's articles "This Understanding of Bayes' Formula" and "Another Bayesian Application in Life" give several easy-to-follow examples, which will not be repeated here.

Back in the daily life of the coder, one tool we often use to improve a system is the AB test. An AB test is a statistical design pattern for measuring the difference between two treatments — for example, which of two websites leads to a higher conversion rate, where a conversion can be a purchase, a registration, or some other user behavior. The key constraint of an AB test is that only one difference is allowed between the groups. Post-experiment analysis is usually done with hypothesis tests, such as tests of difference in means or proportions, which involve Z-scores and the often-confusing p-values; the Bayesian method is more natural.

Model the conversion probabilities of websites A and B. Since a conversion rate lies between 0 and 1, the Beta distribution is a natural choice. If the prior is Beta(a1, b1) and X conversions are observed in N visits, the posterior distribution is Beta(a1 + X, b1 + N - X). Assuming the prior Beta(1, 1), which is the uniform distribution on [0, 1], the sample code is as follows:

from scipy.stats import beta

a1_prior = 1
b1_prior = 1

visitors_A = 12345
visitors_B = 1616
conversions_from_A = 1200  # number of conversions on site A
conversions_from_B = 150   # number of conversions on site B

posterior_A = beta(a1_prior + conversions_from_A,
                   b1_prior + visitors_A - conversions_from_A)
posterior_B = beta(a1_prior + conversions_from_B,
                   b1_prior + visitors_B - conversions_from_B)

# draw samples from each posterior with the rvs method
samples = 20000
samples_posterior_A = posterior_A.rvs(samples)
samples_posterior_B = posterior_B.rvs(samples)

# estimated probability that A's true conversion rate exceeds B's
print((samples_posterior_A > samples_posterior_B).mean())

Using a Bayesian approach, you start by thinking about how the data are produced:

  1. Which random variables best describe the statistics;
  2. Which parameters are needed to pin down those probability distributions;
  3. Whether parameters correspond to early behavior or late behavior, i.e. where the change points lie;
  4. What probability distributions the parameters themselves follow;
  5. How to choose those parameter distributions — falling back on a uniform distribution when little is known.
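The steps above can be sketched on a toy change-point problem (the counts and rates below are made up): daily event counts follow a Poisson distribution with an early rate lam1 and a late rate lam2, a uniform prior is placed on the change point tau, and the posterior over tau follows by enumeration:

```python
from math import exp, factorial

# made-up daily counts: low rate early, high rate late
counts = [2, 1, 3, 2, 2, 7, 8, 6, 9, 7]
lam1, lam2 = 2.0, 7.0   # rates assumed known for this sketch

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

def posterior_tau(counts, lam1, lam2):
    """Posterior over the change point tau under a uniform prior."""
    weights = []
    for tau in range(1, len(counts)):      # tau = index where lam2 takes over
        like = 1.0
        for i, k in enumerate(counts):
            like *= poisson_pmf(k, lam1 if i < tau else lam2)
        weights.append(like)
    z = sum(weights)                       # normalizing constant
    return [w / z for w in weights]

post = posterior_tau(counts, lam1, lam2)
# most probable change point under the posterior
print(max(range(1, len(counts)), key=lambda t: post[t - 1]))
```

In a real analysis lam1 and lam2 would themselves get priors (step 4 above); fixing them keeps the sketch short.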

The choice of prior and posterior depends on the application scenario. Besides the usual distributions, common priors include:

  • the Gamma distribution, a generalization of the exponential random variable;
  • the Wishart distribution, a distribution over positive-definite matrices and an appropriate prior for a covariance matrix;
  • the Beta distribution, whose random variable is defined between 0 and 1, making it a popular choice for probabilities and ratios;
  • the power-law distribution, which fits relationships such as that between company size and the number of companies.

The AB test above uses the Beta distribution, applying the principle that a Beta prior combined with binomially distributed observations yields a Beta posterior.
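A quick check of that conjugacy principle, with hypothetical counts: the closed-form posterior mean (a1+X)/(a1+b1+N) should match the mean of the Beta posterior and lie between the prior mean and the raw conversion rate:

```python
from scipy.stats import beta

a1, b1 = 1, 1        # uniform Beta(1, 1) prior
N, X = 1000, 97      # hypothetical visits and conversions

posterior = beta(a1 + X, b1 + N - X)

prior_mean = a1 / (a1 + b1)                 # 0.5
data_rate = X / N                           # 0.097
posterior_mean = (a1 + X) / (a1 + b1 + N)   # closed-form posterior mean

# the posterior mean is pulled from the prior toward the data,
# and matches the Beta(a1+X, b1+N-X) distribution's own mean
print(posterior_mean, posterior.mean())
```

With N this large the prior barely matters; with small N the uniform prior shrinks the estimate noticeably toward 0.5.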

When faced with causal relationships among multiple objects, the Bayesian method evolves into the Bayesian network.

Bayesian network

The Bayesian network was proposed to handle uncertainty and incompleteness, and it has been widely applied in many fields. A Bayesian network is a graphical network based on probabilistic inference, and Bayes' formula is the foundation of this probabilistic network. Each node in a Bayesian network represents a random variable with practical meaning that must be designed by hand, and each edge between nodes represents an uncertain causal relationship. For example, if node E directly affects node H, i.e. E→H, then a directed arc (E, H) is drawn from E to H with the arrow pointing from E to H, and its weight (connection strength) is the conditional probability P(H|E).

In fact, if the relationships among things can be linked in a chain, we get a special case of the Bayesian network — the Markov chain. Seen from the other direction, a Bayesian network is a nonlinear extension of the Markov chain. In a Bayesian network, once evidence appears at one node, the probabilities of events throughout the network change.

Simply put, since there may be dependencies among multiple variables, a Bayesian network spells out the joint conditional probability distribution and allows conditional independence to be defined between subsets of variables. Using a Bayesian network is similar to using the Bayesian method:

  1. Build the network structure, a directed acyclic graph (DAG) over multiple discrete variables;
  2. Set or learn the parameters, i.e. traverse the DAG and compute the conditional probability table of each node;
  3. Run network inference to obtain confidence probabilities for the causal relationships;
  4. Act on the inference results.

Take the detection of fake accounts in a social network as an example. First determine the random variables in the network:

  • the authenticity of the account, A;
  • the authenticity of the profile picture, H;
  • the density of posts (logs), L;
  • the density of friends, F.

Given observed values for H, L, and F, and assigning random values to A, we get

P(A|H, L, F) = P(H|A) P(L|A) P(F|A, H)

You can then try to apply the result in the social network. This example is explained in more detail in the article "Algorithmic Grocery Store: Bayesian Networks for Classification Algorithms."
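That fake-account network is small enough to evaluate by brute-force enumeration. In the sketch below every conditional probability table is made up for illustration; posterior_A normalizes over the two values of A, with the normalizing constant playing the role of P(H, L, F):

```python
# Hypothetical CPTs for the fake-account network; A = account is real.
p_A = {True: 0.9, False: 0.1}              # P(A)
p_H_given_A = {True: 0.8, False: 0.4}      # P(H=real picture | A)
p_L_given_A = {True: 0.7, False: 0.2}      # P(L=normal post density | A)
# F depends on both A and H: P(F=normal friend density | A, H)
p_F_given_AH = {(True, True): 0.9, (True, False): 0.7,
                (False, True): 0.5, (False, False): 0.1}

def posterior_A(h, l, f):
    """P(A | H=h, L=l, F=f) by enumeration over the two values of A."""
    def joint(a):
        ph = p_H_given_A[a] if h else 1 - p_H_given_A[a]
        pl = p_L_given_A[a] if l else 1 - p_L_given_A[a]
        pf = p_F_given_AH[(a, h)] if f else 1 - p_F_given_AH[(a, h)]
        return p_A[a] * ph * pl * pf
    z = joint(True) + joint(False)   # normalizing constant P(H, L, F)
    return joint(True) / z

print(posterior_A(h=True, l=True, f=True))    # evidence consistent with a real account
print(posterior_A(h=False, l=False, f=False)) # evidence consistent with a fake one
```

Enumeration is exponential in the number of nodes, which is why real Bayesian-network libraries use smarter inference; for four nodes it is perfectly adequate.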

It can be said that the Bayesian method has swept through probability theory and extended into all kinds of problem domains: wherever probabilistic prediction is needed, the shadow of the Bayesian method can be seen. In particular, how can the Bayesian method help machine learning?

Bayes and machine learning

Machine learning is very popular in industry, and there too we meet problems of prediction, decision making, classification, and detection — so the Bayesian method is very useful in machine learning as well.

Machine learning offers many models, linear and nonlinear, and the Bayesian method can make predictions with any of them. In other words, for a given scene there are infinitely many possible models, which can be described by a probability distribution. Given a prior over hypotheses, predicting a new sample means computing its likelihood under each model and integrating against the posterior distribution derived earlier; the predictive distribution of the sample is its likelihood averaged over all possible models.

Model selection and comparison is another common problem in machine learning. For a classification problem, should we use a linear model or a nonlinear deep-learning model? The Bayesian approach looks like this: let A denote one class of models, say the linear ones, and B denote another class, say the nonlinear ones. On the same data set X, compute the likelihoods Ma and Mb of the observed training set under A and B, then compare Ma and Mb — this is the basic Bayesian rule for model selection.
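This rule can be sketched with two toy models whose marginal likelihoods have closed forms (the counts below are made up): model A fixes the conversion rate at 0.5, while model B places a uniform Beta(1, 1) prior on it, under which the marginal likelihood of observing any X out of N is exactly 1/(N+1):

```python
from math import comb

N, X = 100, 70   # hypothetical: 70 conversions in 100 visits

# Model A: conversion rate fixed at 0.5 -> plain binomial probability
Ma = comb(N, X) * 0.5 ** N

# Model B: uniform Beta(1, 1) prior on the rate; integrating the
# binomial likelihood over the prior gives 1 / (N + 1) for any X
Mb = 1 / (N + 1)

print(Ma, Mb)   # Mb >> Ma: the flexible model explains 70/100 far better
```

The ratio Mb/Ma is the Bayes factor; for data close to 50/100 it would instead favor the simpler fixed-rate model.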

In fact, Bayes' theorem is a criterion for information processing: the input is a prior distribution and a likelihood function, and the output is a posterior distribution. The models of machine learning themselves can also be improved with Bayesian methods, such as the Bayesian SVM, the Bayesian Gaussian process, and so on.

In addition, Bayesian methods are useful for deep learning, at least for hyperparameter tuning. In a neural network, per-layer parameters such as the size and number of convolution kernels are not optimized automatically by the model and must be specified by hand — this is where Bayesian optimization comes in.

Sigh — a coder who does not know Bayes knows data in vain!


Other references:
  • Bayesian Methods: Probabilistic Programming and Bayesian Inference
  • Bayesian Thinking: A Python Approach to Statistical Modeling
  • The Beauty of Mathematics: The Mundane and Magical Bayesian Method
  • Bayesian Method for Machine Learning www.cs.toronto.edu/~radford/ftp/bayes-tut.pdf
