0x00 Abstract
Using as few mathematical formulas as possible, this article explains maximum likelihood estimation and maximum a posteriori estimation through intuition, and draws several examples from a famous classic to show how these two estimators are applied, which turns out to be quite interesting.
0x01 Background
1. Probability vs. statistics
Probability and statistics seem to be two similar concepts, but the problems they study are in fact opposite.
1.1 Probability
Probability studies how likely an event is to occur once the model and its parameters are known.
Probability is deterministic, an ideal value: according to the law of large numbers, as the number of trials approaches infinity, the observed frequency converges to the probability.
The frequentist school believes the world is deterministic and that the parameter θ is a fixed value when modeling, so its approach is to model the event itself directly.
1.2 Statistics
Statistics starts from given observed data and uses that data to build a model and estimate its parameters.
In plain terms, statistics obtains the corresponding model and its describing parameters from the observed data (for example, assuming the model is Gaussian and then estimating its specific parameters such as μ and σ).
One-sentence summary: probability goes from a known model and parameters to the data; statistics goes from known data to the model and parameters.
2. Frequentists vs. Bayesians
The frequentist school and the Bayesian school have fundamentally different views of the world.
2.1 The frequentist and Bayesian schools discuss "uncertainty" from different starting points
The frequentist school believes the world is fixed: there is an underlying truth whose value is constant, and our goal is to find that true value or its range. The Bayesian school believes the world is uncertain: people hold a prior belief about the world and then adjust that belief with observed data, and our goal is to find the probability distribution that best describes the world.
2.2 The frequentist and Bayesian schools approach problems from different angles
In terms of "nature", the frequentist school tries to model the "event" itself directly: the frequency of an event in independent repeated trials tends to a limit, and that limit is the probability of the event.
The Bayesian school does not try to describe the "event" itself but rather the "observer". It does not claim that "the event itself is random" or that "the world has some intrinsic randomness"; the theory says nothing about the ontology of the world. Instead, starting only from the observer's incomplete knowledge, it constructs a set of methods for reasoning about uncertain knowledge within the framework of probability theory.
3. Probability function vs likelihood function
Probability: parameters + observation --> outcome
Likelihood: observation + outcome --> parameters
Suppose we have a function P(x|θ), where θ is the parameter to be estimated and x is the specific data or sample.
3.1 Probability Function
If θ is known and fixed and x is the variable, this function is called the probability function; it describes the probability that different sample points x occur.
The probability function corresponds to the case where the model and parameters are known and we analyze or predict events: given the parameters, it predicts the outcome of subsequent observations.
3.2 Likelihood function
If x is known and θ is the variable, this function is called the likelihood function; it describes the probability of the sample point x occurring under different model parameters θ. In this case the function is written L(θ|x), L(x;θ), or f(x;θ).
The likelihood function is a function of the parameters of a statistical model; it expresses how plausible (how likely) each parameter value is. Given a set of observed data, it is used to estimate the parameters related to the properties of the thing being modeled, that is, to analyze and infer the model's parameters from specific sample data.
Maximum likelihood therefore means finding the model parameters that are most plausible given the data.
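As a minimal illustration of the two readings, the sketch below (assuming a binomial model of 10 coin tosses; the observed count of 7 heads is purely illustrative) evaluates the same function P(x|θ) first with θ fixed and x varying, then with x fixed and θ varying.

```python
# A minimal sketch of the two views of P(x | theta), assuming a binomial model
# with n = 10 tosses; the numbers are only illustrative.
from scipy.stats import binom

n = 10

# Probability view: theta is fixed and known, x varies.
theta = 0.5
for x in (2, 5, 7):
    print(f"P(x={x} | theta={theta}) = {binom.pmf(x, n, theta):.4f}")

# Likelihood view: x is fixed (we observed 7 heads), theta varies.
x_obs = 7
for theta in (0.3, 0.5, 0.7):
    print(f"L(theta={theta} | x={x_obs}) = {binom.pmf(x_obs, n, theta):.4f}")
```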
4. Parameter estimation
Parameter estimation is a kind of statistical inference: the process of estimating the unknown parameters of a population distribution from random samples drawn from that population.
"Machine learning" is essentially the process of condensing large amounts of data into a small number of parameters, and "training" is the process of estimating those parameters.
The ultimate problem of modern machine learning is optimizing an objective function; MLE and MAP are the basic ideas used to construct that function.
- Maximum Likelihood Estimation (MLE) is the parameter estimation method commonly used by the frequentist school.
- Maximum A Posteriori estimation (MAP) is the parameter estimation method commonly used by the Bayesian school.
When modeling an event, θ denotes the parameters of the model, and solving the problem essentially comes down to finding θ. So:
4.1 Frequentist school
The frequentist school holds that there is a unique, fixed true value θ.
4.2 Bayesian school
The Bayesian school holds that θ is a random variable that follows some probability distribution. That is, the model parameter θ is not regarded as a fixed value; rather, θ itself obeys some underlying distribution.
In the Bayesian view there are two inputs and one output: the inputs are the prior and the likelihood, and the output is the posterior.
The prior, P(θ), is the judgment about θ before any data is observed;
The likelihood, P(x|θ), is the probability of observing the data x assuming θ is known;
The posterior, P(θ|x), is the final distribution of the parameter.
That is, modeling starts from an initial estimate of the event (the prior probability), which is then adjusted according to the observed data.
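The three quantities are tied together by Bayes' rule; written out in standard notation, with P(x) as the normalizing constant:

```latex
P(\theta \mid x) \;=\; \frac{P(x \mid \theta)\,P(\theta)}{P(x)},
\qquad
P(x) \;=\; \int P(x \mid \theta)\,P(\theta)\,d\theta
```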
0x02 Maximum Likelihood Estimation (MLE)
1. The idea
Maximum likelihood estimation is a "model known, parameters unknown" method: using the observed sample results and the assumed model, it infers the parameter values that are most likely to have produced those results.
The idea of maximum likelihood estimation: the parameter that maximizes the probability of the observed data (the sample) occurring is the best parameter.
In plain terms, it is the "most-alike" estimation method (the most likely estimate): the event with the highest probability is the one most likely to happen.
Maximum likelihood estimation is a typical frequentist viewpoint. Its basic idea: the parameter θ to be estimated exists objectively but is unknown. When θ_MLE is such that, with θ = θ_MLE, the observed samples (X1, X2, ..., Xn) = (x1, x2, ..., xn) are most easily observed, we call θ_MLE the maximum likelihood estimate of θ. In other words, the estimate θ_MLE makes the observed event most likely to occur.
2. Likelihood function
Suppose the distribution is P = p(x; θ), where x is the observed sample, θ is the parameter to be estimated, and p(x; θ) denotes the probability of x occurring when the parameter is θ. Then, for sample values x1, x2, ..., xn:
L(θ) = L(x1, x2, ..., xn; θ) = p(x1|θ) · p(x2|θ) · ... · p(xn|θ)  (a continued product)
Here L(θ) is called the likelihood function of the sample. If some θ̂ maximizes the value of L(θ), then θ̂ is called the maximum likelihood estimate of the parameter θ.
The value that maximizes L(θ) is the maximum likelihood estimate of the parameter.
The problem of maximum likelihood estimation thus becomes finding the extremum of the likelihood function.
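As a minimal sketch (assuming Bernoulli observations xi in {0, 1} and a coin-bias parameter θ; the data below is purely illustrative), the likelihood is just this product evaluated over the sample:

```python
# Likelihood L(theta) = prod_i p(x_i | theta) for Bernoulli observations.
import numpy as np

def likelihood(theta, xs):
    xs = np.asarray(xs)
    return np.prod(theta ** xs * (1 - theta) ** (1 - xs))

xs = [1, 1, 0, 1, 0, 1, 1]           # 5 heads, 2 tails (illustrative data)
for theta in (0.3, 0.5, 5 / 7):
    print(f"L({theta:.3f}) = {likelihood(theta, xs):.6f}")   # largest at 5/7
```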
3. Likelihood function transformation
Preconditions
To apply maximum likelihood estimation, the samples must satisfy certain preconditions: the distribution of the training samples represents the true distribution of the data, the samples in the sample set are independent and identically distributed random variables, and there are sufficiently many training samples.
Logarithmic likelihood function
For an independent and identically distributed sample set, the likelihood of the whole sample is the product of the likelihoods of the individual samples. Computing such a long product is cumbersome and leads to the following problems:
- Underflow: multiplying many very small numbers together can give a result so small that it underflows.
- Floating-point rounding: if the program rounds at the corresponding decimal place, the result may become 0.
To solve these problems, we usually take the logarithm of the likelihood function, converting it into a log-likelihood function.
Converting to a logarithmic likelihood function has the following benefits:
- Taking the logarithm does not affect the convexity of the function: since ln is monotonically increasing, the log of the probability attains its maximum at the same point as the original probability function, so the extremum point is unchanged.
- Easier differentiation: by the earlier formula, the likelihood function is a long product, which is awkward to differentiate, whereas the logarithm makes it convenient. Since ln(a^b) = b·ln(a) and ln(ab) = ln(a) + ln(b), the product of probabilities in the formula becomes a sum of log-probabilities, which is easy to differentiate.
Since the likelihood function is differentiable, its stationary points can be found by setting the derivative to zero, and the maximum can then be computed.
If the log-likelihood function is simple, it can be differentiated directly; but more often we need to solve it with optimization algorithms such as gradient descent. Most optimization toolkits minimize functions by default, so don't forget to multiply your log-likelihood by -1 to turn it into a negative log-likelihood before handing it to an optimization toolkit.
This is why many articles write the objective as the negative log-likelihood, NLL(θ) = -∑ log p(xi|θ).
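A minimal sketch of that workflow, assuming Bernoulli data and SciPy's bounded scalar minimizer (the data is illustrative):

```python
# Hand the negative log-likelihood to an optimization toolkit.
import numpy as np
from scipy.optimize import minimize_scalar

xs = np.array([1, 1, 0, 1, 0, 1, 1])      # illustrative coin flips

def neg_log_likelihood(theta):
    # -sum_i log p(x_i | theta); clipping keeps log() finite at the boundaries
    theta = np.clip(theta, 1e-9, 1 - 1e-9)
    return -np.sum(xs * np.log(theta) + (1 - xs) * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(0.0, 1.0), method="bounded")
print(res.x)   # about 0.714, i.e. 5/7, the sample mean
```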
4. Example
Here’s a classic example from the Internet:
Suppose there is a jar containing black and white balls; the number of balls and the ratio of the two colors are both unknown. We want to know the proportion of white balls to black balls in the jar, but we cannot count every ball. What we can do is draw one ball at a time, record its color, and put it back. Repeating this process, we can use the recorded colors to estimate the proportion of black and white balls in the jar.

If 70 of the first 100 draws were white, what is the most likely proportion of white balls in the jar? Many people will answer 70% right away. What is the rationale behind that?

Let the proportion of white balls in the jar be p, so the proportion of black balls is 1 - p. Because each ball is put back and the jar is shaken well before the next draw, the color of each drawn ball is independent and identically distributed. Call the color drawn in one attempt a sample. In this problem, the probability of drawing 70 white balls in 100 draws is P(Data | M), where Data is all the observed data and M is the model, which says that each drawn ball is white with probability p.

If the result of the first draw is x1, the result of the second draw is x2, and so on, then Data = (x1, x2, ..., x100), and

P(Data | M) = P(x1, x2, ..., x100 | M) = P(x1 | M) P(x2 | M) ... P(x100 | M) = p^70 (1 - p)^30.

For what value of p is P(Data | M) largest? Take the derivative of p^70 (1 - p)^30 with respect to p and set it to zero:

70 p^69 (1 - p)^30 - 30 p^70 (1 - p)^29 = 0.

Solving this equation gives p = 0.7. At the boundary points p = 0 and p = 1, P(Data | M) = 0, so P(Data | M) attains its maximum at p = 0.7. This agrees with the common-sense answer of using the proportion in the sample.
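A minimal numerical check of the example (a grid search over p, nothing more):

```python
# The likelihood p^70 * (1 - p)^30 is maximized at p = 0.7, the sample proportion.
import numpy as np

p = np.linspace(0.0, 1.0, 10001)
likelihood = p ** 70 * (1 - p) ** 30
print(p[np.argmax(likelihood)])   # 0.7
```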
5. Solving steps of maximum likelihood estimation:
- Determine the likelihood function
- Convert the likelihood function to a logarithmic likelihood function
- Find the maximum of the logarithmic likelihood function (take the derivative and solve the likelihood equation)
Does maximum likelihood estimation always give an exact solution? In short, no. More often, in real-world scenarios, the derivative of the log-likelihood function is still analytically intractable (that is, difficult or impossible to solve by hand). In such cases, iterative methods such as the expectation-maximization (EM) algorithm are generally used to find numerical solutions for the parameter estimates, but the general idea is the same.
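When a closed form does exist, the three steps collapse to a couple of lines. The sketch below assumes i.i.d. Gaussian data (a synthetic sample, purely for illustration), for which setting the derivative of the log-likelihood to zero gives the sample mean and the biased sample variance:

```python
# MLE for a Gaussian: the three steps have a closed-form answer.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=2.0, scale=1.5, size=1000)    # synthetic sample

mu_mle = xs.mean()                                # maximizes the log-likelihood in mu
sigma2_mle = ((xs - mu_mle) ** 2).mean()          # maximizes it in sigma^2 (divides by n)
print(mu_mle, np.sqrt(sigma2_mle))                # close to the true 2.0 and 1.5
```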
6. Maximum likelihood estimation in the Water Margin
Maximum likelihood estimation is a commonly used principle, and I also found applications of it in the Water Margin. Here are a few.
Interestingly, the examples all involve dutou (constables) of the Northern Song dynasty: one is Lei Heng, constable of Yuncheng County, and the other is Wu Song, constable of Qinghe County.
This says something about the working conditions of grassroots policemen in the Northern Song: without modern scientific instruments or theory to help them, they could only rely on the magic weapon of "maximum likelihood estimation" to make the most likely and most effective judgment at the first moment. This is clearly different from Shi Xiu, the small merchant who applies "maximum a posteriori probability" later in this article.
6.1 The Winged Tiger Lei Heng catches the Red-Haired Ghost Liu Tang
As before, suppose the distribution is P = p(x; θ), where x is the observed sample, θ is the parameter to be estimated, and p(x; θ) is the probability of x occurring when the parameter is θ.
θ = who Liu Tang is; the possible values are roughly: ordinary person / thief / officer...
x = the temple has no temple keeper, the door is not barred, and a big man is sleeping there alone at night.
Lei Heng, an experienced old constable, immediately made the most likely judgment:
θ = "Liu Tang is a thief".
Chapter 12: The Green-Faced Beast contends in arms at the Northern Capital; the Urgent Vanguard vies for merit at Dongguo
The story goes that that night Lei Heng led twenty soldiers out of the east gate to patrol the villages, made his rounds, and on the way back passed the hill of Dongxi Village, picked the red leaves, and came down toward the village. They had gone barely two or three li when they reached the Lingguan Temple and saw that its door was not barred. Lei Heng said: "This temple has no temple keeper, yet the door is not barred. Could there be an evildoer inside? Let us go in and have a look." They went in together with torches, and there on the altar lay a big man fast asleep. The weather was hot; the man had rolled his old clothes into a pillow under his head and was sleeping soundly on the altar. Lei Heng looked at him and said: "Strange, strange! The county magistrate is a prophet indeed! There really are thieves in Dongxi Village!" With a great shout, the man was seized just as he was about to struggle. The twenty soldiers marched him out of the temple, bound him with a rope, and took him off to the security chief's manor.
6.2 Wu Song meets Jiang the Door God for the first time
As before, suppose the distribution is P = p(x; θ), where x is the observed sample, θ is the parameter to be estimated, and p(x; θ) is the probability of x occurring when the parameter is θ.
θ = who the big man is; the possible values are: Jiang the Door God / the shopkeeper / the shopkeeper next door...
x = a big man, built like a vajra guardian, is lying in front of Jiang the Door God's tavern enjoying the cool.
Wu the Second immediately made the maximum likelihood judgment: this man is cooling off in front of Jiang the Door God's tavern and is built like a vajra guardian, so he must be Jiang Zhong, the Door God.
θ = "the big man is Jiang the Door God".
Chapter 28: Shi En again dominates the Mengzhou road; Wu Song drunkenly beats Jiang the Door God
By now the wine was rising in Wu Song, and he threw open his shirt; though he carried only fifty to seventy percent of his drink, he pretended to be completely drunk, lurching forward and backward and reeling from side to side. Before they reached the grove, the attendant pointed and said: "The T-junction just ahead is Jiang the Door God's tavern." Wu Song said: "Since we have arrived, go and hide yourself far away. When I have knocked him down, then come."
Wu Song hurried past the back of the grove and saw a big man, built like a vajra guardian, wearing a white cloth shirt, who had set out a chair and was sitting under a green locust tree enjoying the cool. Wu Song pretended to be drunk and sized him up with sidelong glances, thinking to himself: "This big man must be Jiang the Door God." He headed straight on. Before he had gone thirty or fifty paces he saw a tavern at the T-junction; in front of the eaves stood a pole with a wine banner hanging from it, on which were written four characters: "He Yang Feng Yue" (the wind and moon of Heyang).
6.3 Wu Song kills Wang the Taoist
As before, suppose the distribution is P = p(x; θ), where x is the observed sample, θ is the parameter to be estimated, and p(x; θ) is the probability of x occurring when the parameter is θ.
θ = what kind of person Wang the Taoist is; the possible values are: an ordinary Taoist / an evildoer...
x = at the window of a solitary hermitage deep in the wild mountains, a man with a woman in his arms is watching the moon and laughing.
Wu the Second immediately made the maximum likelihood judgment: this is certainly not a good man.
θ = "Wang the Taoist is an evildoer".
Chapter 30: Blood splashes Mandarin Duck Tower at Commandant Zhang's; the Pilgrim Wu crosses Centipede Ridge by night
That night the Pilgrim left Tree-Cross Slope and took to the road. It was October; the days were short and dusk came soon. Before he had gone fifty li he saw a high ridge. Wu Song climbed it step by step while the moon was bright; it was only just the first watch. Standing on the top of the ridge and looking out, he saw the moon rising in the east, lighting up all the vegetation on the ridge.
As he watched, he heard laughter in the woods ahead. The Pilgrim said: "Strange again! How can anyone be laughing in mountains as quiet as these?" Crossing through the forest, he found, by a small hill in a grove of pines, a hermitage: a dozen or so thatched huts, with two small windows standing open. At the window a man with a woman in his arms was watching the moon and laughing.
Seeing this, the Pilgrim felt "anger rise from the heart and wrath grow toward the gall", and said: "Here in the mountain forest, yet monks do such things!" Then he drew from his waist the two ritual knives that gleamed like burnished silver, looked at them in the moonlight, and said: "The knives are good, but in my hands they have not yet opened for business. Let me try the blade on this wretched 'gentleman'."
0x03 Maximum Posterior Probability Estimation (MAP)
Maximum A Posteriori estimation (MAP) is the parameter estimation method commonly used by the Bayesian school.
First, recall the concept of the likelihood function: for a function P(x|θ), θ is the parameter to be estimated and x is the specific data or sample. If x is known and θ is the variable, this function is called the likelihood function; it describes the probability of the sample point x occurring under different model parameters.
The maximum a posteriori estimate can be derived from the maximum likelihood estimate.
1. Reasoning process
The maximum likelihood estimate of the parameter θ is the value that maximizes the likelihood function P(x|θ).
Maximum a posteriori estimation instead looks for the θ that maximizes P(x|θ)P(θ). The θ obtained this way not only makes the likelihood function large but also makes the prior probability of θ itself large.
MAP: maximize P(x|θ)P(θ)
Because in the actual experiment x has already been observed, P(x) is a fixed value. Thus
MAP: maximize P(x|θ)P(θ)/P(x)
At this point it can be seen that MAP is influenced by two parts: P(x|θ), which is the likelihood function, and P(θ), the prior distribution of the parameter.
P(x|θ)P(θ)/P(x) = P(θ|x), so
MAP: maximize P(θ|x) = P(x|θ)P(θ)/P(x)
The meaning of maximizing P(θ|x) is clear: x has already been observed, and we ask which value of θ makes P(θ|x) largest. Incidentally, P(θ|x) is the posterior probability, which is where the name "maximum a posteriori estimation" comes from.
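Written compactly, the chain of equalities above (with P(x) dropped at the end because it does not depend on θ) is:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta} P(\theta \mid x)
  = \arg\max_{\theta} \frac{P(x \mid \theta)\,P(\theta)}{P(x)}
  = \arg\max_{\theta} P(x \mid \theta)\,P(\theta)
```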
2. The above reasoning can also be stated as follows
Maximum likelihood estimation regards the θ that maximizes the likelihood function P(x|θ) as the best θ; it treats θ as a fixed value whose value is simply unknown.
Maximum a posteriori estimation regards θ as a random variable, that is, θ follows some probability distribution, called the prior distribution. When solving, besides the likelihood function P(x|θ), it also considers the prior P(θ), and therefore regards the θ that maximizes P(x|θ)P(θ) as the best θ.
The function to maximize thus becomes P(x|θ)P(θ). Since the distribution of the data P(x) is fixed (it can be obtained by analyzing the data), this is equivalent to maximizing P(x|θ)P(θ)/P(x). By Bayes' rule, P(x|θ)P(θ)/P(x) = P(θ|x), so what is being maximized is P(θ|x), which is precisely the posterior probability of θ.
In maximum likelihood estimation, θ is treated as fixed, which amounts to taking P(θ) = 1.
3. Relationship and difference between maximum a posteriori and maximum likelihood:
Maximum a posteriori estimation not only uses the current sample but also lets us add prior knowledge to the estimation model, which is useful when the sample size is small. The difference between maximum a posteriori and maximum likelihood lies in how the parameter θ is understood.
- Maximum a posteriori holds that the parameter itself follows some underlying distribution that must be taken into account; the prior probability density function is known and is P(θ).
- Maximum likelihood holds that the parameter is a fixed value, not a random variable.
In fact, maximum a posteriori estimation is maximum likelihood with a prior on the parameter added (the prior distribution of the parameter to be estimated). Equivalently, maximum likelihood estimation can be viewed as treating the prior as a constant: if P(θ) is assumed to be a uniform distribution, the Bayesian method becomes equivalent to the frequentist method. Intuitively, a uniform prior essentially means we presume nothing in advance, so the maximum a posteriori and maximum likelihood estimates coincide, as the sketch below illustrates.
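A minimal sketch of this equivalence, assuming Bernoulli data and a Beta(a, b) prior on the coin bias (the counts and hyperparameters are illustrative). With the uniform prior Beta(1, 1) the MAP estimate collapses to the MLE; with an informative prior it is pulled toward the prior:

```python
# Closed-form MLE and MAP for a Bernoulli likelihood with a Beta(a, b) prior.
def mle(heads, tails):
    return heads / (heads + tails)

def map_estimate(heads, tails, a, b):
    # Mode of the Beta(a + heads, b + tails) posterior
    return (heads + a - 1) / (heads + tails + a + b - 2)

heads, tails = 7, 3
print(mle(heads, tails))                  # 0.7
print(map_estimate(heads, tails, 1, 1))   # 0.7   (uniform prior -> same as MLE)
print(map_estimate(heads, tails, 5, 5))   # 0.611 (prior pulls the estimate toward 0.5)
```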
4. Solving steps of maximum posterior probability estimation:
- Determine the prior distribution and the likelihood function of the parameter
- Determine the posterior distribution function of the parameter
- Convert the posterior distribution function to a logarithmic function
- Find the maximum value of the logarithmic function (take the derivative, solve the equation; a numerical sketch follows below)
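A minimal numerical sketch of these four steps, assuming the same Bernoulli data and Beta(5, 5) prior as above (log P(x) is a constant and can be dropped before maximizing):

```python
# Build the log-posterior up to a constant and maximize it numerically.
import numpy as np
from scipy.optimize import minimize_scalar

heads, tails = 7, 3
a, b = 5.0, 5.0                           # Beta prior hyperparameters (illustrative)

def neg_log_posterior(theta):
    theta = np.clip(theta, 1e-9, 1 - 1e-9)
    log_likelihood = heads * np.log(theta) + tails * np.log(1 - theta)
    log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
    return -(log_likelihood + log_prior)  # -log P(x) is constant, so it is omitted

res = minimize_scalar(neg_log_posterior, bounds=(0.0, 1.0), method="bounded")
print(res.x)   # about 0.611, the mode of the Beta(12, 8) posterior
```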
5. Maximum posterior probability estimation in the Water Margin
The risk of maximum likelihood estimation is that, when the sample size is insufficient, it may make a wrong judgment.
The difference between maximum a posteriori and maximum likelihood: maximum a posteriori allows us to add prior knowledge to the estimation model, which is very useful when there are few samples.
Shi Xiu kills Pei Ruhai / Pan Qiaoyun.
What kind of person is Shi Xiu, the "Daredevil Third Brother"?
- First, he was a small merchant, "selling sheep and horses / selling firewood / running a slaughterhouse", so he had to do things with evidence, reasoning, and deliberation.
- Second, he is one of the rare Liangshan heroes who is "bold but careful": he could crack the maze of the Zhu Family Manor and also pull off the rescue of Lu Junyi.
Both his professional traits and his personality determine that he will not simply consider "maximum likelihood"; instead he combines the "prior conditions" and performs "maximum a posteriori estimation", that is, he looks for what makes the "prior conditions" and the "sample" largest together.
Shi Xiu's earlier experience (his prior knowledge) is the following: Pan Qiaoyun had several times made suggestive remarks to Shi Xiu before.
As before, suppose the distribution is P = p(x; θ), where x is the observed sample, θ is the parameter to be estimated, and p(x; θ) is the probability of x occurring when the parameter is θ.
θ = the relationship between Pan Qiaoyun and Pei Ruhai; the possible values are: ordinary pilgrim and monk / sworn brother and sister / adultery...
x = the data Shi Xiu observed
Shi Xiu obtained his observation samples through more than ten consecutive covert observations, and then applied "observed data (samples) + prior knowledge --> the parameter with the highest probability".
θ = "they are having an affair".
The book describes Shi Xiu's iterative process of "prior + observation -> inference" and his state of mind in vivid detail.
Chapter 44: Yang Xiong drunkenly curses Pan Qiaoyun; Shi Xiu cleverly kills Pei Ruhai
Shi Xiu said: "So that's how it is." In his heart he had already seen one point. Shi Xiu watched the monk for a good while and thought: "'Do not trust what merely looks upright.' I have often noticed that this sister-in-law only says flirtatious things to me, and I treated her simply as a sister-in-law. So she is no good after all! Just let it land in Shi Xiu's hands, and I may well take action on Yang Xiong's behalf!" Thinking it over, Shi Xiu had seen three points. Standing at the door with his head lowered, lost in thought, he had in fact already seen four points. Watching the monk, Shi Xiu was a good five points displeased. Displeased as he was, by now it had truly reached six points; he merely pleaded a stomachache and went to lie down behind the partition wall. But behind the wall Shi Xiu was only feigning sleep, watching and listening, and had seen seven points. Watching on, Shi Xiu had seen eight points. Shi Xiu, who was by nature suspicious, had already seen nine points. Lying there in the cold he thought: "This lane is a dead end. Why is there this toutuo here, knocking his wooden fish and calling people to prayer before dawn?" Hearing the sound, Shi Xiu jumped up and peered through the crack of the door: he saw a man wearing a headscarf slip out of the shadows and leave with the toutuo, and then the door closed. Shi Xiu had seen ten points.
0x04 Bayesian estimation
1. Extending MAP
The Bayesian school faces a hard question: how should the prior be chosen? If a strong but badly biased prior is chosen, MAP may perform worse than MLE. Bayesian estimation therefore extends MAP. How? Here is the idea:
First, both MLE and MAP output a single deterministic value for the parameter θ: MLE treats θ as a fixed unknown value, while MAP assumes the random variable θ has some probability distribution and then takes the peak of the posterior distribution (its mode).
Second, the mode is often not very representative (especially for multi-modal functions). So instead of settling for the peak of the posterior distribution, we work out the entire posterior distribution and use a whole distribution to describe the parameter to be estimated. This is inference.
Thus Bayesian estimation also assumes that θ is a random variable (following some probability distribution), but instead of directly estimating a specific value of θ, it estimates the distribution of θ; this is what distinguishes it from maximum a posteriori estimation. In Bayesian estimation, the denominator P(x) can no longer be ignored.
2. The idea:
The Bayesian school holds that the world is uncertain, so it starts from an initial estimate (a prior probability) and then continually adjusts that estimate based on the observed data. In layman's terms, when modeling an event the model parameter θ is not treated as a definite value; instead, θ itself follows some underlying distribution.
Bayesian statistics emphasizes that parameters are unknown and uncertain, so as unknown random variables the parameters themselves also have a distribution. Meanwhile, the prior probability of the parameter θ can be obtained from existing prior knowledge and sample information, and the posterior probability of θ can then be inferred from the prior. We hope that the posterior distribution has a sharp peak at the true value of θ.
Unlike maximum likelihood estimation and maximum a posteriori estimation, which return a single value of θ, Bayesian inference uses the prior distribution P(θ) and a series of observations X to derive the posterior distribution of the parameter, P(θ|X). The posterior P(θ|X) is a probability distribution over the values of θ; put simply, we obtain the candidate values of θ together with their plausibilities, and then simply pick out whatever value or summary we need.
3. Three common methods
How, then, do we estimate the parameter from the posterior distribution? There are three common choices: the mode of the posterior distribution (the point of highest posterior density), the median of the posterior distribution, and the mean of the posterior distribution.
Sometimes we want the parameter value with the highest probability; that is the posterior mode estimate.
Sometimes we want the median of the parameter's distribution; that is the posterior median estimate.
Sometimes we want the mean of the parameter's distribution; that is the posterior expectation estimate.
None of the three estimates is inherently better or worse; they are simply three ways to derive the parameter, to be chosen as needed (the sketch below computes all three for a Beta posterior). The most commonly used is the posterior expectation estimate, which is often referred to simply as the Bayesian estimate. The naive Bayes algorithm based on Bayesian estimation follows essentially the same steps as the one based on maximum likelihood estimation; the difference is whether the probabilities are smoothed.
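A minimal sketch of the three summaries, assuming the Beta(12, 8) posterior from the earlier coin example (the numbers are illustrative):

```python
# Posterior mode, median and mean of a Beta(12, 8) distribution.
from scipy.stats import beta

a, b = 12, 8
post = beta(a, b)

mode = (a - 1) / (a + b - 2)     # highest posterior density point (the MAP value)
median = post.median()
mean = post.mean()               # the usual "Bayesian estimate"
print(mode, median, mean)        # roughly 0.611, 0.603, 0.600
```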
4. The relationship between MAP and Bayesian estimation
It should now be clear: in Bayesian estimation, if we adopt the idea of maximum likelihood estimation and solve for θ by maximizing the posterior distribution, i.e. by selecting the peak of the posterior distribution (its mode), it becomes Maximum A Posteriori estimation (MAP).
As an approximate solution to Bayesian estimation, MAP has its own value, because computing the full posterior distribution in Bayesian estimation is often very tricky. Moreover, MAP does not simply fall back to maximum likelihood estimation: it still uses prior information that cannot be obtained from the observed samples alone.
5. Comparison:
- Maximum likelihood estimation, maximum a posteriori estimation, and Bayesian estimation are all parameter estimation methods.
- Maximum likelihood estimation and maximum a posteriori estimation are point estimators: they regard the parameter as an unknown constant and obtain it by maximizing the likelihood or the posterior probability.
- Bayesian estimation regards the parameter as a random variable and belongs to distribution estimation; it then computes the conditional expectation of that random variable given the data set D.
- When the prior is uniform, the maximum likelihood estimate and the maximum a posteriori estimate are equivalent, i.e. the prior probability of the estimated parameter is a constant.
- When both the prior and the likelihood are Gaussian, the maximum a posteriori estimate and the Bayesian estimate are equivalent.
- In general, the integrals required for Bayesian estimation are hard to compute, but approximations such as the Laplace approximation, variational approximations, and Markov chain Monte Carlo (MCMC) sampling can be used (see the sketch after this list).
- The advantage of Bayesian estimation over maximum a posteriori estimation is that it yields the whole posterior distribution, from which other quantities such as the variance can be read off for reference; for instance, if the computed variance is too large, we may judge the distribution not good enough and use that as a factor when choosing parameters. In effect, Bayesian estimation "pulls" the estimated result even closer to the prior than MAP does.
- Applications of Bayesian estimation include the LDA topic model, which uses the properties of conjugate distributions to compute the topic distribution and the word distribution.
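A minimal Metropolis-Hastings sketch, assuming the Bernoulli likelihood and the Beta(5, 5) prior from the earlier examples; here the true posterior is the conjugate Beta(12, 8), so the sample mean should come out near 0.6:

```python
# Random-walk Metropolis sampling from an unnormalized posterior.
import numpy as np

heads, tails, a, b = 7, 3, 5.0, 5.0

def log_unnormalized_posterior(theta):
    if not 0.0 < theta < 1.0:
        return -np.inf
    return ((heads + a - 1) * np.log(theta) +
            (tails + b - 1) * np.log(1 - theta))

rng = np.random.default_rng(0)
theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + rng.normal(scale=0.1)        # symmetric random-walk proposal
    log_ratio = (log_unnormalized_posterior(proposal)
                 - log_unnormalized_posterior(theta))
    if np.log(rng.uniform()) < log_ratio:           # accept/reject step
        theta = proposal
    samples.append(theta)

print(np.mean(samples[5000:]))   # posterior mean estimate, close to 12/20 = 0.6
```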
6. Solving steps of Bayesian estimation:
- Determine the likelihood function of the parameter
- Determine the prior distribution of the parameter; it should be the conjugate prior of the posterior distribution (see the worked example after this list)
- Determine the posterior distribution function of the parameter
- Solve for the posterior distribution of the parameter using the Bayes formula
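As a concrete illustration, assuming Bernoulli observations with k successes in n trials and a Beta(α, β) prior (the standard conjugate pair), the four steps collapse to a closed form, and the posterior mean gives the Bayesian estimate:

```latex
P(x \mid \theta) = \theta^{k}(1-\theta)^{\,n-k}, \qquad
P(\theta) = \mathrm{Beta}(\theta;\,\alpha,\beta)
\;\Longrightarrow\;
P(\theta \mid x) = \mathrm{Beta}(\theta;\,\alpha + k,\;\beta + n - k),
\qquad
\mathbb{E}[\theta \mid x] = \frac{\alpha + k}{\alpha + \beta + n}
```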