How does the Bayesian approach relate to Ridge regression? Let’s cut to the chase.

To illustrate the problem, consider one-dimensional inputs and collect them into a vector $\mathbf{x} = (x_1, \cdots, x_N)^T$, with the corresponding target values $\mathbf{t} = (t_1, \cdots, t_N)^T$.

We assume that each $t_n$ in the sample is independent and normally distributed, with mean $y(x,\mathbf{w}) = \sum_{j=0}^{M} w_j x^j$ (a function of $x$ and $\mathbf{w}$) and with precision (the reciprocal of the variance) $\beta$. The likelihood function is then


$$p(\mathbf{t}|\mathbf{x},\mathbf{w},\beta)=\prod_{n=1}^{N} \mathcal{N}(t_n|y(x_n,\mathbf{w}),\beta^{-1})$$

Taking the logarithm of the likelihood function and writing out the specific form of the normal distribution, we get:


$$\ln{p(\mathbf{t}|\mathbf{x},\mathbf{w},\beta)}=-\dfrac{\beta}{2}\sum_{n=1}^{N}[y(x_n,\mathbf{w})-t_n]^2+\dfrac{N}{2}\ln{\beta}-\dfrac{N}{2}\ln(2\pi)$$
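As a quick numerical sanity check, here is a minimal sketch (with assumed toy values $M=3$, $\beta=25$, $N=20$, not taken from the text) that evaluates this closed-form log-likelihood and compares it against summing Gaussian log-densities directly:

```python
import numpy as np
from scipy.stats import norm

# Toy setup (assumed values): polynomial degree M, noise precision beta, sample size N
rng = np.random.default_rng(0)
M, beta, N = 3, 25.0, 20
w = rng.normal(size=M + 1)                      # polynomial coefficients w_0..w_M
x = rng.uniform(0.0, 1.0, size=N)
y = np.polynomial.polynomial.polyval(x, w)      # y(x_n, w) = sum_j w_j x_n^j
t = y + rng.normal(scale=beta ** -0.5, size=N)  # t_n ~ N(y(x_n, w), beta^{-1})

# Closed-form expression from the formula above
loglik_formula = (-beta / 2 * np.sum((y - t) ** 2)
                  + N / 2 * np.log(beta)
                  - N / 2 * np.log(2 * np.pi))

# Direct evaluation: log of the product of normal densities
loglik_direct = norm.logpdf(t, loc=y, scale=beta ** -0.5).sum()

print(np.isclose(loglik_formula, loglik_direct))  # True
```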

Maximizing the likelihood function is equivalent to minimizing its negative logarithm, which in turn is equivalent to minimizing $\sum_{n=1}^{N}[y(x_n,\mathbf{w})-t_n]^2$. This is exactly the objective that ordinary least squares (OLS) solves for linear regression. In other words, solving linear regression with OLS is equivalent to maximum likelihood estimation under the assumption of normally distributed noise.
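To make the equivalence concrete, here is a minimal sketch on synthetic data (assumed $M=3$, $\beta=25$; the data and values are illustrative, not from the text) showing that the OLS solution and the maximum-likelihood solution for $\mathbf{w}$ coincide:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (assumed values): degree-M polynomial features, Gaussian noise with precision beta
rng = np.random.default_rng(1)
M, beta, N = 3, 25.0, 30
x = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(np.pi * x) + rng.normal(scale=beta ** -0.5, size=N)
Phi = np.vander(x, M + 1, increasing=True)      # design matrix, Phi[n, j] = x_n^j

# OLS: minimize sum_n [y(x_n, w) - t_n]^2
w_ols, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Maximum likelihood: minimize the negative log-likelihood over w
def neg_log_lik(w):
    r = Phi @ w - t
    return beta / 2 * r @ r - N / 2 * np.log(beta) + N / 2 * np.log(2 * np.pi)

w_ml = minimize(neg_log_lik, np.zeros(M + 1)).x

print(np.allclose(w_ols, w_ml, atol=1e-4))      # True: same estimate of w
```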

So what happens with the Bayesian approach? Since the Bayesian method requires a prior distribution over the parameters, we assume that the prior of $\mathbf{w}$ is a simple normal distribution controlled by a hyperparameter $\alpha$. Note that this is a multivariate normal distribution:


$$\begin{aligned} p(\mathbf{w}|\alpha)&=\mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I})\\ &=\left(\dfrac{\alpha}{2\pi}\right)^{\frac{M+1}{2}}\exp\left(-\dfrac{\alpha}{2}\mathbf{w}^T \mathbf{w}\right) \end{aligned}$$

where $M+1$ is the total number of elements of $\mathbf{w}$.
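As a small check of this density (with an assumed $\alpha=2$ and $M=3$, purely for illustration), the closed-form expression matches an isotropic multivariate normal with covariance $\alpha^{-1}\mathbf{I}$:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed illustrative values: prior precision alpha, polynomial degree M (w has M + 1 entries)
alpha, M = 2.0, 3
w = np.array([0.3, -0.1, 0.7, 0.2])             # an arbitrary parameter vector

# Closed-form density from the formula above
p_formula = (alpha / (2 * np.pi)) ** ((M + 1) / 2) * np.exp(-alpha / 2 * w @ w)

# Direct evaluation of N(w | 0, alpha^{-1} I)
p_direct = multivariate_normal(mean=np.zeros(M + 1), cov=np.eye(M + 1) / alpha).pdf(w)

print(np.isclose(p_formula, p_direct))          # True
```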

According to Bayes' theorem, we have


$$p(\mathbf{w}|\mathbf{x},\mathbf{t},\alpha,\beta)\propto p(\mathbf{t}|\mathbf{x},\mathbf{w},\beta)\,p(\mathbf{w}|\alpha)$$

What we want to maximize is the posterior probability of $\mathbf{w}$; this approach is known as MAP (maximum a posteriori) estimation.

Taking the negative logarithm of the right-hand side of the above equation and dropping the terms that do not depend on $\mathbf{w}$, we are left with:


$$\dfrac{\beta}{2}\sum_{n=1}^{N}[y(x_n,\mathbf{w})-t_n]^2+\dfrac{\alpha}{2}\mathbf{w}^T\mathbf{w}$$

We find that by adding, on top of the original assumption of normally distributed data, a prior on the parameters that is a zero-mean, equal-variance, uncorrelated multivariate normal distribution, the quantity the Bayesian (MAP) method optimizes is exactly the Ridge regression objective: dividing through by $\beta$ and taking the regularization parameter $\lambda = \dfrac{\alpha}{\beta}$, the two give the same result.
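Here is a minimal sketch of that equivalence on synthetic data (assumed $\alpha=0.5$, $\beta=25$, $M=3$; all values are illustrative): the MAP estimate obtained from the penalized objective above equals the Ridge solution with $\lambda=\alpha/\beta$.

```python
import numpy as np

# Assumed illustrative values: prior precision alpha, noise precision beta, polynomial degree M
rng = np.random.default_rng(2)
M, alpha, beta, N = 3, 0.5, 25.0, 30
x = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(np.pi * x) + rng.normal(scale=beta ** -0.5, size=N)
Phi = np.vander(x, M + 1, increasing=True)      # design matrix, Phi[n, j] = x_n^j

# MAP: minimize beta/2 * ||Phi w - t||^2 + alpha/2 * w^T w
# Setting the gradient to zero gives (beta Phi^T Phi + alpha I) w = beta Phi^T t
w_map = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(M + 1),
                        beta * Phi.T @ t)

# Ridge: minimize ||Phi w - t||^2 + lam * w^T w, with lam = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

print(np.allclose(w_map, w_ridge))              # True: identical solutions
```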