Preface

Many people do not understand why a 1 should be prepended to a sample X in neural networks and logistic regression, so that X = [x1, x2, ..., xn] becomes X = [1, x1, x2, ..., xn]. As a result, they make all kinds of mistakes, such as leaving out this 1, or incorrectly adding it to the result of W·X, which leads to all sorts of bugs in the model and even failure to converge, while still not understanding what this bias term actually does.
In the articles "Logistic regression" and "From logistic regression to Neural Network", Xiao Xi often omits the model's bias term b in order to keep the argument focused, but that does not mean it can also be ignored in practical engineering and rigorous theory. On the contrary, it is often important.
In "From logistic regression to Neural Network", Xiao Xi explained that a traditional neural network can be regarded as a "combined model" in which the outputs of multiple logistic regression models serve as the inputs of another logistic regression model. Therefore, discussing the role of the bias term b in a neural network is approximately equivalent to discussing its role in the logistic regression model. A minimal sketch of this "combined model" view is given right below.
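As a rough illustration of that view (the weights, biases, and sample values here are made up purely for illustration), the following NumPy sketch runs a forward pass where each hidden node is itself a logistic regression with its own bias, and the output node is one more logistic regression over the hidden activations:

import numpy as np

def sigmoid(z):
    # the logistic function used by every "logistic regression" node
    return 1.0 / (1.0 + np.exp(-z))

# A toy network: 2 inputs, 3 hidden nodes, 1 output node.
# Every hidden node is a logistic regression with its own weights and bias.
W1 = np.array([[ 1.0, -1.0],
               [ 0.5,  2.0],
               [-1.5,  0.3]])      # 3 hidden nodes x 2 inputs
b1 = np.array([0.1, -0.2, 0.4])    # one bias per hidden node

# The output node is another logistic regression over the 3 hidden outputs.
W2 = np.array([1.0, -2.0, 0.5])
b2 = 0.3

x = np.array([2.0, -1.0])          # a single sample

hidden = sigmoid(W1 @ x + b1)      # three logistic regressions in parallel
output = sigmoid(W2 @ hidden + b2) # one more logistic regression on top
print(hidden, output)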
So, to simplify our thinking, let's start with the bias term of the logistic regression model, which is essentially a review of high-school math.
A review of the basics
As we know, the logistic regression model essentially uses the function y = WX + b to draw its decision surface, where W is the model parameter, namely the slope of the function (remember y = ax + b), and b is its intercept.
In the one-dimensional case, let W = [1] and b = 2. Then y = WX + b looks as follows (a line with intercept 2 and slope 1):
In the two-dimensional case, let W = [1, 1] and b = 2. Then y = WX + b looks as follows (a plane with intercept 2 and slope [1, 1]):
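To make the two small examples above concrete, here is a minimal sketch that simply evaluates y = WX + b for the 1D case (W = [1], b = 2) and the 2D case (W = [1, 1], b = 2); the sample points are made up for illustration, and note that at X = 0 the output is exactly the intercept b = 2:

import numpy as np

# 1D case: W = [1], b = 2  ->  a line with slope 1 and intercept 2
W1, b = np.array([1.0]), 2.0
for x in [np.array([0.0]), np.array([1.0]), np.array([3.0])]:
    print(x, W1 @ x + b)   # at x = 0 the output is the intercept, 2

# 2D case: W = [1, 1], b = 2  ->  a plane with slope [1, 1] and intercept 2
W2 = np.array([1.0, 1.0])
for x in [np.array([0.0, 0.0]), np.array([1.0, 2.0])]:
    print(x, W2 @ x + b)   # again, at the origin the output is b = 2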
Obviously, the function y = WX + b is a line in 2 dimensions, a plane in 3 dimensions, and a hyperplane in higher dimensions, so logistic regression is of course a linear classifier. If we did not have this bias b, we could only draw lines/planes/hyperplanes that pass through the origin. In most cases, such as the one below, forcing the decision surface to pass through the origin would be a disaster.
Therefore, for logistic regression, the bias term b must be added to ensure that our classifier can draw its decision surface anywhere in space (although whatever it draws must be straight rather than curved, heh...).
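As a rough demonstration of how badly things can go without b, here is a minimal scikit-learn sketch (the dataset, its shift away from the origin, and all parameter values are made up for illustration): the same logistic regression is fit once with an intercept and once with fit_intercept=False, so the second model's boundary is forced through the origin.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two well-separated clusters, deliberately placed far away from the origin
X, y = make_blobs(n_samples=200, centers=[[5, 5], [8, 8]],
                  cluster_std=0.8, random_state=0)

with_bias = LogisticRegression().fit(X, y)                    # learns an intercept
no_bias   = LogisticRegression(fit_intercept=False).fit(X, y) # boundary through origin

print("accuracy with bias:   ", with_bias.score(X, y))
print("accuracy without bias:", no_bias.score(X, y))
# The bias-free model can only draw boundaries through the origin, so it
# cannot cleanly separate clusters that both sit far away from it.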
In the same way, the bias term b should be added to a neural network built out of multiple logistic regressions. But think about it: if the hidden layer has three nodes, it is as if there were three logistic regression classifiers. Each of the three classifiers draws its own decision surface, so in general their bias terms b will differ. For example, the complex decision boundary below might be drawn by a neural network with three hidden nodes:
So how do we intelligently assign a different b to each of the three classifiers (hidden nodes)? In other words, how can the model dynamically adjust the b of each of the three classifiers during training so that every classifier draws its own optimal decision surface?
The answer is to prepend a 1 to each n-dimensional sample, so that X = [x1, x2, ..., xn] becomes X = [1, x1, x2, ..., xn], and then let each classifier train its own bias weight. The weight vector of each classifier thus becomes (n+1)-dimensional, namely [w0, w1, ..., wn], where w0 is the weight of the bias feature, so 1·w0 is the bias/intercept of that classifier. In this way, the intercept b, which seems to be a different kind of thing from the slope W, is unified under one framework, and the model can keep adjusting the parameter w0 during training, thereby adjusting b.
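Here is a minimal NumPy sketch of that trick (the sizes and values are made up for illustration): each sample gets a leading 1, each classifier's bias is folded into its weight row as w0, and the augmented product reproduces W·X + b exactly.

import numpy as np

# Three hidden nodes ("three logistic regression classifiers") over 2D inputs
W = np.array([[ 1.0, -1.0],
              [ 0.5,  2.0],
              [-1.5,  0.3]])       # shape (3, 2): one weight row per classifier
b = np.array([0.1, -0.2, 0.4])     # shape (3,): a different bias per classifier

x = np.array([2.0, -1.0])          # one sample, X = [x1, x2]

# The trick: X = [x1, x2] becomes X = [1, x1, x2],
# and each weight row becomes [w0, w1, w2] with w0 playing the role of b.
x_aug = np.concatenate(([1.0], x))     # shape (3,)
W_aug = np.hstack([b[:, None], W])     # shape (3, 3)

print(W @ x + b)        # the usual W.X + b
print(W_aug @ x_aug)    # identical result: 1 * w0 supplies the intercept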
So, if you are writing neural network code and you leave out the bias term, the network is likely to perform very poorly: it may converge slowly, reach poor accuracy, or even fall into a "zombie" state where it never converges at all. So, unless you have a very good reason to remove the bias b, do not casually drop it just because it looks insignificant.