Alink's Ramble (11): L-BFGS Optimization for Linear Regression
0x00 Abstract
Alink is a new-generation machine learning algorithm platform developed by Alibaba on top of the real-time computing engine Flink, and the first machine learning platform in the industry that supports both batch and streaming algorithms. This article introduces how L-BFGS optimization for linear regression is implemented in Alink, in the hope that it can serve as a roadmap for reading the linear regression code.
Since there is little public information about Alink, much of what follows is my own deduction, and there will certainly be omissions and mistakes. Please point them out and I will update the article.
0x01 Review
So far the input has been processed (see Alink's Ramble (10): Data Preprocessing for Linear Regression); what follows is the optimization. The main objective of optimization is to find a direction in which moving the parameters reduces the value of the loss function. This direction is usually obtained from combinations of the first partial derivatives or the second partial derivatives. So let us go over the basic idea again.
1.1 Basic ideas of optimization
For the linear regression model f(x) = w'x + e, we construct a cost function (loss function) J(θ); by finding the minimum of J(θ) we can determine the coefficients w of the linear model.
The final function to optimize is min( L(Y, f(x)) + J(f) ): we minimize the empirical risk (the loss term) together with the structural risk (the regularization term). This combined function is called the objective function.
So what we are going to do is pick the best θ based on the training set, making f(x) as close as possible to the true value on that training set. We define a function to describe "the difference between f(x) and the true value y". This function is called the objective function, and here it is the famous least-squares function.
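Written out in the standard least-squares form (with m training samples), the objective and its gradient are:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(\theta^{\top}x^{(i)} - y^{(i)}\right)^{2},\qquad \nabla_{\theta}J(\theta) = \sum_{i=1}^{m}\left(\theta^{\top}x^{(i)} - y^{(i)}\right)x^{(i)}$$

The per-sample term (θ⊤x − y) is exactly what SquareLossFunc.derivative returns later in this article.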
We want to choose the optimal θ so that f(x) is closest to the true value. The problem thus becomes: find the optimal θ that minimizes the objective function J(θ).
1.2 Various optimization methods
Finding an appropriate W that minimizes the objective function f(W) is an unconstrained optimization problem. The general approach is to start from a random initial W0 and iterate: in each iteration, compute a descent direction of the objective function and update W, until the objective function stabilizes at a minimum.
The difference between optimization algorithms lies in how the descent direction D_t is computed. The descent direction is obtained from the first derivative (gradient) and the second derivative (Hessian matrix) of the objective function at the current W. Common algorithms include gradient descent, Newton's method, and quasi-Newton methods.
- The gradient descent method directly takes the opposite direction of the objective function's gradient at the current W as the descent direction.
- Newton's method approximates the objective function by its second-order Taylor expansion at the current W and then solves the approximate function for the descent direction, where B_t is the Hessian matrix of the objective function f(W) at W_t. This search direction is also called the Newton direction.
- The quasi-Newton method only requires the gradient of the objective function to be computed at each iteration; it constructs an approximate Hessian matrix by fitting, and uses it to compute the Newton direction.
- L-BFGS (limited-memory BFGS) removes BFGS's need to store the N*N inverse-Hessian approximation after every iteration; each iteration only needs to keep a few recent vector pairs and scalars.
In Alink, UnaryLossObjFunc is the objective function, SquareLossFunc is the loss function, and the L-BFGS algorithm is used to optimize the model.
That is, the optimization direction is determined by the quasi-Newton method L-BFGS (concretely, via the second-order Taylor expansion of f(x)), and the loss to be minimized is computed with the squared loss function.
0x02 Basic Concepts
Because the L-BFGS algorithm is a kind of quasi-Newton method, we first introduce the essence of Newton's method, starting from the Taylor expansion.
2.1 Taylor expansion
The Taylor expansion approximates a complicated function f(x) locally, on some interval, by a sum of simple power functions expanded around a point x_0. For example, it is hard to compute sin(1) directly, but we can approximate it through the Taylor series of sin, and the more terms we expand, the higher the precision. Computers usually evaluate sin(x) exactly this way, via its Taylor expansion.
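As a small self-contained illustration (my own sketch, not Alink code), the following Java computes sin(1) from the Maclaurin series of sin; each additional term raises the precision:

```java
public class TaylorSin {
    // Approximate sin(x) with the first n terms of its Maclaurin series:
    // sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
    static double sinTaylor(double x, int n) {
        double term = x;   // current term, starts at x^1 / 1!
        double sum = 0.0;
        for (int k = 0; k < n; k++) {
            sum += term;
            // next term: multiply by -x^2 / ((2k+2)(2k+3))
            term *= -x * x / ((2 * k + 2) * (2 * k + 3));
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.printf("n=%d : %.12f%n", n, sinTaylor(1.0, n));
        }
        System.out.printf("Math.sin(1.0) = %.12f%n", Math.sin(1.0));
    }
}
```

With five terms the approximation already agrees with Math.sin(1.0) = 0.841470984808 to many decimal places.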
Why did Taylor invent this formula?
Because at that time mathematics was mature for simple functions and their applications, but complicated functions, such as f(x) = sin(x)ln(1+x), or curves with no closed-form expression at all, were intractable: apart from plugging in an x and getting its y, there was little one could do. So Taylor took the bull by the horns and decided to make all of these functions tractable by making them simple.
To make a complex function simpler, can we convert it to another expression? Taylor's formula, described in one sentence, uses polynomial functions to approximate smooth functions. It is built on the mathematical idea of "substituting straight for curved, dividing the whole into parts."
Taylor's formula is an extremely powerful function-approximation tool: it converts (rewrites) a function expression into polynomial form. Why is it powerful?
- Polynomials are very friendly: easy to evaluate, easy to differentiate, easy to integrate.
- The geometric and computational intuition is direct: a parabola is familiar, and raising to the nth power is just multiplying the base by itself n times.
How can we understand this intuitively?
All Taylor's formula does is use a polynomial expression to estimate (approximate) a function's values in the vicinity of a point.
When we want to copy something, we follow this line of thought: first make it similar overall, then similar in its parts, then similar in the details, then in still finer details… After endless refinement, the imitation becomes infinitely close to the real thing, until it is hard to tell true from false.
Physics supplies the same "faking" experience in kinematics problems: to copy a motion curve, first ensure the curves share the starting point; then ensure that the rate of change of displacement over time (the velocity) at the starting point is the same; then ensure that the rate of change of that rate of change (the acceleration) is the same… If every order of the rate of change over time (every derivative) is the same, the two curves must be completely equivalent.
So Taylor's way of staying away from troublesome functions like sin(x) was simply to replace them: construct, from the function's graph, another function whose graph looks like the original. This behavior is called approximation in mathematics. Leaving the terminology aside, let's talk about how to fake a graph.
The first step in the imitation is to get the imitation curve to go through this point.
After the first step the imitation is still rough; you can hardly tell in what way it is similar, so we continue to refine. Next we consider the trend of the curve, its derivative, and make sure the first derivatives are equal at this point.
After the second step, the starting point and the overall trend now match, and it starts to look like something. To refine further, concavity should be considered. As we learned in high school, the quantity characterizing the concavity of a graph is the "derivative of the derivative". So the next step is to make the second derivatives of the two functions equal.
With the same starting point, the same trend, and the same concavity, the imitation is more similar still; pushing further and further, we get infinitely close. So Taylor thought: "To fake a curve, first make sure the starting point is the same, then make sure the first derivatives there are the same, then the second derivatives, and so on…"
Taylor's expansion simply replaces a trigonometric function, an exponential function, or some other awkward function with a polynomial.
In other words: given an original function f(x), I create a polynomial function g(x) whose graph resembles the original. To guarantee the resemblance, I only need to ensure that at a certain point the values of the two functions are equal, their first derivatives are equal, their second derivatives are equal… up to their nth derivatives being equal.
2.2 Newton’s method
The basic idea of Newton's method is to find the next estimate of the minimum by a second-order Taylor expansion of f(x) around the current estimate. The core idea: at the iterate x_k, use the first derivative (gradient) and the second derivative (Hessian matrix) to approximate the objective function by a quadratic, take the minimum point of that quadratic model as the new iterate, and repeat until an approximate minimum of the required accuracy is reached.
Gradient descent approximates the function at x_n to first order, by a line; it computes the gradient and takes the opposite direction of the gradient as the next direction of optimization. Newton's method approximates the function at x_n to second order, by a quadratic; it computes both the gradient and the second derivative to determine the next direction.
What we optimize are functions of several variables, so x is generally not a scalar but a vector. The first derivative of f(x) is then also a vector (the gradient), and the second derivative is a matrix: the Hessian matrix.
2.2.1 Taylor first-order expansion
Newton's root-finding method follows from the first-order Taylor expansion. For the equation f(x) = 0, expand f to first order at an arbitrary point x0:

f(x) ≈ f(x0) + f'(x0)(x − x0)

Setting f(x) = 0 and substituting into the expansion gives:

x = x0 − f(x0) / f'(x0)

Note that an approximation is used here, so the x obtained is not an exact root of the equation, only an approximate solution; still, this x is closer to the root than x0 was.
We then solve iteratively: treat this x as the new x0 and solve again for a position closer to the root. The iteration formula is:

x_{k+1} = x_k − f(x_k) / f'(x_k)
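A minimal sketch of this root-finding iteration (an illustration, not Alink code), applied to f(x) = x² − 2, whose positive root is √2 ≈ 1.41421356:

```java
public class NewtonRoot {
    public static void main(String[] args) {
        double x = 1.0;                      // initial guess x0
        for (int k = 0; k < 6; k++) {
            double f = x * x - 2.0;          // f(x)
            double fPrime = 2.0 * x;         // f'(x)
            x = x - f / fPrime;              // x_{k+1} = x_k - f(x_k) / f'(x_k)
            System.out.printf("k=%d  x=%.12f%n", k + 1, x);
        }
        // converges quadratically to sqrt(2) = 1.414213562373...
    }
}
```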
2.2.2 Taylor second-order expansion
In machine learning and deep learning, minimizing the loss function is generally done with first-derivative-based gradient descent. Looked at another way, minimizing the loss function is an extremum problem: find the point where the first derivative f'(x) equals 0. In other words, once we find the point x with f'(x) = 0, the loss function reaches its minimum and the model's optimization goal is achieved.
So now the goal is to find the root of f'(x) = 0, and for root finding we can use Newton's method from the previous section.
Unlike the previous section, we now perform a second-order Taylor expansion of f(x) at x0:

f(x) ≈ f(x0) + f'(x0)(x − x0) + ½ f''(x0)(x − x0)²

This holds as long as x stays near x0. Differentiating the right-hand side with respect to (x − x0) and setting that derivative to 0, i.e., f'(x0) + f''(x0)(x − x0) = 0, we get:

x = x0 − f'(x0) / f''(x0)

Again, although this x is not the exact optimum, it is closer to the root of f'(x) = 0 than x0. In this way we obtain the optimization iteration formula:

x_{k+1} = x_k − f'(x_k) / f''(x_k)

By iterating this formula we keep approaching the root of f'(x) = 0, thereby achieving the goal of minimizing the loss function.
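The same idea for minimization, as a sketch (again just an illustration): minimize f(x) = x² + eˣ by driving f'(x) = 2x + eˣ to zero with the iteration above:

```java
public class NewtonMinimize {
    public static void main(String[] args) {
        // minimize f(x) = x^2 + e^x by finding the root of f'(x) = 2x + e^x
        double x = 0.0;
        for (int k = 0; k < 6; k++) {
            double g = 2.0 * x + Math.exp(x);   // f'(x)
            double h = 2.0 + Math.exp(x);       // f''(x)
            x = x - g / h;                      // x_{k+1} = x_k - f'(x_k) / f''(x_k)
            System.out.printf("k=%d  x=%.12f  f'(x)=%.2e%n",
                    k + 1, x, 2.0 * x + Math.exp(x));
        }
        // converges to x ≈ -0.351734, where f'(x) = 0
    }
}
```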
2.2.3 High-dimensional space
In machine learning optimization problems, we optimize functions of several variables, so x is usually not a scalar but a vector. When Newton's root-finding method is applied in machine learning, x is a vector, the first derivative of f(x) is also a vector (the gradient), and the second derivative is a matrix: the Hessian matrix.
In high-dimensional space we replace the first derivative by the gradient and the second derivative by the Hessian matrix; the form of Newton's iteration is unchanged:

x_{k+1} = x_k − H_k⁻¹ · ∇f(x_k)

where ∇f(x_k) (sometimes written J, the Jacobian) collects the first partial derivatives, and H_k is the Hessian matrix at x_k.
The derivation is as follows:
Assume f(x) is twice continuously differentiable. Taylor-expanding f(x) at x_k and keeping terms up to second order:

f(x) ≈ f(x_k) + ∇f(x_k)ᵀ(x − x_k) + ½ (x − x_k)ᵀ ∇²f(x_k) (x − x_k)

Our goal is the minimum of f(x), and a point with zero derivative is very likely an extremum, so we differentiate the expansion with respect to x and set the derivative to 0, i.e., ∇f(x) = 0, which gives:

∇f(x_k) + ∇²f(x_k)(x − x_k) = 0

Assuming ∇²f(x_k) is invertible, Newton's iteration formula follows:

x_{k+1} = x_k − (∇²f(x_k))⁻¹ ∇f(x_k)
When the original function has continuous first and second derivatives, Newton's method can be combined with a one-dimensional line search; it converges fast, with a local second-order convergence rate.
2.2.4 Basic flow of Newton method
To summarize, the steps for finding roots with Newton's method:
a. Given the function, generate a random starting point.
b. Iterate k times from the known point, using the iteration formula.
c. If the result equals the previous iterate, or differs from it by less than a threshold, stop: the result is taken as a root of the function.
The pseudocode is as follows
import numpy as np

def newton(feature, label, iterMax, sigma, delta):
    '''Newton's method with a backtracking line search.
    input:  feature(mat): features
            label(mat): labels
            iterMax(int): maximum number of iterations
            sigma(float), delta(float): line-search parameters of Newton's method
    output: w(mat): regression coefficients
    '''
    n = np.shape(feature)[1]
    w = np.mat(np.zeros((n, 1)))
    it = 0
    while it <= iterMax:
        g = first_derivative(feature, label, w)   # gradient (first derivative)
        G = second_derivative(feature)            # Hessian (second derivative)
        d = -G.I * g                              # Newton direction
        m = get_min_m(feature, label, sigma, delta, d, w, g)  # smallest m giving sufficient decrease
        w = w + pow(sigma, m) * d                 # step of size sigma^m along d
        it += 1
    return w
2.2.5 Problems and solutions
Newton's method must compute not only the Hessian matrix but also its inverse. When the amount of data is small, speed is not greatly affected; but when the data volume is very large, especially in deep neural networks, computing the Hessian and its inverse is very time consuming. In overall effect, Newton's method then optimizes no faster than gradient descent, which is why most loss-optimization strategies for neural networks are based on gradient descent.
Another problem: if the Hessian at some point is close to singular (its condition number is too large), inverting it causes numerical instability, and the iteration may even fail to converge.
When x has many dimensions, obtaining the second derivative of f(x) is very expensive, and since Newton's method is an iterative algorithm for finding a stationary point, we face this expense at every iteration; as a result, plain Newton's method is rarely usable in machine learning. Is there a way out?
In practice we usually do not compute the inverse matrix; instead we build an approximation of the Hessian, ideally using only first-derivative information to approximate the second-derivative information, so as to get close-to-second-order convergence at much lower cost. This is the quasi-Newton method.
The idea of the quasi-Newton method is simple: just as the difference quotient between two points approximates a function's derivative, the difference of the first derivatives at two points approximates the second derivative. Geometrically, we take a secant of the first derivative; in the limit the secant approaches the tangent, and the matrix B approaches the Hessian matrix.
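In symbols, this is the secant (quasi-Newton) condition that the approximation B_{k+1} must satisfy:

$$s_k = x_{k+1} - x_k,\qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),\qquad B_{k+1}\,s_k = y_k$$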
2.3 Quasi-Newton method
The quasi-Newton method simulates the construction of the Hessian matrix, approximating it iteratively. Its main variants are the DFP and BFGS quasi-Newton methods. A quasi-Newton method only requires the gradient of the objective function at each iteration step.
The various quasi-Newton methods use iteration to approximate either the Hessian matrix itself or its inverse.
The general strategy for constructing H_{k+1} is: choose H_1 as an arbitrary symmetric positive definite n-order matrix (usually the identity I), then obtain H_{k+1} by repeatedly correcting H_k, i.e., H_{k+1} = H_k + ΔH_k.
For example, each update of the matrix H (the approximation of the inverse Hessian) in the BFGS method requires the difference of iterates s and the difference of gradients y at step k; the H at step k+1 effectively incorporates the s and y values from the beginning up to step k.
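Concretely, the standard BFGS update of H (the inverse-Hessian approximation) from the pair (s_k, y_k) is:

$$H_{k+1} = \left(I - \rho_k s_k y_k^{\top}\right) H_k \left(I - \rho_k y_k s_k^{\top}\right) + \rho_k s_k s_k^{\top},\qquad \rho_k = \frac{1}{y_k^{\top}s_k}$$

Unrolling this recursion from the initial H is exactly why H at step k+1 depends on all the s and y values up to step k.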
To find a root of a function we can run Newton's stationary-point method and BFGS together: both algorithms iterate, so we simply let them iterate jointly. Both slowly approach the root of the function, so after k iterations the solution obtained is, in machine learning terms, the root of the derivative of the objective function. This joint-iteration scheme is called "on the fly".
In the first BFGS iteration, the identity matrix times the gradient g is just the gradient g, which is formally the same expression as gradient descent; so BFGS can be understood as an algorithm that gradually morphs from gradient descent into Newton's method as it solves.
Although BFGS approaches the H matrix step by step starting from the identity matrix, we still have to store the matrix D at every step. And how big is D? Suppose our dataset has 100,000 dimensions (not particularly large); then D has 100,000 × 100,000 entries, and at 8 bytes per double that is 8×10^10 bytes ≈ 74.5 GB per iteration. We cannot store such a matrix. How can we solve this? With the L-BFGS algorithm.
2.4 The L-BFGS Algorithm
The basic idea of the L-BFGS algorithm is to keep only the curvature information from the most recent m iterations and use it to construct the approximation of the Hessian matrix.
Each iteration's D matrix is generated from s_k and y_k, so we only need to store those vectors, and when memory is insufficient we simply discard the oldest data. Suppose we set the number of stored vector pairs to 100: once the s and y iterations exceed 100, the earliest s and y are thrown away, and each further iteration discards the leading s and y. Some accuracy is lost, but the solution can now be obtained with limited memory; hence L-BFGS can be understood as a further approximation of BFGS.
I will not repeat the mathematical derivation here (there are many good articles about it online) and will focus on the engineering. The general steps of the L-BFGS algorithm are:
Step 1: choose an initial point x_0 and the number m of recent iterations to store;
Step 2: k = 0, H_0 = I, r = ∇f(x_0);
Step 3: compute the gradient and loss with the updated parameters; if the threshold is reached, return the optimal solution x_{k+1}, otherwise go to Step 4;
Step 4: compute the feasible descent direction of this iteration, p_k = −r_k;
Step 5: compute the step size α_k by a one-dimensional line search;
Step 6: update the weights x;
Step 7: keep only the vector pairs of the most recent m iterations;
Step 8: compute and save s_k, y_k;
Step 9: compute r_k with the two-loop recursion algorithm; k = k + 1, go to Step 3.
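Step 9's "two-loop recursion" is the heart of L-BFGS: it applies the approximate inverse Hessian to the gradient using only the stored (s, y) pairs, never materializing a matrix. Below is a minimal sketch of the textbook algorithm (an illustration, not Alink's exact code; Alink's version appears in CalDirection later):

```java
public class TwoLoopRecursion {
    /**
     * The L-BFGS two-loop recursion (illustrative sketch, not Alink's exact code).
     * s[i] = x_{i+1} - x_i and y[i] = g_{i+1} - g_i for the most recent m iterations,
     * ordered oldest to newest (assumes at least one stored pair); g is the current gradient.
     * Returns r = H_k * g, so the descent direction is -r.
     */
    static double[] twoLoop(double[][] s, double[][] y, double[] g) {
        int m = s.length;
        double[] q = g.clone();
        double[] alpha = new double[m];
        for (int i = m - 1; i >= 0; i--) {          // first loop: newest to oldest
            double rho = 1.0 / dot(y[i], s[i]);
            alpha[i] = rho * dot(s[i], q);
            axpy(-alpha[i], y[i], q);               // q -= alpha_i * y_i
        }
        double gamma = dot(s[m - 1], y[m - 1]) / dot(y[m - 1], y[m - 1]);
        double[] r = q.clone();
        for (int j = 0; j < r.length; j++) {
            r[j] *= gamma;                          // initial approximation H_0 = gamma * I
        }
        for (int i = 0; i < m; i++) {               // second loop: oldest to newest
            double rho = 1.0 / dot(y[i], s[i]);
            double beta = rho * dot(y[i], r);
            axpy(alpha[i] - beta, s[i], r);         // r += (alpha_i - beta) * s_i
        }
        return r;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    static void axpy(double a, double[] x, double[] target) {
        for (int i = 0; i < target.length; i++) target[i] += a * x[i];
    }
}
```

The cost per call is O(m·n) for n-dimensional parameters, instead of the O(n²) storage and O(n²) work that full BFGS needs.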
0x03 Optimizing the Model: the L-BFGS Algorithm
Back to the code. BaseLinearModelTrainBatchOp calls the optimizer as follows:
return new Lbfgs(objFunc, trainData, coefficientDim, params).optimize();
After optimization, the coefficients of the linear model will be returned.
/**
* optimizer api.
*
* @return the coefficient of linear problem.
*/
@Override
public DataSet <Tuple2 <DenseVector, double[]>> optimize() {
	/**
	 * Solve the problem by iteration.
	 * trainData is the distributed samples.
	 * initCoef is the initial model coefficient, which will be broadcast to every worker.
	 * objFuncSet is the objective function in DataSet format.
	 * .add(new PreallocateCoefficient(OptimName.currentCoef))  allocate memory for the current coefficient
	 * .add(new PreallocateCoefficient(OptimName.minCoef))      allocate memory for the min-loss coefficient
	 * .add(new PreallocateLossCurve(OptimVariable.lossCurve))  allocate memory for the loss values
	 * .add(new PreallocateVector(OptimName.dir ...))           allocate memory for dir
	 * .add(new PreallocateVector(OptimName.grad))              allocate memory for grad
	 * .add(new PreallocateSkyk())                              allocate memory for sK, yK
	 * .add(new CalcGradient(objFunc))                          calculate the local sub-gradient
	 * .add(new AllReduce(OptimName.gradAllReduce))             sum all sub-gradients with AllReduce
	 * .add(new CalDirection())                                 take the summed gradient and compute the descent direction
	 * .add(new CalcLosses(objFunc, OptimMethod.GD))            calculate local losses for the line search
	 * .add(new AllReduce(OptimName.lossAllReduce))             sum all losses with AllReduce
	 * .add(new UpdateModel(maxIter, epsilon ...))              update the coefficient
	 * .setCompareCriterionOfNode0(new IterTermination())       judge whether to stop the iteration
	 */
DataSet <Row> model = new IterativeComQueue()
.initWithPartitionedData(OptimVariable.trainData, trainData)
.initWithBroadcastData(OptimVariable.model, coefVec)
.initWithBroadcastData(OptimVariable.objFunc, objFuncSet)
.add(new PreallocateCoefficient(OptimVariable.currentCoef))
.add(new PreallocateCoefficient(OptimVariable.minCoef))
.add(new PreallocateLossCurve(OptimVariable.lossCurve, maxIter))
.add(new PreallocateVector(OptimVariable.dir, new double[] {0.0, OptimVariable.learningRate}))
.add(new PreallocateVector(OptimVariable.grad))
.add(new PreallocateSkyk(OptimVariable.numCorrections))
.add(new CalcGradient())
.add(new AllReduce(OptimVariable.gradAllReduce))
.add(new CalDirection(OptimVariable.numCorrections))
.add(new CalcLosses(OptimMethod.LBFGS, numSearchStep))
.add(new AllReduce(OptimVariable.lossAllReduce))
.add(new UpdateModel(params, OptimVariable.grad, OptimMethod.LBFGS, numSearchStep))
.setCompareCriterionOfNode0(new IterTermination())
.closeWith(new OutputModel())
.setMaxIter(maxIter)
.exec();
return model.mapPartition(new ParseRowModel());
}
So next we look at Lbfgs, which focuses on two things: the descent direction of the parameter update, and the search step size.
3.1 How the distributed implementation works
Because all input samples have the same dimension, and the algorithm does not modify the input during its run, the input data can be partitioned; a map-reduce style computation is therefore possible.
Let’s clarify the steps for distributed parallel computation in L-BFGS:
- Computing the gradient can be parallelized: for example, partial gradient 1 on machine 1, partial gradient 2 on machine 2…, after which a Reduce combines them into the complete gradient vector.
- Computing the direction can be parallelized: the data is partitioned, a Map computes the values on each partition, and a Reduce combines them.
- Computing the losses can be parallelized in the same way: partition the data, compute each partition's values in a Map, and combine them with a Reduce.
In Alink, the AllReduce function is used instead of Flink's native Map/Reduce to carry out the parallel computation and communication for the three points above; its semantics are sketched below.
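Conceptually, AllReduce leaves every worker holding the elementwise sum of all workers' buffers. A minimal single-process sketch of just the semantics (Alink's real AllReduce runs distributed on top of Flink's communication layer):

```java
public class AllReduceSketch {
    /**
     * Semantics of a summing AllReduce over double buffers (single-process
     * illustration only, not Alink's implementation). After the call, every
     * worker's buffer holds the elementwise sum of all buffers.
     */
    static void allReduceSum(double[][] buffersOfAllWorkers) {
        int n = buffersOfAllWorkers[0].length;
        double[] sum = new double[n];
        for (double[] buffer : buffersOfAllWorkers) {
            for (int i = 0; i < n; i++) {
                sum[i] += buffer[i];                 // reduce: elementwise sum
            }
        }
        for (double[] buffer : buffersOfAllWorkers) {
            System.arraycopy(sum, 0, buffer, 0, n);  // broadcast: everyone gets the total
        }
    }
}
```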
3.2 CalcGradient
A line search has only two steps: determine the direction, then determine the step size. Both determining the direction and simulating the Hessian matrix require computing gradients. Computing the gradient vector of the objective function needs only dot products and additions of vectors, so each iteration splits easily into independent computation steps, computed by different nodes and then merged.
Alink distributes the sample feature vectors to different computing nodes; each node performs the dot-product and summation for the samples it is responsible for, and the results are then merged, realizing row-parallel LR.
In practice there are also scenarios with logistic regression on very high-dimensional feature vectors, which row-parallelism alone cannot handle; the high-dimensional vectors would then need to be split by column into several smaller vectors. This may be a point Alink needs to optimize in the future.
CalcGradient is derived from Alink's ComputeFunction; its work is summarized as follows:
- obtain the processed input data labledVectors;
- get the current coefficient coef, the optimization function objFunc, and the gradient grad from static memory;
- compute the local gradient: objFunc.calcGradient(labledVectors, coef, grad.f0); this calls the gradient API of the objective function. What Alink does here is take the gradient of the loss function: substitute x and y and take the partial derivative with respect to coef, as discussed above. For each sample, the gradient contribution of each feature is computed separately;
- the newly computed gradient, multiplied by the weight, is stored into static memory gradAllReduce;
- subsequently, the gradients of all features over all computed samples are accumulated by the aggregation function, yielding the cumulative gradient and loss for each feature.
public class CalcGradient extends ComputeFunction {
/** * object function class, it supply the functions to calc local gradient (or loss). */
private OptimObjFunc objFunc;
@Override
public void calc(ComContext context) {
Iterable<Tuple3<Double, Double, Vector>> labledVectors = context.getObj(OptimVariable.trainData);
// Processed input data
labledVectors = {ArrayList@9877} size = 4
0 = {Tuple3@9895} "(1.0, 16.8, 1.0 1.0 1.4657097546055162 1.4770978917519928)"
1 = {Tuple3@9896} "(1.0, 6.7, 1.0 1.0 -0.338240712601273 -0.7385489458759964)"
2 = {Tuple3@9897} "(1.0, 6.9, 1.0 1.0 -0.7892283294029703 -0.3692744729379982)"
3 = {Tuple3@9898} "(1.0, 8.0, 1.0 1.0 -0.338240712601273 -0.3692744729379982)"
// get iterative coefficient from static memory.
Tuple2<DenseVector, Double> state = context.getObj(OptimVariable.currentCoef);
int size = state.f0.size();
// This is coef; 1.7976931348623157E308 (Double.MAX_VALUE) is the default loss value
state = {Tuple2@9878} "(0.001 0.0 0.0 0.0, 1.7976931348623157E308)"
f0 = {DenseVector@9879} "0.001 0.0 0.0 0.0"
f1 = {Double@9889} 1.7976931348623157E308
DenseVector coef = state.f0;
if (objFunc == null) {
objFunc = ((List<OptimObjFunc>)context.getObj(OptimVariable.objFunc)).get(0);
}
// The variables are as follows
objFunc = {UnaryLossObjFunc@9882}
unaryLossFunc = {SquareLossFunc@9891}
l1 = 0.0
l2 = 0.0
params = {Params@9892} "Params {featureCols=["f0","f1","f2"], labelCol="label", predictionCol="linpred"}"
Tuple2<DenseVector, double[]> grad = context.getObj(OptimVariable.dir);
// The variables are as follows
grad = {Tuple2@9952} "(0.0 0.0 0.0 0.0, [0.0, 0.1])"
f0 = {DenseVector@9953} "0.0 0.0 0.0 0.0"
f1 = {double[2] @9969}
coef = {DenseVector@9951} "0.001 0.0 0.0 0.0"
data = {double[4] @9982}
// Calculate the local gradient, using the objective function
Double weightSum = objFunc.calcGradient(labledVectors, coef, grad.f0);
// prepare buffer vec for allReduce. the last element of vec is the weight Sum.
double[] buffer = context.getObj(OptimVariable.gradAllReduce);
if (buffer == null) {
buffer = new double[size + 1];
context.putObj(OptimVariable.gradAllReduce, buffer);
}
for (int i = 0; i < size; ++i) {
buffer[i] = grad.f0.get(i) * weightSum;
}
        /* the last element is the weight value */
        buffer[size] = weightSum;
    }
}

// The final result is:
buffer = {double[5] @9910}
0 = -38.396
1 = -38.396
2 = -14.206109929253465
3 = -14.364776997288134
4 = 0.0
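The body of objFunc.calcGradient itself is not shown above. For a unary loss its logic plausibly looks like the sketch below; this is my own reconstruction for illustration (the Sample record stands in for Alink's Tuple3), but it is consistent with the dumps above: summing (eta − y)·x over the four samples with coef = (0.001, 0, 0, 0) gives exactly buffer[0] = −38.396.

```java
import java.util.Arrays;
import java.util.List;

public class UnaryLossGradientSketch {

    /** One training sample: weight, label y, feature vector x (a stand-in for Alink's Tuple3). */
    record Sample(double weight, double y, double[] x) {}

    /**
     * Illustrative reconstruction of a unary-loss gradient computation, not Alink's verbatim source.
     * For squared loss, dLoss/dEta = eta - y (see SquareLossFunc.derivative later), so by the
     * chain rule dLoss/dCoef = (eta - y) * x. The accumulated gradient is divided by the total
     * weight and the total weight is returned, matching how CalcGradient above multiplies the
     * gradient back by weightSum before the AllReduce.
     */
    static double calcGradient(List<Sample> samples, double[] coef, double[] grad) {
        Arrays.fill(grad, 0.0);
        double weightSum = 0.0;
        for (Sample s : samples) {
            double eta = 0.0;                                  // eta = coef . x
            for (int i = 0; i < coef.length; i++) eta += coef[i] * s.x()[i];
            double d = s.weight() * (eta - s.y());             // weight * dLoss/dEta
            for (int i = 0; i < grad.length; i++) grad[i] += d * s.x()[i];
            weightSum += s.weight();
        }
        for (int i = 0; i < grad.length; i++) grad[i] /= weightSum;  // average by total weight
        return weightSum;
    }
}
```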
3.3 AllReduce
This is the step mentioned above: "the gradients of all features over the computed samples are accumulated through the aggregation function, obtaining the cumulative gradient and loss for each feature."
.add(new AllReduce(OptimVariable.gradAllReduce))
For details of how AllReduce is implemented, see the article Alink's Ramble (3): Sharing of the AllReduce Communication Model.
3.4 CalDirection
The gradient we now have is the aggregated gradient, so we can start computing the direction.
3.4.1 Pre-allocation
OptimVariable.sKyK is preallocated here, with its size taken from OptimVariable.grad.
public class PreallocateSkyk extends ComputeFunction {
private int numCorrections;
/**
 * prepare hessian matrix of lbfgs method. we allocate memory for sK, yK at the first iteration step.
*
* @param context context of iteration.
*/
@Override
public void calc(ComContext context) {
if (context.getStepNo() == 1) {
Tuple2<DenseVector, double[]> grad = context.getObj(OptimVariable.grad);
int size = grad.f0.size();
DenseVector[] sK = new DenseVector[numCorrections];
DenseVector[] yK = new DenseVector[numCorrections];
for (int i = 0; i < numCorrections; ++i) {
sK[i] = new DenseVector(size);
                yK[i] = new DenseVector(size);
            }
            context.putObj(OptimVariable.sKyK, Tuple2.of(sK, yK));
        }
    }
}
3.4.2 Computing direction
During the computation, historical Hessian information must be maintained. In the L-BFGS algorithm the complete H_k is no longer stored; instead, the vector sequences {s_k} and {y_k} are stored, and the effect of the matrix H_k is recomputed from them whenever it is needed. Nor must all of {s_k} and {y_k} be kept: only the vectors of the most recent m steps, which are used to fit the Hessian matrix.
The detailed principles and formulas are not repeated here; there are many good articles about them online.
It is important to note how the fields of dir are used:
dir = {Tuple2@9931} "(-9.599 -9.599 -3.5515274823133662 -3.5911942493220335, [4.0, 0.1])"
f0 = {DenseVector@9954} "-9.599 -9.599 -3.5515274823133662 -3.5911942493220335" // the gradient
f1 = {double[2]@9938}
0 = 4.0 // the weight
1 = 0.1 // the learning rate; 0.1 is the initial value, updated later in UpdateModel
The code summary is as follows:
@Override
public void calc(ComContext context) {
Tuple2 <DenseVector, double[]> grad = context.getObj(OptimVariable.grad);
Tuple2 <DenseVector, double[]> dir = context.getObj(OptimVariable.dir);
Tuple2 <DenseVector[], DenseVector[]> hessian = context.getObj(OptimVariable.sKyK);
int size = grad.f0.size();
// Gradarr is the result of CalcGradient in the previous stage
double[] gradarr = context.getObj(OptimVariable.gradAllReduce);
// The variables are:
gradarr = {double[5] @9962}
0 = -38.396
1 = -38.396
2 = -14.206109929253465
3 = -14.364776997288134
4 = 4.0
if (this.oldGradient == null) {
oldGradient = new DenseVector(size);
}
// Hessian is used as a queue to store only the latest M sK and yK
DenseVector[] sK = hessian.f0;
DenseVector[] yK = hessian.f1;
for (int i = 0; i < size; ++i) {
// gradarr[size] is the weight
grad.f0.set(i, gradarr[i] / gradarr[size]); //size = 4
}
// Assign the gradient, dividing each component by the weight
grad = {Tuple2@9930} "(-9.599 -9.599 -3.5515274823133662 -3.5911942493220335, [0.0])"
f0 = {DenseVector@9937} "-9.599 -9.599 -3.5515274823133662 -3.5911942493220335"
data = {double[4]@9963}
0 = -9.599
1 = -9.599
2 = -3.5515274823133662
3 = -3.5911942493220335
f1 = {double[1]@9961}
0 = 0.0
dir.f1[0] = gradarr[size]; // the weight
int k = context.getStepNo() - 1;
if (k == 0) { // First iteration
dir.f0.setEqual(grad.f0); // Gradient is assigned here
oldGradient.setEqual(grad.f0);
} else {
yK[(k - 1) % m].setEqual(grad.f0);
yK[(k - 1) % m].minusEqual(oldGradient);
oldGradient.setEqual(grad.f0);
}
// copy g_k and store in qL
dir.f0.setEqual(grad.f0); // Copy the gradient here
// The result:
dir = {Tuple2@9931} "(-9.599 -9.599 -3.5515274823133662 -3.5911942493220335, [4.0, 0.1])"
f0 = {DenseVector@9954} "-9.599 -9.599 -3.5515274823133662 -3.5911942493220335" // the gradient
f1 = {double[2]@9938}
0 = 4.0 // the weight
1 = 0.1 // the learning rate; 0.1 is the initial value
// compute H^-1 * g_k
int delta = k > m ? k - m : 0;
int l = k <= m ? k : m; // m = 10
if (alpha == null) {
alpha = new double[m];
}
// Two-loop recursion: apply the approximate inverse Hessian (built from sK and yK) to the gradient
for (int i = l - 1; i >= 0; i--) {
int j = (i + delta) % m;
double dot = sK[j].dot(yK[j]);
if (Math.abs(dot) > 0.0) {
double rhoJ = 1.0 / dot;
            alpha[i] = rhoJ * (sK[j].dot(dir.f0));   // compute alpha_i
            dir.f0.plusScaleEqual(yK[j], -alpha[i]); // correct the direction d
        }
    }
    for (int i = 0; i < l; i++) {
int j = (i + delta) % m;
double dot = sK[j].dot(yK[j]);
if (Math.abs(dot) > 0.0) {
double rhoJ = 1.0 / dot;
            double betaI = rhoJ * (yK[j].dot(dir.f0));        // multiply by rho
            dir.f0.plusScaleEqual(sK[j], (alpha[i] - betaI)); // correct the direction d
        }
    }
}

// The final result is stored in OptimVariable.dir
3.5 CalcLosses
The loss values are computed from the updated dir and the current coefficients; these losses are prepared for the subsequent line search. The goal: if the loss error falls within the allowed tolerance, stop the iteration; otherwise iterate again.
The basic logic of CalcLosses is as follows (see also the formula after this list):
- 1) Compute the basic step length: double beta = dir.f1[1] / numSearchStep; here dir.f1[1] is the learning rate maintained by UpdateModel, which adjusts it with, for example, dir.f1[1] *= 1.0 / (numSearchStep * numSearchStep); or dir.f1[1] = Math.min(dir.f1[1], numSearchStep);
- 2) Call calcSearchValues of the objective function to compute the losses corresponding to the current coefficients;
- 3) calcSearchValues loops over the input labelVectors and, following the line-search steps, accumulates for each labelVector: vec[i] += weight * this.unaryLossFunc.loss(etaCoef - i * etaDelta, labelVector.f1); inside the loop:
  - 3.1) compute y from the x-vector and coef: etaCoef = getEta(labelVector, coefVector);
  - 3.2) compute deltaY from the x-vector and dirVec: etaDelta = getEta(labelVector, dirVec) * beta;
  - 3.3) accumulate the loss for each line-search step: vec[i] += weight * this.unaryLossFunc.loss(etaCoef - i * etaDelta, labelVector.f1); through the loss function, (etaCoef - i * etaDelta) versus labelVector.f1 gives the deviation between the prediction on the training data and the actual label;
- 4) Prepare the data for the subsequent lossAllReduce aggregation.
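In formula form (matching the code shown next), the vector of candidate losses sampled along the descent direction d is:

$$\mathrm{vec}[i] = \sum_{(w,\,y,\,x)} w\cdot \ell\!\left(\eta_{\mathrm{coef}} - i\,\eta_{\Delta},\; y\right),\qquad \eta_{\mathrm{coef}} = x^{\top}\mathrm{coef},\quad \eta_{\Delta} = \beta\,(x^{\top}d),\quad i = 0,\dots,\mathrm{numSearchStep}$$

where ℓ is the unary loss (here the squared loss) and β = dir.f1[1] / numSearchStep.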
The code is as follows:
public class CalcLosses extends ComputeFunction {
@Override
public void calc(ComContext context) {
Iterable<Tuple3<Double, Double, Vector>> labledVectors = context.getObj(OptimVariable.trainData);
Tuple2<DenseVector, double[]> dir = context.getObj(OptimVariable.dir);
Tuple2<DenseVector, Double> coef = context.getObj(OptimVariable.currentCoef);
if (objFunc == null) {
objFunc = ((List<OptimObjFunc>)context.getObj(OptimVariable.objFunc)).get(0);
}
/** * calculate losses of current coefficient. * if optimizer is owlqn, constraint search will used, else without constraint. */
Double beta = dir.f1[1] / numSearchStep;
double[] vec = method.equals(OptimMethod.OWLQN) ?
objFunc.constraintCalcSearchValues(labledVectors, coef.f0, dir.f0, beta, numSearchStep)
: objFunc.calcSearchValues(labledVectors, coef.f0, dir.f0, beta, numSearchStep);
// The variables are:
dir = {Tuple2@9988} "(-9.599 -9.599 -3.5515274823133662 -3.5911942493220335, [4.0, 0.1])"
coef = {Tuple2@9989} "(0.001 0.0 0.0 0.0, 1.7976931348623157E308)"
beta = {Double@10014} 0.025
vec = {double[5] @10015}
0 = 0.0
1 = 0.0
2 = 0.0
3 = 0.0
4 = 0.0
// prepare buffer vec for allReduce.
double[] buffer = context.getObj(OptimVariable.lossAllReduce);
if (buffer == null) {
buffer = vec.clone();
context.putObj(OptimVariable.lossAllReduce, buffer);
} else {
            System.arraycopy(vec, 0, buffer, 0, vec.length);
        }
    }
}
The search values are provided by the unary objective function, which in turn calls the loss function.
public class UnaryLossObjFunc extends OptimObjFunc {
/**
* Calculate loss values for line search in optimization.
*
* @param labelVectors train data.
* @param coefVector coefficient of current time.
* @param dirVec descend direction of optimization problem.
* @param beta step length of line search.
* @param numStep num of line search step.
* @return double[] losses.
*/
@Override
public double[] calcSearchValues(Iterable<Tuple3<Double, Double, Vector>> labelVectors,
DenseVector coefVector,
DenseVector dirVec,
double beta,
int numStep) {
double[] vec = new double[numStep + 1];
        // each labelVector is a triplet: Tuple3<weight, label, feature vector>
labelVectors = {ArrayList@10007} size = 4
0 = {Tuple3@10027} "(1.0, 16.8, 1.0 1.0 1.4657097546055162 1.4770978917519928)"
1 = {Tuple3@10034} "(1.0, 6.7, 1.0 1.0 -0.338240712601273 -0.7385489458759964)"
2 = {Tuple3@10035} "(1.0, 6.9, 1.0 1.0 -0.7892283294029703 -0.3692744729379982)"
3 = {Tuple3@10036} "(1.0, 8.0, 1.0 1.0 -0.338240712601273 -0.3692744729379982)"
coefVector = {DenseVector@10008} "0.001 0.0 0.0 0.0"
dirVec = {DenseVector@10009} "-9.599 -9.599 -3.5515274823133662 -3.5911942493220335"
beta = 0.025
numStep = 4
vec = {double[5] @10026}
0 = 0.0
1 = 0.0
2 = 0.0
3 = 0.0
4 = 0.0
for (Tuple3<Double, Double, Vector> labelVector : labelVectors) {
double weight = labelVector.f0;
// Use x-vec and coef to calculate Y
double etaCoef = getEta(labelVector, coefVector);
// deltaY with x-vec and dirVec
double etaDelta = getEta(labelVector, dirVec) * beta;
weight = 1.0
etaCoef = 0.001
etaDelta = -0.7427013482280431
for (int i = 0; i < numStep + 1; ++i) {
                // labelVector.f1 is the label y
                // loss(etaCoef - i * etaDelta, labelVector.f1) measures the deviation between the prediction and the actual label
                vec[i] += weight * this.unaryLossFunc.loss(etaCoef - i * etaDelta, labelVector.f1);
            }
        }
        return vec;
}
private double getEta(Tuple3<Double, Double, Vector> labelVector, DenseVector coefVector) {
        // labelVector.f2 = {DenseVector@9972} "1.0 1.0 1.4657097546055162 1.4770978917519928"
return MatVecOp.dot(labelVector.f2, coefVector);
}
}
vec = {double[5] @10026}
0 = 219.33160199999998
1 = 198.85962729259512
2 = 179.40202828917856
3 = 160.95880498975038
4 = 143.52995739431051
For review, here is the loss function:
/**
 * Squared loss function.
 * https://en.wikipedia.org/wiki/Loss_functions_for_classification#Square_loss
 */
public class SquareLossFunc implements UnaryLossFunc {
    public SquareLossFunc() {}

    @Override
public double loss(double eta, double y) {
return 0.5 * (eta - y) * (eta - y);
}
@Override
public double derivative(double eta, double y) {
return eta - y;
}
@Override
public double secondDerivative(double eta, double y) {
        return 1;
    }
}
3.6 UpdateModel
This module does two things:
- Update the coefficients with dir and the step length, i.e., compute the new coefficients from the direction and the step size.
  - Simplified, the update formula is: parameter at the next moment = parameter at the previous moment − learning rate × (derivative of the loss function with respect to that parameter), i.e., w_{t+1} = w_t − η · ∂J/∂w.
- Judge whether the iteration has converged.
Because there are so many variables that it is easy to forget which is which, let's label them again:
- OptimVariable.dir is the gradient computed by CalcGradient, after the corrections applied by CalDirection;
- OptimVariable.lossAllReduce changes over time; at this point it holds the losses computed in the previous phase.
The code logic is roughly as follows:
- 1) Find the position pos of the smallest of the "latest losses";
- 2) Compute the basic step length beta = dir.f1[1] / numSearchStep (dir.f1[1] is the learning rate);
- 3) Compare the "latest losses" against the previous minimum (last loss value), with three cases:
  - 3.1) If all losses are greater than the last loss value:
    - 3.1.1) reduce the learning rate: multiply it by 1.0 / (numSearchStep * numSearchStep);
    - 3.1.2) set eta to 0; this eta is the step size;
    - 3.1.3) set the current loss to the last loss value;
  - 3.2) If losses[numSearchStep] is smaller than the last loss value:
    - 3.2.1) enlarge the learning rate: multiply it by numSearchStep;
    - 3.2.2) set eta to the smallest-value position: eta = beta * pos; this eta is the step size;
    - 3.2.3) set the current loss to the smallest loss, losses[numSearchStep];
  - 3.3) Otherwise:
    - 3.3.1) leave the learning rate unchanged;
    - 3.3.2) set eta to the smallest-value position: eta = beta * pos; this eta is the step size;
    - 3.3.3) set the current loss to the smallest loss, losses[pos];
- 4) Update the data for the Hessian approximation (the sK vectors);
- 5) Update the coefficient vector with the direction vector and the step size: curCoef.f0.plusScaleEqual(dir.f0, -eta);
- 6) If the current loss is smaller than the min loss, update the min loss with the current loss:
  - 6.1) minCoef.f1 = curCoef.f1; update the minimum loss;
  - 6.2) minCoef.f0.setEqual(curCoef.f0); update the coef corresponding to the minimum loss, i.e., the coefficients the linear model finally needs.
In the step-size selection here I do not see the Wolfe-Powell rule being used; Alink seems to have made an optimization of its own. If you can see where Wolfe-Powell is applied, please leave a comment, thank you.
The code is as follows:
public class UpdateModel extends ComputeFunction {
@Override
public void calc(ComContext context) {
double[] losses = context.getObj(OptimVariable.lossAllReduce);
Tuple2<DenseVector, double[]> dir = context.getObj(OptimVariable.dir);
Tuple2<DenseVector, double[]> pseGrad = context.getObj(OptimVariable.pseGrad);
Tuple2<DenseVector, Double> curCoef = context.getObj(OptimVariable.currentCoef);
Tuple2<DenseVector, Double> minCoef = context.getObj(OptimVariable.minCoef);
double lossChangeRatio = 1.0;
double[] lossCurve = context.getObj(OptimVariable.lossCurve);
for (int j = 0; j < losses.length; ++j) {
            losses[j] /= dir.f1[0]; // dir.f1[0] is the weight
}
int pos = -1;
//get the min value of losses, and remember the position.
for (int j = 0; j < losses.length; ++j) {
if (losses[j] < losses[0]) {
                losses[0] = losses[j];
                pos = j;
            }
        }

        // adaptive learningRate strategy
double beta = dir.f1[1] / numSearchStep;
double eta;
// The variables are as follows
losses = {double[5] @10001}
0 = 35.88248934857763
1 = 49.71490682314878
2 = 44.85050707229464
3 = 40.239701247437594
4 = 35.88248934857763
dir = {Tuple2@10002} "(-9.599 -9.599 -3.5515274823133662 -3.5911942493220335, [4.0, 0.1])"
pseGrad = null
curCoef = {Tuple2@10003} "(0.001 0.0 0.0 0.0, 1.7976931348623157E308)"
minCoef = {Tuple2@10004} "(0.001 0.0 0.0 0.0, 1.7976931348623157E308)"
lossChangeRatio = 1.0
lossCurve = {double[100] @10005}
pos = 4
beta = 0.025
curCoef.f1 = {Double@10006} 1.7976931348623157E308
if (pos == -1) {
        /**
         * If all losses are larger than the last loss value, we do the following:
         * 1. reduce the learning rate by multiplying it by 1.0 / (numSearchStep * numSearchStep);
         * 2. set eta to zero;
         * 3. set the current loss equal to the last loss value.
         */
        eta = 0;
        dir.f1[1] *= 1.0 / (numSearchStep * numSearchStep); // the learning rate
        curCoef.f1 = losses[0]; // the minimum loss
} else if (pos == numSearchStep) {
        /**
         * If losses[numSearchStep] is smaller than the last loss value, we do the following:
         * 1. enlarge the learning rate by multiplying it by numSearchStep;
         * 2. set eta to the smallest-value position pos;
         * 3. set the current loss equal to the smallest loss value.
         */
        eta = beta * pos;
        dir.f1[1] *= numSearchStep;
        dir.f1[1] = Math.min(dir.f1[1], numSearchStep); // the learning rate
        lossChangeRatio = Math.abs((curCoef.f1 - losses[pos]) / curCoef.f1);
        curCoef.f1 = losses[numSearchStep]; // the minimum loss
// The current values:
numSearchStep = 4 // line-search parameter: the number of search values per step
lossChangeRatio = 1.0
pos = 4
eta = 0.1
curCoef.f1 = {Double@10049} 35.88248934857763
dir.f1[1] = 0.4
} else {
        /**
         * Otherwise:
         * 1. the learning rate is not changed;
         * 2. set eta to the smallest-value position pos;
         * 3. set the current loss equal to the smallest loss value.
         */
eta = beta * pos;
lossChangeRatio = Math.abs((curCoef.f1 - losses[pos]) / curCoef.f1);
curCoef.f1 = losses[pos]; / / the minimum loss
}
/* update loss value in loss curve at this step */
lossCurve[context.getStepNo() - 1] = curCoef.f1;
lossCurve = {double[100] @9998}
0 = 35.88248934857763
1 = Infinity
    if (method.equals(OptimMethod.OWLQN)) {
        ...
    } else if (method.equals(OptimMethod.LBFGS)) {
Tuple2<DenseVector[], DenseVector[]> sKyK = context.getObj(OptimVariable.sKyK);
int size = dir.f0.size();
int k = context.getStepNo() - 1;
DenseVector[] sK = sKyK.f0;
for (int s = 0; s < size; ++s) {
            // update sK, the data for the Hessian approximation
sK[k % OptimVariable.numCorrections].set(s, dir.f0.get(s) * (-eta));
}
curCoef.f0.plusScaleEqual(dir.f0, -eta); // Update the coefficient vector with the direction vector and step size
sKyK = {Tuple2@10043} "([0.9599000000000001 0.9599000000000001 0.35515274823133663 0.3591194249322034, 0.0 0.0 0.0 0.0, 0.0 0.0 0.0, 0.0 0.0 0.0 0.0 0.0 0.0 0.0, 0.0, 0.0, 0.0, 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0, 0.0 0.0 0.0 0.0], [0.0, 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0, 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0, and 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0])"
f0 = {DenseVector[10] @10044}
0 = {DenseVector@10074} "0.9599000000000001 0.9599000000000001 0.35515274823133663 0.3591194249322034"
}
    /**
     * If the current loss is smaller than the min loss, then update the min loss
     * and the min coefficient with the current values.
     */
if (curCoef.f1 < minCoef.f1) {
        minCoef.f1 = curCoef.f1; // the minimum loss
        minCoef.f0.setEqual(curCoef.f0); // the coef for the minimum loss, i.e., the coefficients the linear model finally needs
}
curCoef = {Tuple2@9996} "(0.9609000000000001 0.9599000000000001 0.35515274823133663 0.3591194249322034, 35.88248934857763)"
f0 = {DenseVector@10059} "0.9609000000000001 0.9599000000000001 0.35515274823133663 0.3591194249322034"
f1 = {Double@10048} 35.88248934857763
minCoef = {Tuple2@9997} "(0.9609000000000001 0.9599000000000001 0.35515274823133663 0.3591194249322034, 35.88248934857763)"
f0 = {DenseVector@10059} "0.9609000000000001 0.9599000000000001 0.35515274823133663 0.3591194249322034"
f1 = {Double@10048} 35.88248934857763
        // judge the convergence of the iteration.
        filter(dir, curCoef, minCoef, context, lossChangeRatio);
    }
}
filter judges whether the iteration has converged; the log messages inside it describe the logic clearly.
/**
 * judge the convergence of the iteration.
 */
public void filter(Tuple2<DenseVector, double[]> grad,
Tuple2<DenseVector, Double> c,
Tuple2<DenseVector, Double> m,
ComContext context,
double lossChangeRatio) {
double epsilon = params.get(HasEpsilonDv0000001.EPSILON);
int maxIter = params.get(HasMaxIterDefaultAs100.MAX_ITER);
double gradNorm = ((Tuple2<DenseVector, double[]>)context.getObj(gradName)).f0.normL2();
if (c.f1 < epsilon || gradNorm < epsilon) {
printLog(" method converged at step : ", c.f1, m.f1, grad.f1[1], gradNorm, context, lossChangeRatio);
grad.f1[0] = -1.0;
} else if (context.getStepNo() > maxIter - 1) {
printLog(" method stop at max step : ", c.f1, m.f1, grad.f1[1], gradNorm, context, lossChangeRatio);
grad.f1[0] = -1.0;
} else if (grad.f1[1] < EPS) {
printLog(" learning rate is too small, method stops at step : ", c.f1, m.f1, grad.f1[1], gradNorm,
context, lossChangeRatio);
grad.f1[0] = -1.0;
} else if (lossChangeRatio < epsilon && gradNorm < Math.sqrt(epsilon)) {
printLog(" loss change ratio is too small, method stops at step : ", c.f1, m.f1, grad.f1[1], gradNorm,
context, lossChangeRatio);
grad.f1[0] = -1.0;
} else {
printLog(" method continue at step : ", c.f1, m.f1, grad.f1[1], gradNorm, context, lossChangeRatio); }}Copy the code
3.7 OutputModel
closeWith(new OutputModel()) runs at the end of each iteration: it outputs the model so far, converting the data into Flink's common Row type, i.e., transferring the state into model rows.
public class OutputModel extends CompleteResultFunction {
@Override
public List <Row> calc(ComContext context) {
// get the coefficient of min loss.
Tuple2 <DenseVector, double[]> minCoef = context.getObj(OptimVariable.minCoef);
double[] lossCurve = context.getObj(OptimVariable.lossCurve);
int effectiveSize = lossCurve.length;
for (int i = 0; i < lossCurve.length; ++i) {
if (Double.isInfinite(lossCurve[i])) {
effectiveSize = i;
                break;
            }
        }
        double[] effectiveCurve = new double[effectiveSize];
System.arraycopy(lossCurve, 0, effectiveCurve, 0, effectiveSize);
Params params = new Params();
        params.set(ModelParamName.COEF, minCoef.f0); // the key point: minCoef is what we really need
params.set(ModelParamName.LOSS_CURVE, effectiveCurve);
List <Row> model = new ArrayList <>(1);
model.add(Row.of(params.toJson()));
        return model;
    }
}
0x04 Preparing Model Metadata
I set the parallelism here to 1.
// Prepare the meta info of linear model.
DataSet<Params> meta = labelInfo.f0
.mapPartition(new CreateMeta(modelName, linearModelType, isRegProc, params))
.setParallelism(1);
The specific code:
public static class CreateMeta implements MapPartitionFunction<Object, Params> {
@Override
public void mapPartition(Iterable<Object> rows, Collector<Params> metas) throws Exception {
Object[] labels = null;
if (!this.isRegProc) {
labels = orderLabels(rows);
}
Params meta = new Params();
meta.set(ModelParamName.MODEL_NAME, this.modelName);
meta.set(ModelParamName.LINEAR_MODEL_TYPE, this.modelType);
meta.set(ModelParamName.LABEL_VALUES, labels);
            meta.set(ModelParamName.HAS_INTERCEPT_ITEM, this.hasInterceptItem);
            meta.set(ModelParamName.VECTOR_COL_NAME, vectorColName);
            meta.set(LinearTrainParams.LABEL_COL, labelName);
            metas.collect(meta);
        }
    }

// The variable is:
meta = {Params@9667} "Params {hasInterceptItem=true, vectorColName=null, modelName="Linear Regression", labelValues=null, labelCol="label", linearModelType="LinearReg"}"
0x05 Model Building
When the iterations have finished, Alink builds the model from the coef data.
/**
 * build the linear model rows, the format to be output.
 */
public static class BuildModelFromCoefs extends AbstractRichFunction implements
        MapPartitionFunction<Tuple2<DenseVector, double[]>, Row> {
@Override
    public void mapPartition(Iterable<Tuple2<DenseVector, double[]>> iterable, Collector<Row> collector) throws Exception {
for (Tuple2<DenseVector, double[]> coefVector : iterable) {
LinearModelData modelData = buildLinearModelData(meta,
featureNames,
labelType,
meanVar,
hasIntercept,
standardization,
coefVector);
            new LinearModelDataConverter(this.labelType).save(modelData, collector);
        }
    }
}
So for f(x) = w^T x + b, coef holds w and b; it is the final result of the computation.
modelData = {LinearModelData@10584}
featureNames = {String[3] @9787}
featureTypes = null
vectorColName = null
coefVector = {DenseVector@10485} "-3.938937407856857 4.799499941426075 0.8929571907809862 1.078169576770847"
coefVectors = null
vectorSize = 3
modelName = "Linear Regression"
labelName = null
labelValues = null
linearModelType = {LinearModelType@4674} "LinearReg"
hasInterceptItem = true
lossCurve = {double[12] @10593}
0 = 35.88248934857763
1 = 12.807366842002144
2 = 0.5228366663917704
3 = 0.031112070740366038
4 = 0.01098914933042993
5 = 0.009765757443537283
6 = 0.008750523231785415
7 = 0.004210085397869248
8 = 0.0039042232755530704
9 = 0.0038821509860327537
10 = 0.003882042680010676
11 = 0.0038820422536391033
labelType = {FractionalTypeInfo@9790} "Double"
0x06 Prediction with the Model
Prediction uses LinearModelMapper. Some of its internal variables are printed below; you can see the model data in them.
this = {LinearModelMapper@10704}
vectorColIndex = -1
model = {LinearModelData@10597}
featureNames = {String[3] @10632}
featureTypes = null
vectorColName = null
coefVector = {DenseVector@10612} "-3.938937407856857 4.799499941426075 0.8929571907809862 1.078169576770847"
coefVectors = null
vectorSize = 0
modelName = "Linear Regression"
labelName = null
labelValues = {Object[0] @10613}
linearModelType = {LinearModelType@10611} "LinearReg"
hasInterceptItem = true
lossCurve = null
labelType = null
The specific prediction happens in LinearModelMapper.predict, as follows:
- The corresponding raw input row is Row.of("$3$0:1.0 1:7.0 2:9.0", "1.0 7.0 9.0", 1.0, 7.0, 9.0, 16.8).
- After processing by FeatureLabelUtil.getFeatureVector, the resulting quadruple is "1.0 1.0 7.0 9.0". The leading 1.0 is a fixed value set specially via aVector.set(0, 1.0): if the model is f(x) = Ax + b, the constant 1.0 is what the initialization of b was multiplied by, and the optimization produced b, so a 1.0 is still needed at prediction time.
- The model coefficients are "-3.938937407856857 4.799499941426075 0.8929571907809862 1.078169576770847".
- The dot product of the quadruple with the model coefficients is dotValue = 16.814789059973744.
So you can see how the model coefficients are used.
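Checking the arithmetic with the coefficient values from the data dump below (note that the first coefficient is negative):

$$1.0 \times (-3.9389) + 1.0 \times 4.7995 + 7.0 \times 0.8930 + 9.0 \times 1.0782 \approx 16.8148$$

which matches dotValue = 16.814789059973744 exactly.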
public Object predict(Vector vector) throws Exception {
double dotValue = MatVecOp.dot(vector, model.coefVector);
switch (model.linearModelType) {
case LR:
case SVM:
case Perceptron:
return dotValue >= 0 ? model.labelValues[0] : model.labelValues[1];
case LinearReg:
case SVR:
return dotValue;
            default:
                // fallback added so this excerpt compiles; not part of the quoted Alink source
                throw new UnsupportedOperationException("unsupported linear model type");
        }
    }
vector = {DenseVector@10610} "1.0 1.0 7.0 9.0"
data = {double[4] @10619}
0 = 1.0
1 = 1.0
2 = 7.0
3 = 9.0
model.coefVector = {DenseVector@10612} "-3.938937407856857 4.799499941426075 0.8929571907809862 1.078169576770847"
data = {double[4] @10620}
0 = -3.938937407856857
1 = 4.799499941426075
2 = 0.8929571907809862
3 = 1.078169576770847
0xFF References
www.mamicode.com/info-detail,…).
Machine learning Algorithm implementation analysis — L-BFGS algorithm of LiBLBFGS
BFGS & L-BFGS in depth machine learning series
Derivation of Quasi-Newton method formula and Python Code Implementation (PART 1)
Easy to understand — Taylor’s expansion
Taylor Expansion – Taylor Expansion
From Newton interpolation to Taylor’s formula
How to better understand and remember Taylor’s expansion?
Optimization Algorithm — Newton Method
Optimization algorithm – L-BFGS algorithm of quasi-Newton method
L-BFGS algorithm
Distributed Machine Learning: Algorithms, Theory and Practice, Liu Tie-Yan
LogisticRegressionWithLBFGS – logistic regression
Logistic regression optimization method -L-BFGS
Machine learning and Operations Research Optimization (III) from Newton method to L-BFGS
A popular explanation of Newton’s convex optimization in machine learning
In-depth Machine Learning Series 17-BFGS & L-BFGS
Basic series of optimization methods – Precise one-dimensional search techniques
Basic series of optimization methods – Imprecise one-dimensional search techniques
Explain the Armijo-Goldstein criterion and Wolfe-Powell criterion in the search of imprecise lines in “human language”
www.zhihu.com/question/36…
One-dimensional search algorithm and its implementation
Numerical optimization | notes (1) — introduction, line search: step length selection conditions
zhuanlan.zhihu.com/p/32821110
Line search (a) : step length selection
Optimization Problem — One-dimensional Search (II)
CRF L-BFGS Line Search principle and code analysis
Step size and learning rate
Machine learning optimization algorithm L-BFGS and its distributed implementation
L-bfgs algorithm (default optimization algorithm of logistic regression)
The Principle and Practice of Linear Regression (Newton method)
Linear Regression, Gradient Descent
L – BFGS algorithm
Parallel logistic regression