Alink: Data preprocessing for linear regression implementation
0x00 Summary
Alink is a new-generation machine learning algorithm platform developed by Alibaba on top of the real-time compute engine Flink. It is the first machine learning platform in the industry to support both batch and streaming algorithms. This article and the ones that follow show how linear regression is implemented in Alink, and can hopefully serve as a roadmap for reading the linear regression code.
Since there is little public information about Alink, the following is based on my own reading and speculation, and there are bound to be omissions. If you spot any, please point them out and I will update the article.
This series contains ten articles; comments and corrections are welcome.
0x01 Concepts
1.1 Linear regression
Linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the interdependent quantitative relationship between two or more variables; it is very widely used. The model is expressed as y = w'x + e, where the error e follows a normal distribution with mean 0.
In linear regression, the target value is linearly correlated with the features, so the model is a linear equation: a multivariate equation of the first degree.
Basic form: given an example x = (x1; x2; …; xd) described by d attributes, a linear model tries to learn a function that predicts by a linear combination of the attributes, namely:

f(x) = w1*x1 + w2*x2 + … + wd*xd + b

where w is the parameter, also known as the weight, which can be understood as the influence of x1, x2, …, xd on f(x).

The general (vector) form is:

f(x) = wᵀx + b

If we use this formula to predict f(x), we know x, but we don't know w and b; once we solve for w and b, the model is determined, and we can make predictions with it.
So how do we solve for the optimal w and b from the training data? The key is how to measure the difference between f(x) and y. This brings in another concept: the loss function.
1.2 Optimization Model
Given a model f(x), how can we tell whether it is good? One qualitative measure is the empirical risk: the sum E(x) of the model f's errors over all training samples.

We train the model by minimizing the empirical loss on the training set. In other words, by adjusting the parameters w of f, the empirical risk E(x) decreases continuously, and when it finally reaches its minimum we obtain an "optimal" model.

However, by the definition above, E(x) is a sum of indicator functions and is therefore discontinuous and non-differentiable, which makes it hard to optimize. To solve this problem, the concept of a "loss function" was introduced. A loss function is a function that has a definite relationship with the error function (for example, it may be an upper bound of it), but has better mathematical properties (continuity, differentiability, convexity, etc.) and is easier to optimize. We then optimize the loss function instead.

If the loss function is continuously differentiable, we can use first-order algorithms such as gradient descent, as well as second-order algorithms such as Newton's method and quasi-Newton methods. When the optimization algorithm converges, we get a good model; if the loss function is also convex, we get the optimal model.
Typical optimization methods:
| | First-order algorithms | Second-order algorithms |
|---|---|---|
| Deterministic algorithms | gradient descent; projected subgradient descent; proximal gradient descent; Frank-Wolfe algorithm; Nesterov's accelerated algorithm; coordinate descent; dual coordinate ascent | Newton's method; quasi-Newton methods |
| Stochastic algorithms | stochastic gradient descent; stochastic coordinate descent; stochastic dual coordinate ascent; stochastic variance-reduced gradient | stochastic quasi-Newton methods |
So the procedure for optimizing a LinearRegression model f must be: determine the loss function, train with x and y as inputs to minimize the loss function, and thereby determine the parameters w of f. The process is roughly as follows:
- Process the input, converting x and y into the format the algorithm needs.
- Find an appropriate prediction function, generally written as h; this is the function we need to learn, used to predict the outcome for the input data.
- Construct a cost function (loss function) representing the deviation between the predicted output h and the training label y; it can be the difference (h − y) or some other form. Considering the "loss" over all training data, the sum or average of the cost is denoted **J(θ)**, representing the deviation between the predicted values and the actual values over all training data.
- Clearly, the smaller the value of J(θ), the more accurate the prediction function h is, so all we need to do in this step is find the minimum of J(θ). Note that the loss function is a function of θ! In other words, θ is no longer a parameter of the model output but the independent variable of the loss function!
- Prepare the model metadata and build the model.
1.3 Loss function & objective function
To summarize:
- Loss function: calculates the error of a sample;
- Cost function: the average of all sample errors over the entire training set, often conflated with the loss function;
- Objective function: cost function + regularization term;
To elaborate:
Suppose we use f(X) to fit the true value Y. The output f(X) may or may not equal the true value Y; to express how well we fit, we use a function to measure the degree of fit. This function is called the loss function, or cost function.
The loss function measures how well the algorithm is doing: the inconsistency between the model's predicted value and the true value. It is a non-negative real-valued function, usually written L(Y, f(x)). The smaller the loss function, the more robust the model. The loss function is the core part of the empirical risk function.
The objective function is a related but broader concept: it is what we actually minimize, and the loss function is the part of it that measures the fit to the data.
This is because f(x) may over-learn the historical data and then perform poorly when actually predicting, a condition known as over-fitting: the learned function is too complicated. So we must minimize not only the empirical risk but also the structural risk. For this purpose a function J(x) is defined that measures the complexity of the model; in machine learning this is called regularization, and the L1 and L2 norms are commonly used.
The essence of L1 regularization is to add to the model the prior knowledge that the model parameters follow a zero-mean Laplace distribution.

The essence of L2 regularization is to add to the model the prior knowledge that the model parameters follow a zero-mean normal distribution.
L1 regularization adds the sum of the absolute values of all weight parameters w, and forces more of the w to be exactly zero, i.e., to become sparse (L2 does not do this as strongly, because its derivative also approaches 0 near zero). L1 regularization was introduced to fulfill the glorious mission of automatic feature selection: it learns to remove useless features by resetting the weights corresponding to those features to 0.

L2 regularization adds the sum of the squares of all weight parameters w, forcing all w to be as close to zero as possible without actually reaching zero (the derivative of L2 goes to zero near zero). Without L2 regularization, the fitted function tries to pass close to every point, and the resulting function fluctuates greatly: in some very small intervals the function value changes dramatically, which means some w values are very large. Adding L2 regularization penalizes this tendency of the weights to grow.
At this point we can say that our final optimization problem is min L(Y, f(x)) + J(x), i.e., optimizing both the empirical risk and the structural risk; this function is called the objective function.
In regression problems, the squared-error cost function (least-squares linear regression) is commonly used, and the optimal solution is obtained by optimizing the objective function. The loss function is the squared loss function.
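To make this concrete, here is a minimal sketch (class and method names are mine, not Alink's) of computing such an objective for a linear model: the average squared loss plus L1 and L2 penalty terms.

```java
// A minimal sketch of the objective min L(Y, f(x)) + J(x):
// average squared loss plus L1/L2 penalties. Illustrative only, not Alink code.
public class ObjectiveSketch {
    public static double objective(double[][] x, double[] y, double[] w,
                                   double l1, double l2) {
        double loss = 0.0;
        for (int i = 0; i < x.length; i++) {
            double pred = 0.0;
            for (int j = 0; j < w.length; j++) {
                pred += w[j] * x[i][j];   // linear model f(x) = w'x
            }
            loss += 0.5 * (pred - y[i]) * (pred - y[i]);  // squared loss
        }
        loss /= x.length;                 // empirical risk
        double reg = 0.0;
        for (double wj : w) {
            reg += l1 * Math.abs(wj) + l2 * wj * wj;  // structural risk J(x)
        }
        return loss + reg;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 2}, {1, 3}};
        double[] y = {5, 7};
        double[] w = {1, 2};              // fits this data exactly, so loss = 0
        System.out.println(objective(x, y, w, 0.1, 0.1));
    }
}
```

With a perfect fit, only the regularization terms remain, which shows why larger weights are penalized even when the data is matched exactly.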
1.4 Least square method
Mean squared error is the most commonly used performance measure in regression tasks, so we can try to minimize it. Solving for the model by minimizing the mean squared error is called the "least squares method". In linear regression, the least squares method looks for the line that minimizes the sum of the squared Euclidean distances from all samples to the line. So the loss function in linear regression is the squared loss function.
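As a hand-rolled illustration of the least squares idea (not Alink code), the one-dimensional case y = w*x + b has a closed-form solution obtained by setting the partial derivatives of the squared error to zero:

```java
// Closed-form simple least squares: fit y = w * x + b by minimizing
// the sum of squared errors. Names are illustrative, not Alink's.
public class LeastSquaresSketch {
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;                 // means of x and y
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            den += (x[i] - mx) * (x[i] - mx);
        }
        double w = num / den;             // slope minimizing squared error
        double b = my - w * mx;           // intercept
        return new double[]{w, b};
    }

    public static void main(String[] args) {
        // points on the line y = 2x + 1 recover w = 2, b = 1 exactly
        double[] wb = fit(new double[]{1, 2, 3}, new double[]{3, 5, 7});
        System.out.println(wb[0] + " " + wb[1]);
    }
}
```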
With these basic concepts in mind, let’s start analyzing the Alink code.
0x02 Sample code
First, we give an example of linear regression.
public class LinearRegressionExample {
    static Row[] vecrows = new Row[] {
        Row.of("$3$0:1.0 1:7.0 2:9.0", "1.0 7.0 9.0", 1.0, 7.0, 9.0, 16.8),
        Row.of("$3$0:1.0 1:3.0 2:3.0", "1.0 3.0 3.0", 1.0, 3.0, 3.0, 6.7),
        Row.of("$3$0:1.0 1:2.0 2:4.0", "1.0 2.0 4.0", 1.0, 2.0, 4.0, 6.9),
        Row.of("$3$0:1.0 1:3.0 2:4.0", "1.0 3.0 4.0", 1.0, 3.0, 4.0, 8.0)
    };
    static String[] veccolNames = new String[] {"svec", "vec", "f0", "f1", "f2", "label"};
    static BatchOperator vecdata = new MemSourceBatchOp(Arrays.asList(vecrows), veccolNames);
    static StreamOperator svecdata = new MemSourceStreamOp(Arrays.asList(vecrows), veccolNames);

    public static void main(String[] args) throws Exception {
        String[] xVars = new String[] {"f0", "f1", "f2"};
        String yVar = "label";
        String vec = "vec";
        String svec = "svec";
        LinearRegression linear = new LinearRegression()
            .setLabelCol(yVar)          // We set all the variables here and use them later
            .setFeatureCols(xVars)
            .setPredictionCol("linpred");
        Pipeline pl = new Pipeline().add(linear);
        PipelineModel model = pl.fit(vecdata);
        BatchOperator result = model.transform(vecdata).select(
            new String[] {"label", "linpred"});
        List<Row> data = result.collect();
    }
}
The output is
svec|vec|f0|f1|f2|label|linpred
----|---|--|--|--|-----|-------
$3$0:1.0 1:7.0 2:9.0|1.0 7.0 9.0|1.0000|7.0000|9.0000|16.8000|16.8148
$3$0:1.0 1:3.0 2:4.0|1.0 3.0 4.0|1.0000|3.0000|4.0000|8.0000|7.8521
$3$0:1.0 1:3.0 2:3.0|1.0 3.0 3.0|1.0000|3.0000|3.0000|6.7000|6.7739
$3$0:1.0 1:2.0 2:4.0|1.0 2.0 4.0|1.0000|2.0000|4.0000|6.9000|6.959
As stated above, regression problems commonly use the squared-error (least-squares) cost function and obtain the optimal solution by optimizing the objective function; the loss function is the squared loss function.

In Alink, the optimizer is the L-BFGS algorithm (a quasi-Newton method), the objective function is UnaryLossObjFunc, the loss function is SquareLossFunc, and the overall training logic of linear regression is in LinearRegTrainBatchOp. Let's go through them one by one.
0x03 Overview
LinearRegression uses LinearRegTrainBatchOp for training, and the base class of LinearRegTrainBatchOp is BaseLinearModelTrainBatchOp, so let's look at BaseLinearModelTrainBatchOp.
public class LinearRegression extends Trainer<LinearRegression, LinearRegressionModel>
        implements LinearRegTrainParams<LinearRegression>, LinearRegPredictParams<LinearRegression> {

    @Override
    protected BatchOperator train(BatchOperator in) {
        return new LinearRegTrainBatchOp(this.getParams()).linkFrom(in);
    }
}
The linkFrom code of BaseLinearModelTrainBatchOp is as follows; the comments lay out the logic clearly.
The steps are roughly:

- obtain algorithm parameters and label information;
- transform the data into Tuple3<weight, label, feature vector> format;
- get statistics, such as vector size, mean and variance;
- standardize and intercept the training data;
- use the L-BFGS algorithm to minimize the loss function and optimize the model;
- prepare the model metadata;
- build the model.
public T linkFrom(BatchOperator<?>... inputs) {
    BatchOperator<?> in = checkAndGetFirst(inputs);
    // Get parameters of this algorithm.
    Params params = getParams();
    // Get type of processing: regression or not
    boolean isRegProc = getIsRegProc(params, linearModelType, modelName);
    // Get label info : including label values and label type.
    Tuple2<DataSet<Object>, TypeInformation> labelInfo = getLabelInfo(in, params, isRegProc);
    // Transform data to Tuple3 format: <weight, label, feature vector>.
    DataSet<Tuple3<Double, Double, Vector>> initData = transform(in, params, labelInfo.f0, isRegProc);
    // Get statistics variables : including vector size, mean and variance of train data.
    Tuple2<DataSet<Integer>, DataSet<DenseVector[]>> statInfo =
        getStatInfo(initData, params.get(LinearTrainParams.STANDARDIZATION));
    // Do standardization and interception to train data.
    DataSet<Tuple3<Double, Double, Vector>> trainData = preProcess(initData, params, statInfo.f1);
    // Solve the optimization problem.
    DataSet<Tuple2<DenseVector, double[]>> coefVectorSet = optimize(params, statInfo.f0,
        trainData, linearModelType, MLEnvironmentFactory.get(getMLEnvironmentId()));
    // Prepare the meta info of linear model.
    DataSet<Params> meta = labelInfo.f0
        .mapPartition(new CreateMeta(modelName, linearModelType, isRegProc, params))
        .setParallelism(1);
    // Build linear model rows, the format to be output.
    DataSet<Row> modelRows;
    String[] featureColTypes = getFeatureTypes(in, params.get(LinearTrainParams.FEATURE_COLS));
    modelRows = coefVectorSet
        .mapPartition(new BuildModelFromCoefs(labelInfo.f1,
            params.get(LinearTrainParams.FEATURE_COLS),
            params.get(LinearTrainParams.STANDARDIZATION),
            params.get(LinearTrainParams.WITH_INTERCEPT), featureColTypes))
        .withBroadcastSet(meta, META)
        .withBroadcastSet(statInfo.f1, MEAN_VAR)
        .setParallelism(1);
    // Convert the model rows to table.
    this.setOutput(modelRows, new LinearModelDataConverter(labelInfo.f1).getModelSchema());
    return (T) this;
}
We will refine this logic later.
0x04 Basic Functions
Let’s first introduce the related basic functions and related concepts, such as loss function, objective function, gradient, etc.
4.1 Loss function
The loss function involves several concepts.
4.1.1 Derivatives and Partial derivatives
A derivative is itself a function: the rate of change of a function at each position. It represents the ratio of the change in the function value to the change in the independent variable, as the latter tends to zero. Geometrically, it is the slope of the tangent at the point; physically, it is the (instantaneous) rate of change at that moment.

The derivative is the rate of change of the function y = f(x) at a point, along the positive x-axis. Intuitively, at some point on the x-axis, if f′(x) > 0 then the value of f(x) tends to increase along the positive x-axis at that point; if f′(x) < 0, it tends to decrease.

The one-variable derivative represents the rate of change (slope) of a function f(x) with respect to the independent variable x near a certain point.

What about a function of several variables? That is the partial derivative. A partial derivative is the derivative of a multivariate function after it has "degenerated" into a function of one variable: the values of the other variables are fixed and only one variable is kept; keeping each variable in turn, an n-variable function has n partial derivatives. The partial derivative is the derivative (slope of the tangent) of the function along each coordinate axis at a given position; the partial derivative of a two-variable function represents the rate of change of f(x, y) with respect to x (or y) near a certain point.
4.1.2 Directional derivatives
In the definitions of the derivative and partial derivative, the rate of change is discussed along the coordinate axes. When we discuss the rate of change of a function in an arbitrary direction, we arrive at the definition of the directional derivative: the derivative of the function at a point along some approaching direction.

The directional derivative is the inner product of the vector of partial derivatives and the direction vector. In essence it is a scalar, simply defined as the rate of change of the function along a specified direction.
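As an illustration (my own helper, not Alink code): for f(x, y) = x² + y² the gradient is (2x, 2y), and the directional derivative along a direction (ux, uy) is the inner product of the gradient with the normalized direction vector.

```java
// Directional derivative of f(x, y) = x^2 + y^2 at (px, py) along (ux, uy):
// grad(f) = (2x, 2y), D_u f = grad . u for the unit vector u. Illustrative only.
public class DirectionalDerivativeDemo {
    public static double directional(double px, double py, double ux, double uy) {
        double norm = Math.sqrt(ux * ux + uy * uy);  // normalize the direction
        return 2.0 * px * (ux / norm) + 2.0 * py * (uy / norm);
    }

    public static void main(String[] args) {
        // at (1, 1): along the x-axis the rate of change is 2.0;
        // along the gradient direction (1, 1) it is maximal: 2 * sqrt(2)
        System.out.println(directional(1, 1, 1, 0));
        System.out.println(directional(1, 1, 1, 1));
    }
}
```

The second value being the largest over all directions is exactly the "maximum directional derivative" property of the gradient discussed below.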
4.1.3 Hessian matrix
When finding stationary points of a function of one variable, we can happily use Newton's method on a real number x. But in machine learning x is usually not a real number: when Newton's method is applied, x is a vector, y may also be a vector, and the second derivative with respect to x is a matrix, namely the Hessian matrix.

In mathematics, the Hessian matrix is a square matrix composed of the second-order partial derivatives of a real-valued function of a vector; the second derivative of a multivariate function is a Hessian matrix.
4.1.4 Squared loss function in Alink

As mentioned, the loss function in linear regression is the squared loss function. Let's look at its implementation; loss and derivative will be invoked later, and the details will be discussed then.

UnaryLossFunc is an interface representing a unary loss function. Each function it defines takes two inputs (eta and y), and Alink treats the difference between them as the single variable of the loss function. The basic API computes the loss, the derivative, and the second derivative.
public interface UnaryLossFunc extends Serializable {
    // Loss function.
    double loss(double eta, double y);

    // The derivative of loss function.
    double derivative(double eta, double y);

    // The second derivative of the loss function.
    double secondDerivative(double eta, double y);
}
The specific realization of the squared loss function is as follows:
public class SquareLossFunc implements UnaryLossFunc {
    @Override
    public double loss(double eta, double y) {
        return 0.5 * (eta - y) * (eta - y);
    }

    @Override
    public double derivative(double eta, double y) {
        return eta - y;
    }

    @Override
    public double secondDerivative(double eta, double y) {
        return 1;
    }
}
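The three APIs can be verified by hand. The sketch below mirrors the SquareLossFunc logic above in plain static methods (the class name is mine): for eta = 3.0 and y = 1.0, the loss is 0.5 · 2² = 2.0, the derivative is 2.0, and the second derivative is the constant 1.

```java
// Standalone check of the squared loss APIs; method bodies mirror
// SquareLossFunc above, the class name is illustrative.
public class SquareLossDemo {
    static double loss(double eta, double y) { return 0.5 * (eta - y) * (eta - y); }
    static double derivative(double eta, double y) { return eta - y; }
    static double secondDerivative(double eta, double y) { return 1; }

    public static void main(String[] args) {
        System.out.println(loss(3.0, 1.0));             // 2.0
        System.out.println(derivative(3.0, 1.0));       // 2.0
        System.out.println(secondDerivative(3.0, 1.0)); // 1.0
    }
}
```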
4.2 Objective Function
The concepts involved here are the gradient and gradient descent.
4.2.1 Gradient
For model optimization, we choose the optimal θ so that f(x) is closest to the true value. The problem thus becomes finding the optimal θ that minimizes the loss function J(θ). How do we solve it after this transformation? This brings in the concept of Gradient Descent.
So we’re going to review the gradient first.
- A vector is defined as a quantity with direction and magnitude.
- A gradient is a vector that has a direction and a magnitude; It is defined as: the partial derivatives of a multivariate function with respect to its independent variables are taken respectively, and the vector formed by these partial derivatives is the gradient of the function.
- The gradient is the maximum directional derivative of the function at a point, and the function has the maximum rate of change along the gradient.
- The first meaning of gradient is the **"maximum directional derivative"**.
- The gradient direction at the current position is the direction in which the derivative of the function at this position is the largest, and the direction in which the function value rises the fastest, and the opposite direction is the direction in which the function value drops the fastest.
- The geometric meaning of a gradient is that the rate of change is greatest along the direction of the vector.
4.2.2 Gradient descent method
Gradient descent is a first-order optimization algorithm. Its core idea: to find a local minimum of a function as quickly as possible, perform an iterative search with a specified step size in the direction opposite to the gradient (or approximate gradient) at the current point. Moving against the gradient (the slope) is called gradient descent.
Since at a certain point in the variable space, the function has the maximum change rate along the gradient direction, it is natural to reduce the function value along the negative gradient direction when optimizing the objective function, so as to achieve our optimization goal.
"Descent" in gradient descent means that the unknowns of the function move against the direction of the gradient. What does that mean? If the gradient at a point is positive, we make the value of that variable a little smaller; if the gradient at the point is negative, we make it a little larger.

How do we decrease the function value along the negative gradient? Since the gradient consists of partial derivatives, and both the gradient and the partial derivatives are vectors, by the rules of vector arithmetic we decrease the variable value along each coordinate axis.

Gradient descent is the process of descending along all partial derivatives of the gradient to the lowest point. Once at the lowest point, the optimal value of each unknown (each dimension) has been found, so it is an algorithm for optimizing a function.
The "least squares method" and the "gradient descent method" are often used together: the former defines "search for the minimum error", the latter "search as fast as possible". Tuning the parameters of the least-squares objective becomes an extremum problem of a multivariate function, to which the gradient descent method can be applied.

In the least-squares objective, what we have are the sample points and their results: the matrix X and the label value y of each sample x. X is a matrix and y is a vector. What we need to keep in mind is that in gradient descent, the partial derivatives are taken not with respect to x and y, but with respect to the parameters w.
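Putting the two together, here is a minimal gradient-descent sketch for one-variable least squares (my own illustration, not Alink's optimizer): each iteration computes the partial derivatives of the averaged squared loss with respect to w0 (intercept) and w1 (slope), then steps both parameters against the gradient.

```java
// Gradient descent on the least-squares objective
// J(w) = (1/2n) * sum_i (w0 + w1 * x_i - y_i)^2. Illustrative only.
public class GradientDescentSketch {
    public static double[] fit(double[] x, double[] y, double lr, int iters) {
        double w0 = 0.0, w1 = 0.0;
        int n = x.length;
        for (int it = 0; it < iters; it++) {
            double g0 = 0.0, g1 = 0.0;
            for (int i = 0; i < n; i++) {
                double err = w0 + w1 * x[i] - y[i];  // derivative of 0.5 * err^2
                g0 += err;                           // dJ/dw0 contribution
                g1 += err * x[i];                    // dJ/dw1 contribution
            }
            w0 -= lr * g0 / n;  // step against the gradient
            w1 -= lr * g1 / n;
        }
        return new double[]{w0, w1};
    }

    public static void main(String[] args) {
        // data on the line y = 2x + 1; converges to w0 ≈ 1, w1 ≈ 2
        double[] w = fit(new double[]{0, 1, 2, 3}, new double[]{1, 3, 5, 7}, 0.1, 5000);
        System.out.println(w[0] + " " + w[1]);
    }
}
```

Note that the derivatives are with respect to w0 and w1, while x and y are fixed data, exactly as stated above.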
4.2.3 Objective function in Alink
The base class for objective functions is OptimObjFunc, which provides apis such as computing gradients, losses, and Hessian matrices, and updating gradients and Hessian matrices based on sample points. Several of its derived classes are shown below, and the scope of use can be seen in the comments.
Here we can see the regularization terms (the L1 and L2 norms), which are the part added on top of the loss function.
public abstract class OptimObjFunc implements Serializable {
    protected final double l1;
    protected final double l2;  // Regularization (L1, L2 norm).
    protected Params params;
    ...
}

// Unary loss object function.
public class UnaryLossObjFunc extends OptimObjFunc

// The OptimObjFunc for multilayer perceptron.
public class AnnObjFunc extends OptimObjFunc

// Accelerated failure time Regression object function.
public class AftRegObjFunc extends OptimObjFunc

// Softmax object function.
public class SoftmaxObjFunc extends OptimObjFunc
For linear models, BaseLinearModelTrainBatchOp generates the objective function according to the model type. You can see that when generating the objective function it also sets the corresponding loss function; for linear regression this is the SquareLossFunc we mentioned earlier.
public static OptimObjFunc getObjFunction(LinearModelType modelType, Params params) {
    OptimObjFunc objFunc;
    // For different model type, we must set corresponding loss object function.
    switch (modelType) {
        case LinearReg:
            // Here we are!
            objFunc = new UnaryLossObjFunc(new SquareLossFunc(), params);
            break;
        case SVR:
            double svrTau = params.get(LinearSvrTrainParams.TAU);
            objFunc = new UnaryLossObjFunc(new SvrLossFunc(svrTau), params);
            break;
        case LR:
            objFunc = new UnaryLossObjFunc(new LogLossFunc(), params);
            break;
        case SVM:
            objFunc = new UnaryLossObjFunc(new SmoothHingeLossFunc(), params);
            break;
        case Perceptron:
            objFunc = new UnaryLossObjFunc(new PerceptronLossFunc(), params);
            break;
        case AFT:
            objFunc = new AftRegObjFunc(params);
            break;
        default:
            throw new RuntimeException("Not implemented yet!");
    }
    return objFunc;
}
4.2.4 Unary objective function in Alink
The unary objective function is the objective function used in linear regression. It adds only one new member variable, unaryLossFunc: the unary loss function.
/**
 * Unary loss object function.
 */
public class UnaryLossObjFunc extends OptimObjFunc {
    private UnaryLossFunc unaryLossFunc;
}
Unary objective function provides many functions, the main ones we use here are:
- calcGradient: calculates the gradient from a set of sample points; inherited from the base class OptimObjFunc;
- updateGradient: updates the gradient based on one sample point;
- calcSearchValues: calculates the losses for the line search;
4.2.4.1 Calculate gradient according to a set of sampling points
For this article, what is updated here is the gradient of the loss function.
Again, the loss function is used to measure the degree of fitting, so as to evaluate the quality of model fitting, denoted as J(θ). Notice, the loss function is a function of theta! In other words, θ is no longer a parameter of the loss function, but an independent variable of the loss function!
When we calculate the loss, we put the characteristic xi in each sample and the corresponding true value yi of the target variable into the loss function. At this time, only θ is unknown in the loss function.
The gradient of the loss function is the vector of its partial derivatives with respect to each θi. Since the loss function is a function of θ, the resulting gradient vector varies with the value of θ. In the "going down the mountain" metaphor, different values of θ correspond to different positions on the mountain, and each position has its own gradient vector ∇J(θ).
L1, L2 is the regularization l1, L2 norm mentioned earlier.
/**
 * Calculate gradient by a set of samples.
 *
 * @param labelVectors train data.
 * @param coefVector   coefficient of current time.
 * @param grad         gradient.
 * @return weight sum
 */
public double calcGradient(Iterable<Tuple3<Double, Double, Vector>> labelVectors,
                           DenseVector coefVector, DenseVector grad) {
    double weightSum = 0.0;
    for (int i = 0; i < grad.size(); i++) {
        grad.set(i, 0.0);
    }
    // Compute the gradient one by one for the input sample set labelVectors
    for (Tuple3<Double, Double, Vector> labelVector : labelVectors) {
        if (labelVector.f2 instanceof SparseVector) {
            ((SparseVector) (labelVector.f2)).setSize(coefVector.size());
        }
        // Take this sample as an example:
        // labelVector = (1.0, 16.8, 1.0 1.0 1.4657097546055162 1.4770978917519928)
        //   f0 = 1.0   (weight)
        //   f1 = 16.8  (label)
        //   f2 = DenseVector "1.0 1.0 1.4657097546055162 1.4770978917519928"
        weightSum += labelVector.f0;  // labelVector.f0 is the weight
        updateGradient(labelVector, coefVector, grad);
    }
    if (weightSum > 0.0) {
        grad.scaleEqual(1.0 / weightSum);
    }
    // l2 regularization
    if (0.0 != this.l2) {
        grad.plusScaleEqual(coefVector, this.l2 * 2);
    }
    // l1 regularization
    if (0.0 != this.l1) {
        double[] coefArray = coefVector.getData();
        for (int i = 0; i < coefVector.size(); i++) {
            grad.add(i, Math.signum(coefArray[i]) * this.l1);
        }
    }
    return weightSum;
}
4.2.4.2 Update gradients based on a sampling point
Here labelVector.f0 is the weight, labelVector.f1 is y, labelVector.f2 is the four-dimensional x-vec, and coefVector is the coefficient vector w.

- getEta computes the dot product of the x vector and the current coefficients w, i.e., the current predicted y.
- labelVector.f0 * unaryLossFunc.derivative(eta, labelVector.f1) calls SquareLossFunc.derivative to compute the weighted first derivative.
- updateGrad.plusScaleEqual(labelVector.f2, div) accumulates this into the gradient.
public class UnaryLossObjFunc extends OptimObjFunc {
    /**
     * Update gradient by one sample.
     *
     * @param labelVector a sample of train data.
     * @param coefVector  coefficient of current time.
     * @param updateGrad  gradient need to update.
     */
    @Override
    protected void updateGradient(Tuple3<Double, Double, Vector> labelVector,
                                  DenseVector coefVector, DenseVector updateGrad) {
        // Dot product is the current predicted y
        double eta = getEta(labelVector, coefVector);
        // First derivative; labelVector.f0 is the weight
        double div = labelVector.f0 * unaryLossFunc.derivative(eta, labelVector.f1);
        // Accumulate div * x into the gradient; f2 is the x-vec,
        // e.g. 1.0 1.0 1.4657097546055162 1.4770978917519928
        updateGrad.plusScaleEqual(labelVector.f2, div);
    }

    private double getEta(Tuple3<Double, Double, Vector> labelVector, DenseVector coefVector) {
        // Dot product of the sample's feature vector and the current coefficient vector
        return MatVecOp.dot(labelVector.f2, coefVector);
    }
}

/**
 * Plus with another vector scaled by "alpha".
 */
public void plusScaleEqual(Vector other, double alpha) {
    if (other instanceof DenseVector) {
        BLAS.axpy(alpha, (DenseVector) other, this);
    } else {
        BLAS.axpy(alpha, (SparseVector) other, this);
    }
}
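The per-sample update can be checked by hand with a small sketch that mirrors the logic above (names are mine): eta = ⟨x, w⟩, div = weight · (eta − y), then grad += div · x. With w = (1, 1), x = (2, 3), y = 4 and weight 1, we get eta = 5, div = 1, and the gradient contribution (2, 3).

```java
// Hand check of the per-sample gradient update for the squared loss,
// mirroring getEta, SquareLossFunc.derivative and plusScaleEqual above.
public class GradientUpdateDemo {
    public static double[] updateGradient(double weight, double y,
                                          double[] x, double[] w, double[] grad) {
        double eta = 0.0;
        for (int i = 0; i < x.length; i++) {
            eta += x[i] * w[i];                 // getEta: dot product, current prediction
        }
        double div = weight * (eta - y);        // weighted first derivative of squared loss
        for (int i = 0; i < grad.length; i++) {
            grad[i] += div * x[i];              // plusScaleEqual: grad += div * x
        }
        return grad;
    }

    public static void main(String[] args) {
        // w = (1, 1), x = (2, 3), y = 4: eta = 5, div = 1, contribution (2, 3)
        double[] g = updateGradient(1.0, 4.0, new double[]{2, 3},
                                    new double[]{1, 1}, new double[]{0, 0});
        System.out.println(g[0] + " " + g[1]);
    }
}
```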
4.3 Optimization Function
Alink provides a series of parallel optimizers, such as GD, SGD, LBFGS, OWLQN, and Newton's method.
Its base class is Optimizer.
public abstract class Optimizer {
    protected final DataSet<?> objFuncSet;  // Concrete objective function; computes gradient and loss
    protected final DataSet<Tuple3<Double, Double, Vector>> trainData;  // Training data
    protected final Params params;  // Parameters
    protected DataSet<Integer> coefDim;  // Dimension of features
    protected DataSet<DenseVector> coefVec = null;  // Final coefficients w
    ...
}
Linear regression mainly uses the LBFGS algorithm.
public class Lbfgs extends Optimizer
The call is as follows
public static DataSet<Tuple2<DenseVector, double[]>> optimize(.....) {
    // Loss object function
    DataSet<OptimObjFunc> objFunc = session.getExecutionEnvironment()
        .fromElements(getObjFunction(modelType, params));

    if (params.contains(LinearTrainParams.OPTIM_METHOD)) {
        LinearTrainParams.OptimMethod method = params.get(LinearTrainParams.OPTIM_METHOD);
        return OptimizerFactory.create(objFunc, trainData, coefficientDim, params, method).optimize();
    } else if (params.get(HasL1.L_1) > 0) {
        return new Owlqn(objFunc, trainData, coefficientDim, params).optimize();
    } else {
        // Our program will run here
        return new Lbfgs(objFunc, trainData, coefficientDim, params).optimize();
    }
}
The basic optimization routines of machine learning are:
Prepare data ----> optimization function ----> objective function ----> loss function
And here we have:

BaseLinearModelTrainBatchOp.linkFrom (overall logic) ----> Lbfgs (extends Optimizer) ----> UnaryLossObjFunc (extends OptimObjFunc) ----> SquareLossFunc (implements UnaryLossFunc)
0x05 Data Preparation
Having looked at the underlying functions, let's return to the overall flow of linear regression.
To summarize, the basic process of BaseLinearModelTrainBatchOp.linkFrom is as follows (some publishing platforms render nested lists poorly, so serial numbers are used).

First, recall the example input: Row.of("$3$0:1.0 1:7.0 2:9.0", "1.0 7.0 9.0", 1.0, 7.0, 9.0, 16.8), where the last four values correspond to the columns "f0", "f1", "f2", "label".
- 1) Obtain label information, including label values and label type, via labelInfo = getLabelInfo(). A distinct operation is applied here, so duplicates are removed. In the end, the possible label values are 0 and 1, and the type is Double.
- 2) Transform the input into Tuple3<weight, label, feature vector> format. Specifically, the three input features "f0", "f1", "f2" are converted into a vector, which we will later call x-vec. The point is that the features become a vector, so this triple can be thought of as <weight, y-value, x-vec>.
- 3) Use statInfo = getStatInfo() to obtain statistical variables, including vector size, mean, and variance. The process here is involved:
- 3.1) Use trainData.map { return value.f2; } to extract the x-vec from the training data.
- 3.2) Call StatisticsHelper.summary to process the x-vecs.
- 3.2.1) Call summarizer, which builds a summary of the vectors (count, sum, squareSum, normL1, min, max, numNonZero).
- 3.2.1.1) Call mapPartition(new VectorSummarizerPartition(bCov)).
- 3.2.1.1.1) VectorSummarizerPartition.mapPartition traverses its list; each element sv of the list is an x-vec. srt = srt.visit(sv) updates count, sum, squareSum, normL1, etc., yielding these statistics for each input column within the partition.
- 3.2.1.2) Call reduce(VectorSummarizerUtil.merge(value1, value2)) to merge the results of all partitions.
- 3.3) coefficientDim = summary.map(...) obtains the vector dimension; meanVar = summary.map(...) obtains the mean and variance.
- 3.4) Finally return Tuple2.of(coefficientDim, meanVar).
- 4) preProcess(initData, params, statInfo.f1) standardizes the input data and adds the intercept term; the meanVar obtained above is passed in as an argument. This is the normalization of x-vec. For example, the original input triple is "(1.0,16.8,1.0 7.0 9.0)", where x-vec is "1.0 7.0 9.0". After standardization x-vec has 4 items, the first being the fixed intercept feature "1.0": "1.0 1.0 1.4657097546055162 1.4770978917519928". So the converted triple is "(1.0,16.8,1.0 1.0 1.4657097546055162 1.4770978917519928)": weight is 1.0, y-value is 16.8, and the remaining four items are x-vec.
- This completes the data processing.
- 5) Call optimize(params, statInfo.f0, trainData, linearModelType) to minimize the loss function and optimize the model (using the L-BFGS algorithm, which will be explained separately).
- 6) Call mapPartition(new CreateMeta()) to prepare the model metadata.
- 7) Call mapPartition(new BuildModelFromCoefs) to build the model.
As you can see, data preparation is a big part of the process, so let’s take a look at a few steps of data preparation.
5.1 Obtaining Label Information
The code here corresponds to step 1) of the basic flow above.
A distinct operation would deduplicate the label values for classification; since this is regression, the label type is simply Double.
private Tuple2<DataSet<Object>, TypeInformation> getLabelInfo(BatchOperator in,
                                                              Params params,
                                                              boolean isRegProc) {
    String labelName = params.get(LinearTrainParams.LABEL_COL);
    // Prepare label values
    DataSet<Object> labelValues;
    TypeInformation<?> labelType = null;
    if (isRegProc) {
        // Because this is regression, we take this branch
        labelType = Types.DOUBLE;
        labelValues = MLEnvironmentFactory.get(in.getMLEnvironmentId())
            .getExecutionEnvironment().fromElements(new Object());
    } else {
        ......
    }
    return Tuple2.of(labelValues, labelType);
}
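The regression/classification distinction can be sketched in plain Java (class and method names here are my own; this only mimics the branch logic of getLabelInfo, not its Flink DataSet plumbing):

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LabelInfoSketch {
    // Returns a pair (labelValues, labelType) analogous to getLabelInfo's Tuple2.
    static Map.Entry<List<Object>, Class<?>> getLabelInfo(List<Double> labels, boolean isRegProc) {
        if (isRegProc) {
            // Regression: label type is Double; a single dummy value stands in for labelValues.
            return new AbstractMap.SimpleEntry<>(Collections.singletonList((Object) new Object()), Double.class);
        }
        // Classification: collect the distinct label values (the "distinct" step).
        List<Object> distinct = labels.stream().distinct().collect(Collectors.toList());
        return new AbstractMap.SimpleEntry<>(distinct, Double.class);
    }
}
```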
5.2 Convert input to triples
The code here corresponds to 2) of the basic flow above.
The transform function converts the input into the triple Tuple3<Double, Double, Vector>. Specifically, the three features "f0", "f1", "f2" of the input are packed into a vector, which we will call x-vec; the point is that the features become a vector. So this triple can be read as <weight, y-value, x-vec>.
private DataSet<Tuple3<Double, Double, Vector>> transform(BatchOperator in,
                                                          Params params,
                                                          DataSet<Object> labelValues,
                                                          boolean isRegProc) {
    ......
    // Get the schema
    TableSchema dataSchema = in.getSchema();
    // Get the various column indexes
    int labelIdx = TableUtil.findColIndexWithAssertAndHint(dataSchema.getFieldNames(), labelName);
    ......
    int weightIdx = weightColName != null ? TableUtil.findColIndexWithAssertAndHint(in.getColNames(), weightColName) : -1;
    int vecIdx = vectorColName != null ? TableUtil.findColIndexWithAssertAndHint(in.getColNames(), vectorColName) : -1;
    // Map each Row to Tuple3<weight, label, x-vec>
    return in.getDataSet().map(new Transform(isRegProc, weightIdx, vecIdx, featureIndices, labelIdx)).withBroadcastSet(labelValues, LABEL_VALUES);
}
The corresponding variables print as follows:
params = {Params@2745} "Params {featureCols=["f0","f1","f2"], labelCol="label", predictionCol="linpred"}"
labelValues = {DataSource@2845}
isRegProc = true
featureColNames = {String[3] @2864}
0 = "f0"
1 = "f1"
2 = "f2"
labelName = "label"
weightColName = null
vectorColName = null
dataSchema = {TableSchema@2866} "root\n |-- svec: STRING\n |-- vec: STRING\n |-- f0: DOUBLE\n |-- f1: DOUBLE\n |-- f2: DOUBLE\n |-- label: DOUBLE\n"
featureIndices = {int[3] @2878}
0 = 2
1 = 3
2 = 4
labelIdx = 5
weightIdx = -1
vecIdx = -1
At runtime we enter the Transform.map function. As we can see, the three features of the input, "f0", "f1", "f2", are converted into a vector, which we call x-vec.
private static class Transform extends RichMapFunction<Row, Tuple3<Double, Double, Vector>> {
    @Override
    public Tuple3<Double, Double, Vector> map(Row row) throws Exception {
        // Get the weight
        Double weight = weightIdx != -1 ? ((Number) row.getField(weightIdx)).doubleValue() : 1.0;
        // Get the label
        Double val = FeatureLabelUtil.getLabelValue(row, this.isRegProc,
            labelIdx, this.positiveLableValueString);
        if (featureIndices != null) {
            // Get x-vec from the feature columns
            DenseVector vec = new DenseVector(featureIndices.length);
            for (int i = 0; i < featureIndices.length; ++i) {
                vec.set(i, ((Number) row.getField(featureIndices[i])).doubleValue());
            }
            // Build the triple
            return Tuple3.of(weight, val, vec);
        } else {
            Vector vec = VectorUtil.getVector(row.getField(vecIdx));
            return Tuple3.of(weight, val, vec);
        }
    }
}
For the original input Row.of("$3$0:1.0 1:7.0 2:9.0", "1.0 7.0 9.0", 1.0, 7.0, 9.0, 16.8), the variables in the program are:
row = {Row@9723} "$3$0:1.0 1:7.0 2:9.0,1.0 7.0 9.0,1.0,7.0,9.0,16.8"
weight = {Double@9724} 1.0
val = {Double@9725} 16.8
vec = {DenseVector@9729} "1.0 7.0 9.0"
vecIdx = -1
featureIndices = {int[3] @9726}
0 = 2
1 = 3
2 = 4
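Based on the example row, the sparse-vector string format and the triple construction can be sketched in plain Java. The parser below is inferred from the "$3$0:1.0 1:7.0 2:9.0" example (size prefix, then index:value pairs); in Alink the real parsing is done by VectorUtil, and the class and method names here are hypothetical:

```java
import java.util.Arrays;

public class TransformSketch {
    // Parse a sparse-vector string of the form "$size$i:v i:v ..." into a dense array.
    // (Format inferred from the example row; illustrative only.)
    static double[] parseSparse(String s) {
        int second = s.indexOf('$', 1);
        int size = Integer.parseInt(s.substring(1, second));
        double[] dense = new double[size];
        for (String kv : s.substring(second + 1).trim().split("\\s+")) {
            String[] p = kv.split(":");
            dense[Integer.parseInt(p[0])] = Double.parseDouble(p[1]);
        }
        return dense;
    }

    // Build the <weight, y-value, x-vec> triple from feature values and a label,
    // in the spirit of Transform.map; weight defaults to 1.0 when there is no weight column.
    static Object[] toTriple(double[] features, double label) {
        return new Object[] {1.0, label, features};
    }
}
```

For the example row, parseSparse yields the same values as the dense string "1.0 7.0 9.0", and the triple carries weight 1.0 and y-value 16.8.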
5.3 Obtaining Statistical Variables
getStatInfo() computes the statistics (vector size, mean, and variance) that the later standardization step relies on.
The code here corresponds to step 3) of the basic flow above:
- 3) Get statistical variables, including vector size, mean, and variance, using statInfo = getStatInfo(). The process here is involved:
- 3.1) Use trainData.map { return value.f2; } to extract the x-vec from the training data.
- 3.2) Call StatisticsHelper.summary to process the x-vecs.
- 3.2.1) Call summarizer, which builds a summary of the vectors (count, sum, squareSum, normL1, min, max, numNonZero).
- 3.2.1.1) Call mapPartition(new VectorSummarizerPartition(bCov)).
- 3.2.1.1.1) VectorSummarizerPartition.mapPartition traverses its list; each element sv of the list is an x-vec. srt = srt.visit(sv) updates count, sum, squareSum, normL1, etc., yielding these statistics for each input column within the partition.
- 3.2.1.2) Call reduce(VectorSummarizerUtil.merge(value1, value2)) to merge the results of all partitions.
- 3.3) coefficientDim = summary.map(...) obtains the vector dimension; meanVar = summary.map(...) obtains the mean and variance.
- 3.4) Finally return Tuple2.of(coefficientDim, meanVar).
private Tuple2<DataSet<Integer>, DataSet<DenseVector[]>> getStatInfo(
        DataSet<Tuple3<Double, Double, Vector>> trainData, final boolean standardization) {
    if (standardization) {
        DataSet<BaseVectorSummary> summary = StatisticsHelper.summary(trainData.map(
            new MapFunction<Tuple3<Double, Double, Vector>, Vector>() {
                @Override
                public Vector map(Tuple3<Double, Double, Vector> value) throws Exception {
                    return value.f2; // Get the x-vec of the training data
                }
            }).withForwardedFields());

        DataSet<Integer> coefficientDim = summary.map(new MapFunction<BaseVectorSummary, Integer>() {
            public Integer map(BaseVectorSummary value) throws Exception {
                return value.vectorSize(); // Get the dimension
            }
        });

        DataSet<DenseVector[]> meanVar = summary.map(new MapFunction<BaseVectorSummary, DenseVector[]>() {
            public DenseVector[] map(BaseVectorSummary value) {
                if (value instanceof SparseVectorSummary) {
                    // Compute min and max
                    DenseVector max = ((SparseVector) value.max()).toDenseVector();
                    DenseVector min = ((SparseVector) value.min()).toDenseVector();
                    for (int i = 0; i < max.size(); ++i) {
                        max.set(i, Math.max(Math.abs(max.get(i)), Math.abs(min.get(i))));
                        min.set(i, 0.0);
                    }
                    return new DenseVector[] {min, max};
                } else {
                    // Compute mean and standardDeviation
                    return new DenseVector[] {(DenseVector) value.mean(), (DenseVector) value.standardDeviation()};
                }
            }
        });
        return Tuple2.of(coefficientDim, meanVar);
    }
    ......
}
5.4 Standardizing the input data and adding the intercept
This corresponds to step 4) of the basic flow.
The input data is standardized and the intercept term is added; the meanVar obtained above is passed in as an argument. This is the normalization of x-vec.
For example, the original input triple is "(1.0,16.8,1.0 7.0 9.0)", where x-vec is "1.0 7.0 9.0". After standardization x-vec has 4 items, the first being the fixed value "1.0": "1.0 1.0 1.4657097546055162 1.4770978917519928". So the converted triple is "(1.0,16.8,1.0 1.0 1.4657097546055162 1.4770978917519928)".
Why is the first item fixed at "1.0"? Because f(x) = w'x + b contains the constant intercept b; prepending a constant feature 1.0 lets the intercept be learned as just another coefficient.
private DataSet<Tuple3<Double, Double, Vector>> preProcess(......) {
    ......
    return initData.map(
        new RichMapFunction<Tuple3<Double, Double, Vector>, Tuple3<Double, Double, Vector>>() {
            private DenseVector[] meanVar;

            @Override
            public Tuple3<Double, Double, Vector> map(Tuple3<Double, Double, Vector> value) {
                // value = {Tuple3@9791} "(1.0,16.8,1.0 7.0 9.0)"
                Vector aVector = value.f2;
                // aVector = {DenseVector@9792} "1.0 7.0 9.0"
                if (aVector instanceof DenseVector) {
                    DenseVector bVector;
                    if (standardization) {
                        if (hasInterceptItem) {
                            bVector = new DenseVector(aVector.size() + 1);
                            bVector.set(0, 1.0); // Set the fixed intercept feature
                            for (int i = 0; i < aVector.size(); ++i) {
                                // Standardize the input data
                                bVector.set(i + 1, (aVector.get(i) - meanVar[0].get(i)) / meanVar[1].get(i));
                            }
                        }
                    }
                    // bVector = {DenseVector@9814} "1.0 1.0 1.4657097546055162 1.4770978917519928"
                    return Tuple3.of(value.f0, value.f1, bVector);
                }
                ......
            }
        }).withBroadcastSet(meanVar, MEAN_VAR);
}
// This is the normalization of x-vec. For example, the original input triple is "(1.0,16.8,1.0 7.0 9.0)",
// where x-vec is "1.0 7.0 9.0"; after standardization it is "1.0 1.0 1.4657097546055162 1.4770978917519928",
// so the converted triple is "(1.0,16.8,1.0 1.0 1.4657097546055162 1.4770978917519928)".
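The bVector construction above can be sketched as a standalone function. This is a minimal sketch, assuming meanVar[0] holds per-column means and meanVar[1] per-column standard deviations as in the code; the class name and the sample mean/std values in the test are hypothetical, not the ones from the document's training run:

```java
public class PreProcessSketch {
    // Standardize x-vec and prepend the fixed intercept feature 1.0,
    // mirroring the bVector construction in preProcess.
    static double[] standardizeWithIntercept(double[] x, double[] mean, double[] std) {
        double[] b = new double[x.length + 1];
        b[0] = 1.0; // intercept feature
        for (int i = 0; i < x.length; i++) {
            b[i + 1] = (x[i] - mean[i]) / std[i]; // z-score each feature
        }
        return b;
    }
}
```

With x-vec "1.0 7.0 9.0" and hypothetical means {1.0, 4.0, 5.0} and stds {1.0, 2.0, 4.0}, the result is a 4-item vector whose first entry is the fixed 1.0, just like the bVector in the run above.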
At this point, input processing is complete.
For example, the original input triple is "(1.0,16.8,1.0 7.0 9.0)", where x-vec is "1.0 7.0 9.0".
After standardization, x-vec has 4 items (the first being the fixed value "1.0"): "1.0 1.0 1.4657097546055162 1.4770978917519928".
The converted triple is "(1.0,16.8,1.0 1.0 1.4657097546055162 1.4770978917519928)": weight is 1.0, y-value is 16.8, and the remaining four items are x-vec.
Now we can start to optimize the model. Stay tuned.