0x00 Abstract

Based on the paper Automatic Differentiation in Machine Learning: A Survey, this post and the following ones will analyze automatic differentiation step by step as a fundamental tool of machine learning.

0.1 Origin

The author originally planned to analyze the distributed implementation of PyTorch, but during the analysis of distributed autograd it became clear that, without first sorting out automatic differentiation and the ordinary autograd engine, analyzing distributed autograd would be difficult. Therefore, I went back and studied this topic.

0.2 Automatic differentiation

As we know, the basic process of training a model with a deep learning framework is as follows:

  • Build the computation graph according to the model.
  • Compute the loss function from the input.
  • Compute the derivatives of the loss function with respect to the model parameters.
  • According to the derivatives, use gradient descent or other methods to update the model parameters in reverse, so as to minimize the loss function.

The process of constructing the computation graph (the dependencies) and computing the loss function is called "forward propagation"; it follows the user's model definition and is essentially specified by the user. The process of differentiating the loss function is called "back propagation"; it is too heavy a burden for users, so deep learning frameworks provide automatic differentiation. There are two core problems that deep learning frameworks solve for us:

  • Automatic gradient calculation and update during backpropagation.
  • Use GPU for calculation.

And this is where the idea of automatic gradient computation comes in.

In mathematics and computational algebra, Automatic Differentiation (AD), also called algorithmic differentiation, is a method for numerically computing the derivatives, gradients, Hessian matrix values, and so on of complex functions (multilayer composite functions) at a given point.

0x01 Basic Concepts

For the completeness of this article, we first introduce some basic concepts and ideas; if you are already familiar with them, please skip ahead to Chapter 2.

1.1 Machine learning

Herbert Simon (winner of the Turing Award in 1975 and the Nobel Prize in Economics in 1978) gave a definition of "learning": "If a system can improve its performance by executing a process, then that process is learning."

So machine learning is about learning from empirical data, extracting the important patterns and trends in the data, to improve the performance of a prediction function (a function mapping specific inputs to expected outputs).

For example, a function can be used to distinguish cats from dogs; we need a lot of training data to fit this function and improve its performance.

1.2 Deep learning

Traditional machine learning uses knowledge and experience to extract features from raw data for training; feature extraction is an important part of machine learning called feature engineering. Feature engineering is very challenging because the raw data involves a huge number of features of many kinds. Deep learning lets the neural network itself learn/extract various features of the data, or form high-level features by combining various low-level ones.

1.3 Loss function

For a machine learning function, given an input we get an output; for example, input a picture of a cat and the output is "cat". But the actual output may differ from what we expect. Therefore, we need an evaluation system to distinguish good functions from bad ones, which leads to the loss function.

A loss function (or cost function) is used to measure the degree of "difference", i.e., the accuracy, between the expected output (true value) and the actual output (predicted value).

The loss function quantifies how well the model fits as a single function value. If we choose different model parameters, the value of the loss function changes accordingly. The smaller the loss function value, the smaller the difference between the actual output and the expected output, and the more accurate the constructed model. For example, the common mean squared error (MSE) loss can be regarded, geometrically, as the average of the squared distances between the predicted values and the actual values.
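
For concreteness, here is a minimal sketch of mean squared error in plain Python (the function name and the data are made up for illustration):

    def mse(predictions, targets):
        # Mean squared error: the average of the squared differences
        # between predicted values and true values.
        return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

    # The closer the predictions are to the targets, the smaller the loss.
    print(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # 0.1666...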

1.4 Weight and bias

There are two kinds of parameters in a loss function:

  • We call the degree of influence between neurons the weight. The weight is the strength of the connection; it tells the neighboring neuron in the next layer which input signals to pay more attention to.
  • Besides connection weights, there is a special parameter applied to the neuron itself, called the bias. The bias adjusts how easily the neuron is activated; that is, it determines how large the weighted sum of the neuron's connections must be before firing becomes meaningful.

The neural network architecture is designed so that the network can learn to perform "better". "Learning" here means constantly adjusting the weights and biases to find the most appropriate weights and biases between neurons, so as to minimize the value of the loss function.
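
As an illustration (a sketch with made-up numbers, not any framework's API), a single neuron combines weighted inputs with a bias before applying an activation; a larger bias makes the neuron easier to activate for the same inputs:

    import math

    def neuron(inputs, weights, bias):
        # Weighted sum of inputs plus bias, passed through a sigmoid activation.
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-z))

    print(neuron([0.5, -1.0], [0.8, 0.2], bias=0.0))  # ~0.55
    print(neuron([0.5, -1.0], [0.8, 0.2], bias=2.0))  # ~0.90: the bias shifts activation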

1.5 Derivatives and gradients

One of the characteristics of neural networks is to learn from data samples. In other words, the network weight parameters can be automatically determined by training data.

Now that we have the loss function as an evaluation system, we can use it to adjust the weights in the network in reverse so as to minimize the loss. That is, if a set of weights makes the loss function reach its minimum, those weights are the ideal parameters we seek.

If the loss function is y = f(x) and we want its minimum, we look for the extreme points of the function, i.e., the solutions of the equation f'(x) = 0. But computers are not good at solving such equations symbolically, so they can only try to approach the extreme points through numerical methods such as iteration.

What is a derivative?

A derivative measures the "rate of change" of a function. For a particular point x0, the derivative at that point is the instantaneous rate of change at x0: the slope of the tangent line there.

What is a gradient?

A gradient is fundamentally a vector: it indicates the direction along which the directional derivative of a function at a point attains its maximum value. That is, the function changes fastest along this direction (the direction of the gradient) at this point, and the maximal rate of change is the modulus of the gradient.

For a real-valued function, the gradient at a particular point is the direction in which the function value increases most rapidly from that point; for a function of one variable, it reduces to the ordinary derivative.

For machine learning/deep learning, the gradient direction is the direction in which the loss function changes fastest. Because we want to minimize the loss, we compute the gradients of the neural network's weight parameters, determine the direction of parameter adjustment according to gradient descent, and optimize in that direction.

1.6 Gradient descent

The general idea of gradient descent is as follows: first set random initial values for the parameters W and b, then use an iterative algorithm to compute the output of the current network, and adjust the parameters of the preceding layers in the opposite direction of the gradient according to the difference between the network output and the expected output, until the network converges and stabilizes.

Specifically, the steps are as follows (a minimal code sketch follows the list):

  1. Give the parameters W and b random initial values.

  2. For i = 0 to the number of training samples:

    • According to the current values of the parameters W and b, compute the output of the network for the i-th training sample.
    • According to the difference between the network output and the expected output, compute the gradient of the loss function with respect to the weights W and the bias b.
  3. In this way, a gradient of the weights and the bias is obtained for each training sample.

  4. Add up the weight gradients of all the samples and compute the average weight gradient.

  5. Add up the bias gradients of all the samples and compute the average bias gradient.

  6. Update the weight and bias values:

    • w = w − η × (average weight gradient)
    • b = b − η × (average bias gradient), where η is the learning rate
    • Return to step 2 and continue iterating until the network converges.
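
Below is a minimal runnable sketch of these steps for 1-D linear regression (the data, function name, and hyper-parameters are made up for illustration):

    # Batch gradient descent for 1-D linear regression y = w*x + b,
    # with a mean-squared-error loss over the whole training set.
    def train(xs, ys, lr=0.1, epochs=500):
        w, b = 0.0, 0.0                   # step 1: initial values for w and b
        n = len(xs)
        for _ in range(epochs):
            grad_w, grad_b = 0.0, 0.0
            for x, y in zip(xs, ys):      # step 2: loop over the training data
                err = (w * x + b) - y     # network output minus expected output
                grad_w += 2 * err * x     # gradient of this sample's squared error w.r.t. w
                grad_b += 2 * err         # ... and w.r.t. b
            grad_w /= n                   # steps 3-5: average the per-sample gradients
            grad_b /= n
            w -= lr * grad_w              # step 6: w = w - eta * (average weight gradient)
            b -= lr * grad_b              #         b = b - eta * (average bias gradient)
        return w, b

    print(train([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))  # converges toward w = 2, b = 1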

Of course, there are many ways to optimize gradient descent, with different logic.

1.7 Back propagation

When talking about the back propagation algorithm, we usually emphasize only the backward pass, but in essence it is a two-phase algorithm. In other words, it takes two big steps:

Forward propagation: the batch data is fed into the network, and the input information is computed and propagated forward (a series of matrix operations processed by activation functions, "propagating" forward layer by layer until it reaches the output layer); the final predicted values are then compared with the true labels, and the loss function computes the loss for this iteration. The focus is on how the input affects each layer.

Back propagation: the error propagates backward from the very end of the network into every layer of the model. According to the chain rule of derivation, the weights and biases of the network are adjusted step by step, so as to minimize the gap between the final output and the expected output. The focus is on how each layer affects the final result.

Back propagation involves a large amount of gradient computation, which brings up the question: how are these gradients computed? The deep learning framework helps us solve two core problems (a small PyTorch example follows the list):

  • Automatic gradient calculation and update during backpropagation, also known as automatic differentiation.
  • Use GPU for calculation.
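
As a concrete illustration, here is a minimal PyTorch sketch of the two phases (the tensors and numbers are made up):

    import torch

    # Forward propagation: compute the prediction and the loss.
    w = torch.tensor([1.0, 2.0], requires_grad=True)
    b = torch.tensor(0.5, requires_grad=True)
    x = torch.tensor([3.0, 4.0])
    y_true = torch.tensor(12.0)

    y_pred = (w * x).sum() + b        # layer-by-layer forward computation
    loss = (y_pred - y_true) ** 2     # compare prediction with the true label

    # Back propagation: autograd traverses the graph in reverse and
    # fills in .grad for every leaf tensor with requires_grad=True.
    loss.backward()
    print(w.grad)  # tensor([-3., -4.])
    print(b.grad)  # tensor(-1.)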

1.8 Differentiable programming

1.8.1 Long live differentiable programming

Yann LeCun wrote in his post "Deep Learning is Dead, Long Live Differentiable Programming":

“Deep learning is essentially a new way of programming — differentiable programming — and the field is trying to formulate reusable structures in this way. So far we have: convolution, pooling, LSTM, GAN, VAE, memory units, routing units, etc.”

But the important point is that people are now assembling networks of parameterized function modules, building new kinds of software, and training them with some kind of gradient-based optimization.

More and more people are programmatically defining networks in a data-dependent way (with loops and conditionals), so that the network changes dynamically as the input data changes. This is very similar to an ordinary program, except that the network is parameterized, automatically differentiable, and trainable/optimizable. Dynamic networks are becoming increasingly popular (especially for NLP) thanks to deep learning frameworks such as PyTorch and Chainer (note: back in 1994, Lush, an earlier deep learning framework, could handle a special kind of dynamic network called Graph Transformer Networks for text recognition).

1.8.2 Key to the success of deep learning

David Dalrymple of the MIT Media Lab has also talked about differentiable programming. Dalrymple believes there are two keys to the success of deep learning: back propagation and weight-tying, which is much like reusing functions in functional programming. Differentiable programming has the potential to become "timeless".

Back propagation uses the chain rule (a simple calculus trick) in a very elegant way, thus deeply integrating continuous and discrete mathematics, allowing complex families of potential solutions to be improved autonomously through vector calculus.

The key to backpropagation is to organize the pattern (template) of the potential solution into a directed graph. By traversing the graph in reverse, the algorithm automatically calculates a “gradient vector” that leads the algorithm to better and better solutions.

Weight-tying is the second key. It allows the same weight-tied network component to be used in multiple places at the same time, with every copy of the component kept consistent. Weight-tying leads to better generalization, because words or objects may appear in multiple locations within a block of text or an image. Weight-tied components are essentially the same concept as reusable functions in programming (similar to writing a function once and calling it in multiple places), and the way components are reused is exactly analogous to higher-order functions in functional programming.

1.8.3 Differentiable programming

Differentiable programming is a relatively new concept and an extension of back propagation and weight-tying. The user only specifies the structure of the functions and the order in which they are called; the program is then compiled into a computation graph similar to the one needed for back propagation. The individual components of the graph must also be differentiable, and differentiable programming leaves implementation/deployment details to the optimizer: the language uses back propagation to automatically learn the details based on the overall program's goal, optimizing based on gradients, just like optimizing weights in deep learning.

Andrej Karpathy, head of Artificial intelligence at Tesla, has also floated a “software 2.0” concept.

Software 1.0 is written in Python, C++ and other languages and consists of explicit instructions for the computer. By writing each line of code, the programmer determines a specific point in the program space.

Software 2.0 is written in neural network weights; no human is directly involved in writing this code. In software 2.0, humans specify constraints on the behavior of an ideal program (for example, an input-output dataset) and search the program space, subject to available computing resources, for programs that satisfy those constraints. In this space, the search can be carried out using back propagation and stochastic gradient descent.

Karpathy says that for most real-world problems, gathering data is easier than explicitly writing the program. In the future, most programmers won't need to maintain complex software libraries, write complex programs, or analyze program runtimes. All they will need to do is collect, collate, manipulate, label, analyze, and visualize the data fed to the neural network.

In summary, now that we know the importance of automatic differentiation to machine learning, let us study it in detail.

0x02 Differentiation methods

2.1 Common Methods

Let’s first look at some of the more common ways of differentiating:

  • Manual Differentiation: done by hand; derive the gradient formula according to the chain rule, then plug in values to get the gradient.
  • Numerical Differentiation: use the original definition of the derivative to approximate the derivative value directly.
  • Symbolic Differentiation: automatically manipulate expressions using the rules of differentiation; the result is the expression of the derivative function rather than a specific value. That is, the analytic solution is found first, then converted into a program, and the program computes the gradient of the function.
  • Automatic Differentiation: a method between numerical differentiation and symbolic differentiation that uses graph-based computation to obtain derivative values.

The details are as follows:

2.2 Manual differentiation

Manual differentiation means that for each objective function you manually derive the gradient formula using the rules of differentiation, then write code according to the formula, plug in the values, and compute the final gradient.
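
A tiny sketch of manual differentiation (our own example function, not from any library): for f(x) = x^2 + sin(x), we derive f'(x) = 2x + cos(x) by hand and hard-code it. Any change to f requires rewriting grad_f as well.

    import math

    def f(x):
        return x ** 2 + math.sin(x)

    def grad_f(x):
        # Hand-derived derivative of f; must be kept in sync manually.
        return 2 * x + math.cos(x)

    print(grad_f(1.0))  # 2 + cos(1) = ~2.5403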

This method is accurate and effective, but it does not suit engineering practice because it lacks generality and flexibility: every time we modify the model, we must re-derive and re-implement the corresponding gradient computation. If the model is complex or the project iterates frequently, algorithm engineers cannot keep up even working 24 x 365, let alone 996.

2.3 Numerical differentiation

Numerical differentiation is probably the most direct and simple way to compute derivatives automatically. From the original definition of the derivative, we can intuitively obtain the forward difference formula:

f'(x) ≈ (f(x + h) - f(x)) / h

When h takes a very small value, such as 0.000001, the difference approximates the derivative. A numerical differentiation algorithm can compute the derivative value given only the function values and the difference in the independent variable. The one-sided difference formula approximates the derivative at a point directly from the definition of the derivative.

The advantages of numerical differentiation are:

  • The above formula works at almost every point, except where the function is not differentiable.
  • Simple implementation.
  • Hide the solution process from the user.

However, numerical differentiation has several problems:

  • It is computationally expensive and the slowest of these methods, especially when there are many parameters, because taking the derivative with respect to each parameter requires recomputing f(x + h).

  • Because it is a numerical approximation, it is neither reliable nor stable, and cannot give a very accurate derivative. If h is chosen improperly, the result may even have the wrong sign, increasing the error. Two problems are particularly serious:

    • Truncation error: the approximation error caused by the fact that h never truly reaches zero in numerical computation.
    • Round-off error: repeated rounding of decimals during computation accumulates error in the differentiation process.

To alleviate the truncation error, the central difference approximation was proposed. It still cannot eliminate the round-off error, but it has a smaller truncation error and better stability than the one-sided difference formula. The specific formula is as follows:

f'(x) ≈ (f(x + h) - f(x - h)) / (2h)
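
A minimal sketch of both difference formulas (the helper names are our own):

    import math

    def forward_diff(f, x, h=1e-6):
        # One-sided (forward) difference: truncation error O(h).
        return (f(x + h) - f(x)) / h

    def central_diff(f, x, h=1e-6):
        # Central difference: truncation error O(h^2), usually more accurate.
        return (f(x + h) - f(x - h)) / (2 * h)

    # The true derivative of sin at 1.0 is cos(1.0) = ~0.5403023.
    print(forward_diff(math.sin, 1.0))
    print(central_diff(math.sin, 1.0))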

Although numerical differentiation has some disadvantages, it is easy to implement, so it is often used to verify the correctness of gradients obtained by other algorithms; for example, "gradient check" uses numerical differentiation.

2.4 Symbolic differentiation

Symbolic differentiation belongs to the field of symbolic computation: it automatically manipulates expressions using the rules of differentiation, and the result of the computation is the expression of the derivative function. Symbolic computation is used to find formal solutions (also called analytic solutions) in mathematics, yielding an expression rather than a numerical value.

Symbolic differentiation is suited to automatic differentiation of symbolic expressions. Its principle is to replace manual differentiation with simple derivative rules such as the sum rule and the product rule:

d(f + g)/dx = df/dx + dg/dx

d(f · g)/dx = (df/dx) · g + f · (dg/dx)

Symbolic differentiation uses algebra software to implement differentiation formulas; then, following the derivative formulas of elementary functions, the four arithmetic operations, and the chain rule for composite functions, it turns the evaluation of a formula into a differentiation process, so that a closed-form mathematical expression provided by the user can be differentiated "automatically". That is, the analytic solution is obtained first, then converted into a program, and the program computes the gradient of the function.

Symbolic differentiation computes expressions, which need to be stored in strings or other data structures such as expression trees. Mathematical software such as Mathematica, Maple and MATLAB implements this technique; Python's symbolic computation library SymPy provides this kind of algorithm as well.
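
For instance, here is a minimal SymPy sketch (assuming SymPy is installed; the expression is made up):

    import sympy as sp

    x = sp.Symbol('x')
    expr = sp.sin(x) * x ** 2

    # diff returns the derivative as an expression, not a number.
    d = sp.diff(expr, x)
    print(d)               # x**2*cos(x) + 2*x*sin(x)
    print(d.subs(x, 1.0))  # only now is a numerical value substituted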

The problem with symbolic differentiation is:

  • The expression must be a closed-form mathematical expression, i.e., it must be possible to write it as a complete mathematical formula, with no loop structures, conditional structures, etc. Only then can the whole problem be turned into a purely mathematical symbolic problem that algebra software can solve by symbolic differentiation.
  • When expressions are complex (deeply composed functions, such as the mapping functions of neural networks), the problem of "expression swell" tends to occur, because the computer may not be able to simplify expressions intelligently.

If no care is taken, the derivative expression produced by symbolic differentiation expands rapidly ("expression swell"), making the solution slower, redundant, and computationally expensive.

In fact, for machine learning applications we do not need the expression of the derivative; we only need the derivative's value at a certain point.

2.5 Automatic differentiation

2.5.1 An intermediate method

Automatic differentiation is a method between numerical differentiation and symbolic differentiation; it uses graph-based computation to obtain derivative values.

  • Numerical differentiation: substitute numerical values from the start and approximate directly.
  • Symbolic differentiation: solve for the analytic solution of the algebraic expression first, and substitute numerical values only at the end.
  • Automatic differentiation: first apply symbolic differentiation rules to the basic operators (elementary functions), then substitute numerical values and keep the intermediate results, and finally apply the chain rule to the intermediate results to obtain the derivative of the whole function. This completely hides the differentiation process from the user, and it can be flexibly combined with loop structures, conditional structures, and so on in the programming language.

We have a few more words about analytic solutions. Almost all machine learning algorithms reduce to solving optimization problems in training or prediction. If the objective function is differentiable, the problem becomes finding its stationary points. But usually we cannot obtain the stationary points analytically, so we can only use numerical optimization algorithms such as gradient descent, Newton's method, and quasi-Newton methods. These numerical optimization algorithms depend on the first- or second-order derivative values of functions (including gradients and Hessian matrices). Therefore, we must solve the problem of obtaining the derivatives of a complex function, and automatic differentiation is a general method for this problem.

Since automatic differentiation applies symbolic differentiation rules only to basic functions or constants, it can be flexibly combined with the loop structures and conditional structures of programming languages; with or without automatic differentiation, the overall code changes very little. And since it is in fact a graph computation, it can be optimized in many ways, which is why this method is widely used in modern deep learning systems.
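
As a toy illustration of this idea (a sketch of forward-mode AD with "dual numbers", not how any particular framework implements it), each basic operation carries both a value and a derivative, and ordinary Python control flow just works:

    class Dual:
        """Toy forward-mode AD value: carries (value, derivative)."""
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.dot + other.dot)

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # Symbolic product rule applied to one basic operator,
            # keeping the numerical intermediate results alongside.
            return Dual(self.val * other.val,
                        self.dot * other.val + self.val * other.dot)

    def f(x):
        y = Dual(1.0)
        for _ in range(3):  # plain control flow: AD follows the executed ops
            y = y * x
        return y            # y = x**3

    x = Dual(2.0, dot=1.0)  # seed dx/dx = 1
    y = f(x)
    print(y.val, y.dot)     # 8.0 12.0, i.e. x**3 and 3*x**2 at x = 2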

2.5.2 Basic Mathematics

Automatic differentiation (AD) uses programs to automate the computation of the Jacobian matrix or part of it; it is a numerical computation of the derivatives of dependent variables with respect to independent variables, so its mathematical foundations are the chain rule and the Jacobian matrix.

2.5.2.1 Chain rule

Before discussing the chain rule, let us review composite functions. A composite function is essentially a function of a function: it passes the return value of one function as an argument to another function, and that function's return value on to the next, nesting functions to combine several simple functions into a more complex one.

The chain rule is the rule in calculus for differentiating composite functions and is a common technique in derivative calculations. The derivative of a composite function is the product of the derivatives of the constituent functions at the corresponding points, linked together like a chain, hence the name "chain rule".

Because Juejin (the "Nuggets" platform hosting this post) has recently had problems rendering formulas, the original formulas were posted as images; please bear with this.
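
In standard notation, for a composite function y = f(u) with u = g(x), the chain rule reads:

    \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

For example, for y = sin(x^2), taking u = x^2 gives dy/dx = cos(u) · 2x = 2x · cos(x^2).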

2.5.2.2 Jacobian matrix

In vector calculus, the Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function, arranged in a certain way; its determinant is called the Jacobian determinant. The Jacobian is important because it represents the best linear approximation of a differentiable function near a given point.

The Jacobian contains all the possible partial derivatives of one vector with respect to another; it is the "gradient" of one vector with respect to another vector and describes a map from n-dimensional vectors to m-dimensional vectors.

In vector calculations, the Jacobian matrix is the numerical matrix formed from the first-order partial derivatives of a function with respect to all of its variables; when the number of inputs equals the number of outputs, its determinant is called the Jacobian determinant.

Assuming an input vector x ∈ R^n and an output vector y = f(x) ∈ R^m, the Jacobian matrix is defined as:
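
    J = \begin{bmatrix}
        \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
        \vdots & \ddots & \vdots \\
        \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
        \end{bmatrix}

That is, the entry in row i and column j is the partial derivative of the i-th output y_i with respect to the j-th input x_j.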

0xEE Personal information

★★★★★ Thoughts on life and technology ★★★★★

Wechat official account: Rosie’s Thoughts

0xFF Reference

Yann LeCun: Deep learning is dead, long live differentiable programming!

Automatic differentiation technique

TensorFlow differentiable programming practice 2 -- Automatic differentiation notation system

En.wikipedia.org/wiki/Automa…

PyTorch learning 2020 Spring - 1 - Linear regression

Automatic Differentiation

Introduction to Automatic Differentiation — TensorFlow’s core principles

About the tensor in PyTorch and its use

What is (machine/Deep) learning?

【Beauty of deep learning 15】How to understand the loss function intuitively?

What exactly is a gradient?

【Beauty of deep learning 21】BP algorithm: forward propagation in detail

【Beauty of deep learning 22】BP algorithm: the chain rule in detail

【Beauty of deep learning 23】BP algorithm: back propagation in detail

【 Deep learning theory 】 Understanding Gradient descent

Principle and implementation steps of Gradient Descent

Gradient descent method and derivation

Pure formula hand derivation + code implementation: neural network back propagation + gradient descent

Deep learning — a concrete case of back propagation

The principle of the BP algorithm in neural networks and Python source code analysis

Machine learning — automatic differentiation

Automatic Differentiation in Machine Learning: a Survey

Artificial intelligence engine — automatic differentiation

An introduction to Automatic Differentiation

The Autodiff Cookbook

Why does PyTorch’s Backward have a grad_variables parameter?

What exactly is automatic differentiation? Here is a self-description

BackPACK: Packing More into Backprop

Differentiable programming (I): Three sins of traditional automatic differentiation

Implement your own automatic differentiation in one day

Understanding the tensor, Autograd, backpropagation and graphs in Pytorch

PyTorch Automatic differential fundamentals

PyTorch Learning Notes 1.5 Autograd and logistic regression

The PyTorch Autograd

torch.autograd: Gradient calculation in detail

zhuanlan.zhihu.com/p/348555597

Basic technology of AI framework: Autograd

Laurent Hascoet and Valérie Pascual. The Tapenade automatic differentiation tool: Principles, model, and specification. ACM Transactions on Mathematical Software (TOMS), 39(3):20, 2013.

TensorFlow differentiable programming practice 1 -- Introduction to automatic differentiation

zhuanlan.zhihu.com/p/163892899