This is the 18th day of my participation in the August Challenge


This article explains how vectorization can simplify the computation of formulas. It contains a fair amount of personal rambling along the way, so feel free to skim.

How do we set aside our subjective, human way of reasoning? How do we analyze a problem in a more computer-friendly way?

That question may sound vague, so let me give an example. My undergraduate major was biology, with computer science only as a minor, so at first the idea of simplifying never crossed my mind.

Take a simple decimal-to-binary conversion. Logically, there are two ways to do it:

Method one: see which powers of 2 the number contains


$$819 = 1 \times 2^9 + 1 \times 2^8 + 0 \times 2^7 + 0 \times 2^6 + 1 \times 2^5 + 1 \times 2^4 + 0 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0$$

Method two: the division-remainder method


$$
\begin{aligned}
819 \div 2 &= 409 \ldots 1 \\
409 \div 2 &= 204 \ldots 1 \\
204 \div 2 &= 102 \ldots 0 \\
102 \div 2 &= 51 \ldots 0 \\
51 \div 2 &= 25 \ldots 1 \\
25 \div 2 &= 12 \ldots 1 \\
12 \div 2 &= 6 \ldots 0 \\
6 \div 2 &= 3 \ldots 0 \\
3 \div 2 &= 1 \ldots 1 \\
1 \div 2 &= 0 \ldots 1
\end{aligned}
$$

Reading the remainders from bottom to top gives the same result: 1100110011.

At the time, method one seemed the most straightforward: just look at which powers of 2 the number contains. Then one day I had to implement a base conversion in code and was suddenly stuck. Huh? How do I do that? Divide by each power of 2 in turn? Keep two arrays, one for the powers of 2 and one for the quotients?

So I went to ask for help, and a computer science senior asked: is that the only method you learned? I said no, I know another one. He said: well, try writing it the other way. And then I realized the division-remainder method is much easier to implement.
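The division-remainder method maps directly onto a short loop. Here is a minimal C++ sketch (the function name `to_binary` is my own, just for illustration):

```cpp
#include <string>

// Convert a non-negative integer to binary by repeatedly dividing by 2.
// Each remainder is the next bit, least-significant first, so we prepend
// digits to build the string in normal reading order.
std::string to_binary(unsigned n) {
    if (n == 0) return "0";
    std::string bits;
    while (n > 0) {
        bits.insert(bits.begin(), char('0' + n % 2)); // remainder = next bit
        n /= 2;
    }
    return bits;
}
```

For example, `to_binary(819)` yields `"1100110011"`, matching the hand calculation above.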

This is what I mean by analyzing a problem in a way that suits the computer. To a human, method one may look simpler, but method two is more convenient for a computer to carry out. For some problems, we need to shift our thinking and look for the approach that suits the machine.

So in deep learning, how can we simplify the computation of "complex formulas"?

Here’s a simple example:


$$h_{\theta}(x)=\sum_{j=0}^{n} \theta_{j} x_{j}$$

If you computed this by hand, you'd probably think: it's just a simple linear function, plug the values in and you're done. But when "plugging it in" is implemented on a computer, it looks like this:

Declare two vectors, iterate.

Unvectorized implementation

C++:

double prediction = 0.0;
for (int j = 0; j <= n; j++) {
    prediction += theta[j] * x[j];
}

Octave:

prediction = 0.0;
for j = 1:n+1
    prediction = prediction + theta(j) * x(j);
end

That was the unvectorized version. So what is vectorization? It simply means rewriting the formula you know into multiplications of matrices and vectors. That way the computer can process the problem much faster, which is the point I made earlier.

The hypothesis $h_{\theta}(x)=\sum_{j=0}^{n} \theta_{j} x_{j}$ can be vectorized as $\theta^{T} x$ and computed directly.

This turns the computation into the product of two vectors:

$$\theta=\left[\begin{array}{l}\theta_{0} \\ \theta_{1} \\ \theta_{2} \\ \vdots\end{array}\right] \quad x=\left[\begin{array}{l}x_{0} \\ x_{1} \\ x_{2} \\ \vdots\end{array}\right]$$

Vectorized implementation

Octave:

prediction = theta' * x;

C++:

double prediction = theta.transpose()*x;

Why is vectorization faster than looping over arrays yourself? Because it lets you lean on the language's built-in linear-algebra libraries. In the Octave code above, the single quote in `theta'` denotes the transpose of a matrix or vector (I had to go back and look up the Octave syntax). Likewise, `theta.transpose()` in the C++ version calls the transpose function of a linear-algebra library such as Eigen.

As a biology student, C++ was my introductory language. When I started learning data structures and algorithms, I memorized the various sorts and their complexities. But if you asked me which sort suits a data set with particular characteristics, I would agonize over the choice, or just fall back on bubble sort. A senior once asked me: do you know which sort the C++ library functions use? I didn't. He said: go read the library source. C++ `sort`, for example, is not a single algorithm but a combination of sorts, the product of years of research and optimization.

Using a high-level language's libraries makes our work much easier. These library functions were written and optimized by experts, and they are far better tuned than anything we would write by hand.

Advantages of using the built-in algorithm:

  • Faster
  • Less code
  • Far less error-prone than writing it yourself
  • Makes better use of the underlying hardware
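To make the point concrete, here is a small sketch calling the built-in sort (the `sorted_copy` wrapper is my own, just to show the call). In mainstream standard libraries, `std::sort` is introsort: quicksort that switches to heapsort on deep recursion and insertion sort on small ranges.

```cpp
#include <algorithm>
#include <vector>

// Sort via the standard library instead of a hand-written bubble sort.
// Typical std::sort implementations are introsort, guaranteeing
// O(n log n) worst-case comparisons.
std::vector<int> sorted_copy(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```

One library call replaces an entire hand-rolled sorting routine, and it is already optimized for the cases we would never think to handle.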

With that said, here's another example:


$$\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \quad \text{for all } j$$

This is the familiar formula for gradient descent. Here $X$ is the data matrix and $y$ is the column vector of labels.

For multiple linear regression:


$$
\begin{array}{l}
\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)} \\
\theta_{1}:=\theta_{1}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{1}^{(i)} \\
\theta_{2}:=\theta_{2}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{2}^{(i)} \\
\vdots
\end{array}
$$

Let’s simplify the gradient descent formula to make it easier to code:


$$\theta:=\theta-\alpha \delta$$

$$\delta=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x^{(i)}$$

$$\delta=\left[\begin{array}{l}\delta_{0} \\ \delta_{1} \\ \delta_{2} \\ \vdots\end{array}\right]$$

and for each sample, $x^{(i)}$ is the column vector

$$x^{(i)}=\left[\begin{array}{c}x_{0}^{(i)} \\ x_{1}^{(i)} \\ x_{2}^{(i)} \\ \vdots\end{array}\right]$$


Each term of the sum is therefore a scalar multiplied by a column vector:

$$\delta = \frac{1}{m} \sum_{i=1}^{m} \underbrace{\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)}_{\text{scalar}} \; x^{(i)}$$

so the dimensions work out as $\delta_{n \times 1} = \text{scalar} \times x^{(i)}_{n \times 1}$ summed over $i$, and $\delta$ has the same shape as $x^{(i)}$.

Without vectorization, this logic takes several nested loops to complete. Here is a C++ sketch (written for illustration; I make no guarantees, and I haven't run it):

// m samples, n+1 parameters (x[i][0] == 1); first compute h(x) for every sample
for (int i = 0; i < m; i++) {
    hx[i] = 0.0;
    for (int j = 0; j <= n; j++) hx[i] += theta[j] * x[i][j];
}
// then compute sum([h(x) - y] * x) and apply the rest of the formula
for (int j = 0; j <= n; j++) {
    double sum = 0.0;
    for (int i = 0; i < m; i++) sum += (hx[i] - y[i]) * x[i][j];
    theta[j] = theta[j] - alpha * sum / m;  // note: integer 1/m would truncate to 0
}

But if you vectorize it you can write it as:

% suppose there are only two parameters
hx = X * theta;
theta(1) = theta(1) - alpha * (1/m) * sum((hx - y) .* X(:,1));
theta(2) = theta(2) - alpha * (1/m) * sum((hx - y) .* X(:,2));
% note: unlike C and most other languages, Octave subscripts start from 1

% for n parameters in general
hx = X * theta;
for j = 1:n
    theta(j) = theta(j) - alpha * (1/m) * sum((hx - y) .* X(:,j));
end

The result is the same either way, but without vectorization you end up with piles of nested for loops. So use vectorization and reduce your workload.
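In fact the per-parameter loop can be collapsed entirely into the matrix form $\theta := \theta - \frac{\alpha}{m} X^{T}(X\theta - y)$. Here is a self-contained C++ sketch of one such step using plain `std::vector` (all names here are my own; a real implementation would hand these products to a linear-algebra library such as Eigen):

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // row-major: Mat[i] is sample i

// One gradient descent step: theta := theta - (alpha/m) * X^T (X*theta - y)
Vec gradient_step(const Mat& X, const Vec& y, Vec theta, double alpha) {
    const std::size_t m = X.size(), n = theta.size();
    Vec err(m, 0.0);
    for (std::size_t i = 0; i < m; i++) {       // err = X*theta - y
        for (std::size_t j = 0; j < n; j++) err[i] += X[i][j] * theta[j];
        err[i] -= y[i];
    }
    for (std::size_t j = 0; j < n; j++) {       // theta -= (alpha/m) * X^T * err
        double g = 0.0;
        for (std::size_t i = 0; i < m; i++) g += X[i][j] * err[i];
        theta[j] -= alpha * g / m;
    }
    return theta;
}
```

In Octave the same step really is a single line: `theta = theta - alpha/m * X' * (X*theta - y);`.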