OLS (Ordinary Least Squares)

  • Formula: f(x) = k * x + b, where k = SUM(X[i] * Y[i] - AVER(X) * AVER(Y)) / SUM(X[i]^2 - AVER(X)^2) and b = AVER(Y) - k * AVER(X)

  • Derivation: None yet

  • Calculation:

    • Slope: k = SUM(X[i] * Y[i] - AVER(X) * AVER(Y)) / SUM(X[i]^2 - AVER(X)^2)
    • Offset: b = AVER(Y) - k * AVER(X)
    • Regression function: f(x) = k * x + b
/* sample: [[1,3],[2,4],[3,7],[4,8],[5,6]]
   The least squares method sets a fitting line f(x) = k * x + b, sums the squared
   error between each prediction and the input data, and minimizes the overall
   difference D(X, Y) = SUM((Y{i} - f(X{i}))^2) */
// set X = [x{i}], Y = [y{i}]
// k = SUM(X[i] * Y[i] - AVER(X) * AVER(Y)) / SUM(X[i]^2 - AVER(X)^2)
// b = AVER(Y) - k * AVER(X)
// f(x) = k * x + b
function OLS(S) { // S: the sample set, an array of [x, y] pairs
    const X = S.map(s => s[0]) // sample X values
    const Y = S.map(s => s[1]) // sample Y values
    const XA = AVER(X)
    const YA = AVER(Y)
    const SUM_S = SUM(S, ([xi, yi]) => xi * yi - XA * YA)
    const SUM_X = SUM(X, xi => Math.pow(xi, 2) - Math.pow(XA, 2))
    const k = SUM_S / SUM_X
    const b = YA - k * XA
    return (x) => k * x + b // f(x) = k * x + b
}

function SUM(X, F) {
    return X.reduce((a, b) => a + (F ? F(b) : b), 0)
}
function AVER(X, F) {
    return SUM(X, F) / X.length
}
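A quick check with the sample from the derivation comment above (a hypothetical snippet; it assumes the OLS, SUM, and AVER functions above are in scope):

// For [[1,3],[2,4],[3,7],[4,8],[5,6]]: SUM_S = 94 - 84 = 10, SUM_X = 55 - 45 = 10
// so k = 10 / 10 = 1 and b = 5.6 - 1 * 3 = 2.6
const f = OLS([[1, 3], [2, 4], [3, 7], [4, 8], [5, 6]])
console.log(f(0), f(6)) // 2.6 8.6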

GD (Gradient Descent)

Gradient descent finds a local minimum of a function.

  • Formula: x{n+1} = x{n} - f'(x{n}) * D

  • Derivation: zhuanlan.zhihu.com/p/131320496

  • Calculation:

    • Iteration: x{n+1} = x{n} - f'(x{n}) * D
    • Difference: dv = |x{n+1} - x{n}|

/* Derivation:
   The purpose of gradient descent is to update the vector X so that f(X) decreases,
   by comparing slopes at the tangent point using differentials. A Taylor expansion
   converts the comparison of f(X) into the simpler comparison of the vector X:
   (expand)     -> f(X) = f(X{0}) + f'(X{0}) * (X - X{0}) + A   // A: higher-order remainder, f'(X{0}) is the slope
   (transform)  -> f(X) - f(X{0}) = f'(X{0}) * (X - X{0})
   (vectorize)  -> set X = [x], X{0} = [x{0}], extending the scalar variables to higher dimensions
   Since X - X{0} = vector - vector = vector = scalar * unit vector, let the unit
   vector be V and pull the scalar out as D:
                -> f(X) - f(X{0}) = V * f'(X{0}) * D            // D: delta scalar, f'(X{0}) is the gradient
   We expect f(X) - f(X{0}) < 0, i.e. V * f'(X{0}) * D < 0, and this dot product is
   most negative when V points opposite to the gradient (directional derivative /
   maximum gradient: https://zhuanlan.zhihu.com/p/24913912):
                -> V = - (f'(X{0}) / |f'(X{0})|)                // V is a unit vector, so divide by the modulus; the fastest descent is the opposite direction, hence the negative sign
   (substitute) -> X - X{0} = - (f'(X{0}) / |f'(X{0})|) * D
   (fold)       -> X - X{0} = - f'(X{0}) * D1                   // D1 = D / |f'(X{0})| is a new scalar
   (result)     -> X = X{0} - f'(X{0}) * D */
// x{n+1} = x{n} - f'(x) * D
function GD(x, d) { // x: starting point, d: learning rate D
    let xn = x // x{n+1}
    let dv = 9999999999 // |x{n+1} - x{n}|
    let i = 0
    while (dv > 0.000000001 && i < 9999) {
        i++
        x = xn
        xn = x - _f(x) * d // x{n+1} = x{n} - f'(x{n}) * D
        dv = Math.abs(xn - x)
    }
    return f(xn) // the local minimum value of f
}

function f(x) { // f(x) = x^2
    return Math.pow(x, 2)
    // return Math.cos(x)
}
function _f(x) { // f'(x) = 2x
    return x * 2
    // return -Math.sin(x)
}

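As a sanity check (a sketch; the starting point 5 and learning rate 0.1 are arbitrary choices, not values from the text): for f(x) = x^2 each step multiplies x by (1 - 2 * 0.1) = 0.8, so the iterate shrinks geometrically toward the minimum at x = 0.

console.log(GD(5, 0.1)) // ≈ 0 (on the order of 1e-17 by the time the dv threshold stops the loop)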

BGD (Batch Gradient Descent)

  • Formula: D(X, Y) = SUM((Y{i} - h(X{i}))^2) / (2 * n)

  • Calculation:
    • Partial derivative of k: h{k}' = - SUM( (Y{i} - h(X{i})) * X{i} ) / n
    • Partial derivative of b: h{b}' = - SUM( Y{i} - h(X{i}) ) / n
    • Iteration of k: k{n+1} = k{n} - h{k}' * D
    • Iteration of b: b{n+1} = b{n} - h{b}' * D
    • Regression function: h(x) = k * x + b
/* Derivation:
   Batch gradient descent combines least squares with gradient descent to fit the
   function. The traditional least squares method fits by solving the partial
   derivatives directly, while gradient descent finds a local minimum rather than
   the global minimum.
   Fitting line: h(x) = k * x + b
   Difference function: D(X, Y) = SUM((Y{i} - h(X{i}))^2) / (2 * n)   // the usual mean squared error
   Apply the gradient descent iteration x{n+1} = x{n} - f'(x{n}) * D to the
   parameters k and b (in theory the parameter list could be treated as a vector):
   (partial derivatives) -> h{k}' = - SUM( (Y{i} - h(X{i})) * X{i} ) / n
                         -> h{b}' = - SUM( Y{i} - h(X{i}) ) / n
   (result)              -> k{n+1} = k{n} - h{k}' * D
                         -> b{n+1} = b{n} - h{b}' * D */
// k{n+1} = k{n} - h{k}' * D
// b{n+1} = b{n} - h{b}' * D
// h(x) = k * x + b
function BGD(S, D) { // S: sample set, D: learning rate
    let i = 0
    let k = 0
    let b = 0

    function f(x) {
        return k * x + b
    }
    function _fk() { // h{k}'
        return - SUM(S, ([xi, yi]) => (yi - f(xi)) * xi) / S.length
    }
    function _fb() { // h{b}'
        return - SUM(S, ([xi, yi]) => yi - f(xi)) / S.length
    }

    let kn = k
    let bn = b
    let dv = 9999999

    while (dv > 0.00000001 && i < 9999) {
        i++
        k = kn
        b = bn
        kn = k - _fk() * D // k{n+1} = k{n} - h{k}' * D
        bn = b - _fb() * D // b{n+1} = b{n} - h{b}' * D
        dv = Math.abs(kn - k) + Math.abs(bn - b)
    }
    k = kn // keep the latest iterate for the returned closure
    b = bn
    return (x) => k * x + b // h(x) = k * x + b
}

function SUM(X, F) {
    return X.reduce((a, b) => a + (F ? F(b) : b), 0)
}

SGD (Stochastic Gradient Descent)

  • Formula: h{k}' = - (Y{i} - h(X{i})) * X{i}, h{b}' = - (Y{i} - h(X{i}))

MBGD (Mini-Batch Gradient Descent)

  • Formula: h{k}' = - STO{n}( (Y{i} - h(X{i})) * X{i} ), h{b}' = - STO{n}( Y{i} - h(X{i}) )

/* Derivation:
   Stochastic gradient descent is equivalent to omitting the averaging step over the
   whole data set in BGD:
   h{k}' = - SUM( (Y{i} - h(X{i})) * X{i} ) / n  ->  h{k}' = - (Y{i} - h(X{i})) * X{i}
   h{b}' = - SUM( Y{i} - h(X{i}) ) / n           ->  h{b}' = - (Y{i} - h(X{i}))
   Mini-batch gradient descent is equivalent to taking a random subset of the data
   set when calculating the mean:
   h{k}' = - SUM( (Y{i} - h(X{i})) * X{i} ) / n  ->  h{k}' = - STO{n}( (Y{i} - h(X{i})) * X{i} )
   h{b}' = - SUM( Y{i} - h(X{i}) ) / n           ->  h{b}' = - STO{n}( Y{i} - h(X{i}) )
   On why stochastic gradient descent converges:
   https://www.zhihu.com/question/27012077/answer/501362912 */
// h(x) = k * x + b
function SGD(S, D) {
    return MBGD(S, D)
}
function MBGD(S, D, N) { // S: sample set, D: learning rate, N: batch size
    let i = 0
    let k = 0
    let b = 0

    function f(x) {
        return k * x + b
    }
    function _fk() { // h{k}'
        return - STO(S, ([xi, yi]) => (yi - f(xi)) * xi, N)
    }
    function _fb() { // h{b}'
        return - STO(S, ([xi, yi]) => yi - f(xi), N)
    }

    let kn = k
    let bn = b
    let dv = 9999999

    while (dv > 0.00000001 && i < 9999) {
        i++
        k = kn
        b = bn
        kn = k - _fk() * D // k{n+1} = k{n} - h{k}' * D
        bn = b - _fb() * D // b{n+1} = b{n} - h{b}' * D
        dv = Math.abs(kn - k) + Math.abs(bn - b)
    }
    k = kn // keep the latest iterate for the returned closure
    b = bn
    return (x) => k * x + b // h(x) = k * x + b
}

// If N is 1, this is stochastic gradient descent: SGD is a special case of mini-batch gradient descent
function STO(X, F, N = 1) { // average F over N samples drawn at random without replacement
    const P = [...X]
    const Q = []
    for (let i = 0; i < N; i++) {
        const idx = Math.floor(Math.random() * P.length)
        const item = P.splice(idx, 1)
        Q.push(...item)
    }
    return SUM(Q, F) / Q.length
}

function SUM(X, F) {
    return X.reduce((a, b) => a + (F ? F(b) : b), 0)
}

Algorithm result test

Taking the traditional OLS algorithm as the baseline: with a small number of data samples, the results of BGD basically agree with OLS, and BGD, MBGD, and SGD all stabilize when the learning rate is set correctly. Choosing a learning rate that actually converges therefore seems to be a very important problem (of course, there may also be problems with my implementation, orz).
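The comparison described above can be scripted roughly like this (a hypothetical harness; it assumes the OLS, BGD, SGD, and MBGD functions from the previous sections are in scope, and the learning rates are untuned guesses):

const S = [[1, 3], [2, 4], [3, 7], [4, 8], [5, 6]]
const fits = { OLS: OLS(S), BGD: BGD(S, 0.01), SGD: SGD(S, 0.01), MBGD: MBGD(S, 0.01, 3) }
for (const [name, h] of Object.entries(fits)) {
    console.log(name, h(0).toFixed(3), h(6).toFixed(3)) // offset and an extrapolated point
}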

Things I can’t figure out at the moment

  1. Convergence of the learning rate in gradient descent: why is the learning rate generally set to 0.01? Experiments show that too large a learning rate oscillates and fails to converge. What is the boundary of convergence for gradient descent? Is it related to the data set? (Both GD and BGD can fail to converge.)

  2. Dynamically adjusting the learning rate according to the current gradient is, for now, an optimization problem; it is also worth considering whether the normalization schemes commonly used with these algorithms are related to the convergence of the learning rate in gradient descent.