A simple math genius 🦖🦖🦖 roar! Roar! Roar!

1 Convex functions

1.1 Convex sets

A set C is convex if, for any two points x1, x2 ∈ C, every point on the line segment connecting x1 and x2 also belongs to C:


$$\theta x_1 + (1 - \theta)x_2 \in C \quad (\forall\, 0 \le \theta \le 1)$$

In plain words: take any two points in the set and connect them; every point on that segment is still in the set.

Drawing code:

import numpy as np
import matplotlib.pyplot as plt

pi, sin, cos = np.pi, np.sin, np.cos
r1 = 1
theta = np.linspace(0, 2*pi, 36)  # 36 angles around the circle
x1 = r1*cos(theta)
y1 = r1*sin(theta)
plt.fill(x1, y1)
plt.show()

The figure above shows a convex set: connect any two points in the dark region, and every point on the segment stays in the dark region.
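The claim can also be checked numerically; a quick sketch (not a proof) that samples random pairs inside the unit disk and verifies that every convex combination stays inside:

```python
import numpy as np

rng = np.random.default_rng(0)

def in_disk(p, r=1.0):
    """Membership test for the closed disk of radius r."""
    return np.hypot(p[0], p[1]) <= r + 1e-12

for _ in range(1000):
    # Rejection-sample two points inside the unit disk.
    pts = []
    while len(pts) < 2:
        p = rng.uniform(-1, 1, size=2)
        if in_disk(p):
            pts.append(p)
    x1, x2 = pts
    # Every convex combination theta*x1 + (1-theta)*x2 must stay in the disk.
    for theta in np.linspace(0, 1, 11):
        assert in_disk(theta * x1 + (1 - theta) * x2)

print("all convex combinations stayed inside the disk")
```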

import numpy as np
import matplotlib.pyplot as plt

pi, sin, cos = np.pi, np.sin, np.cos
r1 = 1
r2 = 2
theta = np.linspace(0, 2*pi, 36)
x1 = r1*cos(theta)
y1 = r1*sin(theta)
x2 = r2*cos(theta)
y2 = r2*sin(theta)
plt.fill(x2, y2, 'r')  # outer disk in red
plt.fill(x1, y1, 'w')  # inner disk in white, leaving a ring
plt.show()

The figure above is not a convex set: connect two points in the dark ring, and the points on the segment are not necessarily inside the ring.

1.2 Convex functions

A function f: R^d \mapsto R defined on a convex set (mapping a d-dimensional x to a one-dimensional y) is convex if it satisfies:


$$f(\theta x_1 + (1 - \theta)x_2) \le \theta f(x_1) + (1 - \theta)f(x_2) \quad (\forall\, 0 \le \theta \le 1)$$

where \forall x_1, x_2 \in \psi, and \psi denotes the domain.

Drawing code:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(20)
plt.plot(x, x*x/2)  # plot f(x) = x^2 / 2
plt.show()

The function above is f(x) = \frac{1}{2}x^2.

1.3 The gradient

For a differentiable multivariate function y = f(x_1, x_2, \ldots, x_n), the gradient collects the partial derivatives with respect to each x_i:


$$\nabla f(x) = \left(\dfrac{\partial f(x)}{\partial x_1}, \dfrac{\partial f(x)}{\partial x_2}, \ldots, \dfrac{\partial f(x)}{\partial x_n}\right)$$

That’s the gradient.

Then the Taylor expansion of f(x) about a point x_1 is: f(x) = f(x_1) + f'(x_1)(x - x_1) + \cdots

For a convex function, take two nearby points x_1 and x_2 on f(x) at random; without loss of generality let x_2 > x_1, and consider the amount by which f(x_2) grows relative to f(x_1).

So what is the portion of the increase?

It is the area enclosed by the curve f'(x), the lines x = x_1 and x = x_2, and y = 0. (We restrict to two dimensions here for convenience.)

When x_2 tends to x_1, f'(x) is essentially constant over the interval, f'(x_1) = f'(x_2), so every slope on [x_1, x_2] equals f'(x_1). Then f'(x_1) multiplied by (x_2 - x_1) equals the growth of f(x_2) relative to f(x_1).

But when x_1 and x_2 are not so close, the derivative of a convex function keeps increasing, that is, f'(x_1) < f'(x_2). Then the growth of f(x_2) relative to f(x_1) is greater than f'(x_1) times (x_2 - x_1).


Which is the following:


$$f(x_2) \ge f(x_1) + \nabla f(x_1)^T (x_2 - x_1)$$

So the first-order Taylor expansion of f(x) at any point in the domain is a lower bound on f.
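A quick numerical sanity check of this lower bound, using the running example f(x) = ½x² (whose derivative is x) as the convex function:

```python
import numpy as np

# f(x) = 0.5 * x**2 is convex; its tangent at any point is a global lower
# bound: f(x2) >= f(x1) + f'(x1) * (x2 - x1).
f = lambda x: 0.5 * x**2
grad = lambda x: x  # derivative of 0.5 * x**2

rng = np.random.default_rng(1)
x1 = rng.uniform(-10, 10, size=1000)
x2 = rng.uniform(-10, 10, size=1000)

lhs = f(x2)
rhs = f(x1) + grad(x1) * (x2 - x1)
assert np.all(lhs >= rhs - 1e-9)
print("first-order Taylor expansion is a lower bound everywhere")
```

(For this particular f, lhs - rhs works out to ½(x2 - x1)², which is visibly nonnegative.)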

1.4 Strongly convex functions

We are doing machine learning, so we need to be strict: not only must the function be convex, with its curve above every tangent line, but the curve must also keep a certain distance above the tangent line. This distance is not fixed; it is given by a function as well.

A function that satisfies the following inequality is called \lambda-strongly convex:


$$f(x_2) \ge f(x_1) + \nabla f(x_1)^T (x_2 - x_1) + \dfrac{\lambda}{2}\|x_2 - x_1\|^2$$

where \forall x_1, x_2 \in \psi (\psi denotes the domain) and \exists \lambda \in R_{+}.

That is, the increase of the function must exceed a certain amount (not a constant, but a quantity given by the quadratic term above), and the corresponding convexity inequality becomes:


$$f(\theta x_1 + (1 - \theta)x_2) \le \theta f(x_1) + (1 - \theta)f(x_2) - \dfrac{\lambda\,\theta(1 - \theta)}{2}\|x_2 - x_1\|^2 \quad (\forall\, 0 \le \theta \le 1)$$

where \forall x_1, x_2 \in \psi (\psi denotes the domain) and \exists \lambda \in R_{+}.
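For the running example f(x) = ½x², strong convexity holds with λ = 1; for a quadratic the bound is in fact tight, which a small check confirms:

```python
import numpy as np

# f(x) = 0.5 * x**2 is lambda-strongly convex with lambda = 1: the tangent
# plus the quadratic term 0.5 * lam * (x2 - x1)**2 is still a lower bound.
f = lambda x: 0.5 * x**2
grad = lambda x: x

rng = np.random.default_rng(2)
x1 = rng.uniform(-5, 5, size=1000)
x2 = rng.uniform(-5, 5, size=1000)

lam = 1.0
lower = f(x1) + grad(x1) * (x2 - x1) + 0.5 * lam * (x2 - x1)**2
# For this quadratic the strong-convexity bound holds with equality.
assert np.allclose(f(x2), lower)
print("0.5*x**2 is 1-strongly convex (bound tight for quadratics)")
```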

1.5 l-Lipschitz continuity

Suppose the local variation of f(x) never exceeds a fixed multiple of the distance between two points, even as x_1 gets infinitely close to x_2.

Then \exists\, l \in R_{+} such that \forall\, x_1, x_2 \in \psi (\psi is the domain):


$$f(x_2) - f(x_1) \le l\,\|x_2 - x_1\|$$

A function satisfying this property is called l-Lipschitz continuous (when the same condition is imposed on the gradient \nabla f(x), the function is called l-smooth).
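As a sketch (with sin chosen here just for illustration): sin(x) is l-Lipschitz with l = 1, since its derivative is bounded by 1, and a crude sampling estimate agrees:

```python
import numpy as np

# sin(x) is l-Lipschitz with l = 1, because |sin'(x)| = |cos(x)| <= 1.
rng = np.random.default_rng(3)
x1 = rng.uniform(-10, 10, size=100000)
x2 = rng.uniform(-10, 10, size=100000)

d = np.abs(x2 - x1)
mask = d > 1e-8  # avoid dividing by (near-)zero distances
ratios = np.abs(np.sin(x2) - np.sin(x1))[mask] / d[mask]
l_est = ratios.max()
assert l_est <= 1.0 + 1e-9
print(f"estimated Lipschitz constant of sin: {l_est:.4f} (true l = 1)")
```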

1.6 The Hessian matrix

The Hessian matrix is the matrix of second partial derivatives of f(x) over its domain:


$$\nabla^2 f(x) \in R^{d \times d}$$

If the domain of f(x) is convex and the second derivative is everywhere nonnegative, written:


$$\nabla^2 f(x) \succeq 0$$

then f(x) is convex, and \succeq 0 means the Hessian matrix is positive semidefinite.
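A minimal check with a quadratic f(x) = xᵀAx, whose Hessian is 2A (the matrix A below is chosen arbitrarily for illustration):

```python
import numpy as np

# For f(x) = x^T A x with A symmetric, the Hessian is 2A.  f is convex
# exactly when the Hessian is positive semidefinite, i.e. all of its
# eigenvalues are >= 0.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric, eigenvalues 1 and 3
hessian = 2 * A
eigvals = np.linalg.eigvalsh(hessian)
assert np.all(eigvals >= 0)   # Hessian is PSD, so f is convex
print("Hessian eigenvalues:", eigvals)
```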

1.7 Operations that preserve convexity

  • If f is convex, then f(Ax + b) is also convex.

  • If f_1, f_2, \ldots, f_n are convex and \omega_1, \omega_2, \ldots, \omega_n \ge 0, then f(x) = \sum_{i=1}^{n} \omega_i f_i(x) is convex.

  • If f_1, f_2, \ldots, f_n are convex, then f(x) = \max\{f_1(x), f_2(x), \ldots, f_n(x)\} is also convex.

  • If \forall z \in X, f(x, z) is convex in x, then g(x) = \sup_{z \in X} f(x, z) is convex in x. (For every fixed z, f(\cdot, z) is convex in x, and taking the pointwise supremum over z preserves convexity.)
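A numerical spot-check of the pointwise-max rule, using the illustrative convex pieces |x| and x² − 1:

```python
import numpy as np

# f(x) = max{|x|, x**2 - 1} is a pointwise max of convex functions, so it
# should be convex; check the defining inequality on random point pairs.
f = lambda x: np.maximum(np.abs(x), x**2 - 1)

rng = np.random.default_rng(4)
x1 = rng.uniform(-3, 3, size=1000)
x2 = rng.uniform(-3, 3, size=1000)
for theta in (0.25, 0.5, 0.75):
    lhs = f(theta * x1 + (1 - theta) * x2)
    rhs = theta * f(x1) + (1 - theta) * f(x2)
    assert np.all(lhs <= rhs + 1e-9)
print("pointwise max of convex functions passed the convexity check")
```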

1.8 Conjugate functions

For f: R^d \mapsto R, the conjugate function is defined as: f^*(z) = \sup_x\, (z^T x - f(x))

The conjugate function is the maximum gap between the linear function z^T x and f(x).

For example: for f(x) = \frac{1}{2}x^2, the value of the conjugate f^*(z) at z = 2 is the largest gap between 2x and f(x). Because f is convex, f'(x) is monotonically increasing: while f'(x) < 2, the line 2x grows faster than f(x) and the gap keeps increasing; once f'(x) > 2, f(x) grows faster and the gap shrinks. So the gap is largest where f'(x) = 2, i.e. at x = 2, giving f^*(2) = 2 \cdot 2 - \frac{1}{2} \cdot 2^2 = 2.
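The example above can be reproduced by brute force: evaluate z·x − f(x) on a dense grid and take the max (a sketch, not a general conjugate solver). For f(x) = ½x² the closed form is f*(z) = ½z²:

```python
import numpy as np

# Conjugate of f(x) = 0.5 * x**2: f*(z) = sup_x (z*x - f(x)).  Setting the
# derivative z - x to zero gives x = z, so f*(z) = 0.5 * z**2 in closed form.
f = lambda x: 0.5 * x**2
xs = np.linspace(-10, 10, 100001)

def conjugate(z):
    """Brute-force sup over a dense grid."""
    return np.max(z * xs - f(xs))

for z in (-3.0, 0.0, 2.0):
    assert abs(conjugate(z) - 0.5 * z**2) < 1e-4
print("f*(2) =", conjugate(2.0))   # the maximal gap between 2x and f(x)
```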

Conjugate functions have good properties:

  • Whether the original function is convex or not, the conjugate function must be convex.
  • If the original function is differentiable, then:

$$f^*(\nabla f(x)) = \nabla f(x)^T x - f(x) = -\left[f(x) + \nabla f(x)^T (0 - x)\right]$$

2 Fundamentals of Optimization

2.1 What is optimization

I'm sure you've all learned about extrema; an extremum is exactly the optimal solution we are looking for.

The optimal solution of the problem can be expressed as:


$$f(x_0) \le f(x), \quad x \in \Omega$$

Then x_0 is the minimum; the maximum case is similar, and optimization is finding that minimum.

A function can have multiple local minima, and we want to find the global minimum, but some algorithms such as gradient descent can only find a local optimal solution, i.e. a local minimum, not necessarily the global one.

If both the objective function and the constraint function are convex, then the problem becomes convex optimization.
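A small illustration of the local-minimum caveat: on a non-convex function (chosen arbitrarily here), plain gradient descent lands in different minima depending on the starting point:

```python
import numpy as np

# Gradient descent on the non-convex f(x) = x**4 - 2*x**2 + 0.5*x:
# the starting point decides which local minimum we land in.
f = lambda x: x**4 - 2*x**2 + 0.5*x
grad = lambda x: 4*x**3 - 4*x + 0.5

def gradient_descent(x0, lr=0.02, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = gradient_descent(-1.0)   # lands in the deeper (global) minimum
right = gradient_descent(1.0)   # lands in the shallower local minimum
assert f(left) < f(right)       # same algorithm, different optima
print(f"from -1.0 -> x={left:.3f}, f={f(left):.3f}")
print(f"from  1.0 -> x={right:.3f}, f={f(right):.3f}")
```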

2.2 Primal problem and dual problem

The primal problem explicitly lists m inequality constraints and n equality constraints, written as:


$$\min_{x}\; f(x) \quad \text{s.t.} \quad h_i(x) \le 0 \;\; (i \in [m]), \quad g_j(x) = 0 \;\; (j \in [n])$$

When the primal problem is difficult to solve but its dual problem is easy, solving the dual can yield the optimal value of the original problem (when strong duality holds).

How does the primal (principal) problem transform into its dual problem? For linear programs:

  • The n decision variables of the primal (each ≥ 0, ≤ 0, or unconstrained) become the n constraints of the dual (correspondingly ≥, ≤, or =).
  • The m linear constraints of the primal (each ≤, ≥, or =) become the m variables of the dual (correspondingly ≥ 0, ≤ 0, or unconstrained).
  • The primal objective is a max whose coefficients become the dual's constraint constants; the dual objective is a min whose coefficients are the primal's constraint constants.

Example:

  1. Xiao Ming has a printer, which he can use to make money.
  2. Rent the printer to Xiao Yang and make money by renting it out.

So from Xiao Ming's point of view, whatever the conditions are, Ming wants to maximize the profit from running the printer; from Xiao Yang's point of view, Yang wants to pay as little rent as possible while still offering Ming at least as much as the printer would earn him. That is a pair of dual problems.
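The printer story can be made concrete with a toy LP (all the prices and limits below are invented for illustration). Brute-forcing both problems over a grid shows the two optima meet, as LP strong duality promises:

```python
import numpy as np

# Toy primal LP:  max 3*x1 + 2*x2  s.t.  x1 + x2 <= 4,  x1 <= 2,  x >= 0.
# Its dual:       min 4*y1 + 2*y2  s.t.  y1 + y2 >= 3,  y1 >= 2,  y >= 0.
g = np.linspace(0, 6, 25)          # grid step 0.25 (exact in binary)
X1, X2 = np.meshgrid(g, g)

feasible_p = (X1 + X2 <= 4) & (X1 <= 2)
primal_opt = np.max((3*X1 + 2*X2)[feasible_p])

feasible_d = (X1 + X2 >= 3) & (X1 >= 2)   # reuse the grid as (y1, y2)
dual_opt = np.min((4*X1 + 2*X2)[feasible_d])

assert primal_opt <= dual_opt + 1e-9      # weak duality
assert abs(primal_opt - dual_opt) < 1e-9  # strong duality holds for LPs
print(f"primal max = {primal_opt:.2f}, dual min = {dual_opt:.2f}")
```

Here the optimum of both problems is 10: the primal attains it at (x1, x2) = (2, 2) and the dual at (y1, y2) = (2, 1).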

2.3 Lagrange duality

Why is it necessary to use Lagrange duality to change the original problem into a dual problem?

Answer: no matter what form the original problem takes, its dual problem is always convex (the dual function is concave, and maximizing a concave function over \lambda \succeq 0 is a convex problem).

Why does a dual function have to be concave?

A:

(This is what I wrote at CSDN.)

The proof uses the following formula:


$$\min_i a_i + \min_j b_j \le \min_k \{a_k + b_k\} \qquad i, j, k \in N^+$$

The following diagram illustrates that dual functions are concave

Drawing code:

import numpy as np
import matplotlib.pyplot as plt

alpha = np.linspace(0, 10, 1000)
y = 2*alpha
z = 3*alpha - 20
q = -9*alpha + 12
plt.figure(figsize=(8, 4))
plt.plot(alpha, y, color="black", linewidth=2)
plt.plot(alpha, z, color="black", linewidth=2)
plt.plot(alpha, q, color="black", linewidth=2)
plt.plot(alpha, np.minimum(np.minimum(y, z), q), color="red", linewidth=3)  # lower envelope
plt.xlabel("Alpha")
plt.ylabel("L")
plt.title("Example")
plt.show()

For each fixed x, L(x, \alpha) is a straight line in \alpha; fixing \alpha and taking the minimum of L over x picks out the lower envelope of these lines (the red line), and a pointwise minimum of linear functions is concave.

2.4 Generalized Lagrangian functions

\lambda_i and \mu_j are the Lagrange multipliers introduced for the inequality constraints h_i(x) \le 0 and the equality constraints g_j(x) = 0 respectively:


$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i h_i(x) + \sum_{j=1}^{n} \mu_j g_j(x)$$

The corresponding Lagrangian dual function is:


$$\Gamma(\lambda, \mu) = \inf_{x \in \psi} L(x, \lambda, \mu) = \inf_{x \in \psi}\left(f(x) + \sum_{i=1}^{m} \lambda_i h_i(x) + \sum_{j=1}^{n} \mu_j g_j(x)\right)$$

Because h_i(x) \le 0 and g_j(x) = 0, and \lambda_i \ge 0, we have \sum_{i=1}^{m} \lambda_i h_i(x) + \sum_{j=1}^{n} \mu_j g_j(x) \le 0.

For any x \in \psi:

$$\Gamma(\lambda, \mu) = \inf_{x \in \psi} L(x, \lambda, \mu) \le L(x, \lambda, \mu) \le f(x)$$

So the dual function \Gamma(\lambda, \mu) gives a lower bound on the optimal value of the objective f(x) (denote it p^*).

That is, \forall \lambda \succeq 0 we have \Gamma(\lambda, \mu) \le p^*.

Denote the optimal value of the dual function \Gamma(\lambda, \mu) by d^*; d^* is a lower bound on p^*, namely:


$$d^* \le p^*$$

This is weak duality


$$d^* = p^*$$

This is strong duality. Strong duality does not hold in general, but if the primal problem is a convex optimization problem and there is at least one feasible point at which the inequality constraints hold strictly (Slater's condition), then strong duality holds.

2.5 KKT conditions

The KKT conditions describe the relationship between the optimal solutions of the primal and dual problems. Let x^* be an optimal solution of the primal problem and (\lambda^*, \mu^*) an optimal solution of the dual problem. When strong duality holds:


$$f(x^*) = \Gamma(\lambda^*, \mu^*) = \inf_{x \in \psi} L(x, \lambda^*, \mu^*) \le f(x^*) + \sum_{i=1}^{m} \lambda_i^* h_i(x^*) + \sum_{j=1}^{n} \mu_j^* g_j(x^*) \le f(x^*)$$

So both inequalities must hold with equality, which forces the following two conditions:

  • Complementary slackness conditions:

$$\lambda_i^* h_i(x^*) = 0 \quad i \in [m] \quad (\text{one for each of the } m \text{ inequality constraints})$$

That is, for each i: if \lambda_i^* > 0 then h_i(x^*) = 0, and if h_i(x^*) \neq 0 then \lambda_i^* = 0.

  • x^* is an optimal solution of the following problem:

$$x^* = \arg\min_{x \in \psi} L(x, \lambda^*, \mu^*) = \arg\min_{x \in \psi}\; f(x) + \sum_{i=1}^{m} \lambda_i^* h_i(x) + \sum_{j=1}^{n} \mu_j^* g_j(x)$$

Usually \psi is the whole space, or x^* lies in the interior of \psi, so the gradient of the Lagrangian L(x, \lambda^*, \mu^*) at x^* is 0, that is:


$$\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^{n} \mu_j^* \nabla g_j(x^*) = 0$$

Therefore, KKT conditions are composed of the following parts:

  1. Primal constraints: h_i(x^*) \le 0 \;(i \in [m]) and g_j(x^*) = 0 \;(j \in [n])
  2. Dual constraint: \lambda^* \succeq 0
  3. Complementary slackness: \lambda_i^* h_i(x^*) = 0 \;(i \in [m])
  4. The gradient of the Lagrangian at x^* is 0: \nabla f(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^{n} \mu_j^* \nabla g_j(x^*) = 0

KKT conditions have the following important properties:

  • When strong duality holds, the KKT conditions are necessary conditions for the optimal solutions of any optimization problem.
  • For convex optimization problems, the KKT conditions are sufficient: any solution satisfying them must be optimal.
  • For convex optimization problems satisfying strong duality, the KKT conditions are necessary and sufficient: x^* is an optimal solution of the original problem if and only if (x^*, \lambda^*, \mu^*) satisfies the KKT conditions.
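A minimal check of all four KKT conditions on a toy problem: minimize (x − 2)² subject to x − 1 ≤ 0, whose solution x* = 1 with multiplier λ* = 2 can be found by hand:

```python
import numpy as np

# Toy problem: minimize f(x) = (x - 2)**2  s.t.  h(x) = x - 1 <= 0.
# Analytically: the constraint is active, x* = 1, and stationarity
# 2*(x* - 2) + lambda* = 0 gives lambda* = 2.
f_grad = lambda x: 2 * (x - 2)
h = lambda x: x - 1
h_grad = lambda x: 1.0

x_star, lam_star = 1.0, 2.0

assert h(x_star) <= 0                                   # 1. primal feasibility
assert lam_star >= 0                                    # 2. dual feasibility
assert lam_star * h(x_star) == 0                        # 3. complementary slackness
assert f_grad(x_star) + lam_star * h_grad(x_star) == 0  # 4. stationarity
print("KKT conditions hold at x* = 1, lambda* = 2")
```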

3 References

[1] Matplotlib documentation

[2] Zhou Zhihua, Wang Wei, Gao Wei, Zhang Lijun, Machine Learning Theory Guide