GRADIENT DESCENT


Gradient descent is an iterative optimization algorithm used to find the minimum of a differentiable function. Given a convex and differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ with a global minimum $x^*$, we seek a point $\hat{x}$ such that:

$$|f(\hat{x}) - f(x^*)| \le \varepsilon.$$
Remark

We look for a point $\hat{x}$ whose function value is within a given tolerance $\varepsilon$ of the minimum value $f(x^*)$.
So, the goal is to find a point $\hat{x}$ that is, in terms of function value, very close to the global minimum $x^*$.

The key idea is to start from an initial point $x_0 \in \mathbb{R}^n$ and iteratively update it using gradient information, generating a sequence $\{x_k\}_{k=0,1,2,\dots}$ that satisfies:

$$f(x_{k+1}) < f(x_k).$$

Iterative Algorithm

  1. Choose an initial point $x_0 \in \mathbb{R}^n$.

  2. Update rule: The next point is computed using the gradient:

    $$x_{k+1} = x_k - \gamma \nabla f(x_k).$$

    Here, $\gamma > 0$ is the step size (learning rate).

  3. Repeat this process until a stopping criterion is satisfied (a runnable sketch follows this list).
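
To make the loop concrete, here is a minimal Python sketch of the iteration; the quadratic test function, the fixed step size, and the gradient-norm tolerance are illustrative choices, not part of the algorithm itself.

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, max_iters=1000, tol=1e-8):
    """Basic gradient descent: x_{k+1} = x_k - gamma * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # stop once the gradient is (nearly) zero
            break
        x = x - gamma * g
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x and minimum x* = 0.
x_hat = gradient_descent(grad_f=lambda x: x, x0=[3.0, -4.0], gamma=0.1)
print(x_hat)  # approaches [0, 0]
```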


Geometric Interpretation

![[gradient descent.png|700]]


Average Error in Gradient Descent

Over the first $K$ iterations, the error satisfies:

$$\sum_{k=0}^{K-1} \left( f(x_k) - f(x^*) \right) \le \frac{\gamma}{2} \sum_{k=0}^{K-1} \|\nabla f(x_k)\|^2 + \frac{1}{2\gamma} \|x_0 - x^*\|^2.$$
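
This bound follows from a standard telescoping argument; a sketch, using only the update rule and convexity:

$$\|x_{k+1} - x^*\|^2 = \|x_k - x^*\|^2 - 2\gamma \langle \nabla f(x_k), x_k - x^* \rangle + \gamma^2 \|\nabla f(x_k)\|^2,$$

and by convexity $\langle \nabla f(x_k), x_k - x^* \rangle \ge f(x_k) - f(x^*)$, hence

$$f(x_k) - f(x^*) \le \frac{\gamma}{2} \|\nabla f(x_k)\|^2 + \frac{1}{2\gamma} \left( \|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2 \right).$$

Summing over $k = 0, \dots, K-1$ telescopes the last term, and dropping $-\|x_K - x^*\|^2 \le 0$ gives the bound.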

Step Size Problems

If the step size $\gamma$ is too small, progress is very slow; if it is too large, the iterates can overshoot the minimum and oscillate or even diverge. The results below make the choice of $\gamma$ precise.


Choosing the Step Size

Theorem 1: Bounded Gradient

For a convex and differentiable function $f(x)$ with a global minimum $x^*$, assume:

$$\|x_0 - x^*\| \le R \quad \text{and} \quad \|\nabla f(x)\| \le B \quad \forall x.$$

Here $B$ is a Lipschitz constant of $f$ itself (a uniform bound on the gradient norm), not of its gradient.

Choosing the step size:

$$\gamma = \frac{R}{B \sqrt{K}},$$

yields:

$$\frac{1}{K} \sum_{k=0}^{K-1} \left( f(x_k) - f(x^*) \right) \le \frac{R B}{\sqrt{K}}.$$

Thus, the average error decreases as $O\!\left(\tfrac{1}{\sqrt{K}}\right)$.
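
A quick numerical sanity check of this step-size rule, using $f(x) = \sqrt{1 + x^2}$ as an assumed test function (convex, minimum at $x^* = 0$, and $|f'(x)| \le 1$, so $B = 1$):

```python
import numpy as np

f = lambda x: np.sqrt(1.0 + x**2)           # convex, minimized at x* = 0
grad_f = lambda x: x / np.sqrt(1.0 + x**2)  # |f'(x)| <= 1 everywhere

x0, x_star = 5.0, 0.0
B = 1.0                       # bound on the gradient norm
R = abs(x0 - x_star)          # distance from the start to the minimum
K = 10_000
gamma = R / (B * np.sqrt(K))  # step size from Theorem 1

x, gap_sum = x0, 0.0
for _ in range(K):
    gap_sum += f(x) - f(x_star)
    x -= gamma * grad_f(x)

# Average error vs. the theoretical bound R*B/sqrt(K)
print(gap_sum / K, "<=", R * B / np.sqrt(K))
```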


Theorem 2: Smooth Functions

If $f(x)$ is differentiable and smooth with parameter $L$, using the step size:

$$\gamma = \frac{1}{L},$$

gradient descent satisfies:

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2.$$

So, the function value decreases at every iteration where $\nabla f(x_k) \neq 0$.
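
This descent guarantee can be checked numerically; a sketch on the quadratic $f(x) = \frac{1}{2} x^\top A x$ with $A = \operatorname{diag}(1, 4)$, which is smooth with $L = 4$ (the matrix and starting point are arbitrary choices for the example):

```python
import numpy as np

A = np.diag([1.0, 4.0])  # f(x) = 1/2 x^T A x is smooth with L = lambda_max(A) = 4
L = 4.0
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x = np.array([2.0, -1.0])
gamma = 1.0 / L          # step size from Theorem 2

for _ in range(5):
    g = grad_f(x)
    x_next = x - gamma * g
    # Guarantee: f(x_{k+1}) <= f(x_k) - ||grad f(x_k)||^2 / (2L)
    print(f"{f(x_next):.6f} <= {f(x) - g @ g / (2 * L):.6f}")
    x = x_next
```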


Theorem 3: Smooth and Convex Functions

For a convex and differentiable function with smoothness parameter $L$, choosing:

$$\gamma = \frac{1}{L},$$

ensures:

$$f(x_K) - f(x^*) \le \frac{L}{2K} \|x_0 - x^*\|^2.$$

This means the function value converges to the minimum at a rate of $O\!\left(\tfrac{1}{K}\right)$.
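
A numerical illustration of this $O\!\left(\tfrac{1}{K}\right)$ rate on the same kind of quadratic (again, the matrix and starting point are arbitrary choices for the sketch):

```python
import numpy as np

A = np.diag([1.0, 4.0])  # f(x) = 1/2 x^T A x, smooth with L = 4, minimum f(x*) = 0 at x* = 0
L = 4.0
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x0 = np.array([2.0, -1.0])
x, gamma = x0.copy(), 1.0 / L

for K in range(1, 11):
    x = x - gamma * grad_f(x)
    # Theorem 3: f(x_K) - f(x*) <= L / (2K) * ||x0 - x*||^2
    bound = L / (2 * K) * np.dot(x0, x0)
    print(f"K={K:2d}  gap={f(x):.6f}  bound={bound:.6f}")
```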


Stopping Criteria

To stop the iteration process, we use one of the following conditions:

  1. Gradient norm is small (infinity norm):

    $$\max_i \left| \frac{\partial f}{\partial x_i} \right| < \varepsilon.$$
  2. Squared Euclidean norm of the gradient is small:

    $$\|\nabla f(x_k)\|^2 = \sum_{i=1}^n \left( \frac{\partial f}{\partial x_i} \right)^2 < \varepsilon.$$
  3. Function values stop decreasing:

    $$|f(x_{k+1}) - f(x_k)| < \varepsilon.$$

These conditions stop the algorithm once the iterates are close to a minimum or no longer make meaningful progress (a combined sketch follows below).
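
A sketch combining all three criteria in one loop; the criterion ordering and the shared tolerance $\varepsilon$ are illustrative choices:

```python
import numpy as np

def gradient_descent_with_stops(f, grad_f, x0, gamma, eps=1e-6, max_iters=10_000):
    """Run gradient descent until one of the three stopping criteria fires."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.max(np.abs(g)) < eps:      # 1. max gradient component (infinity norm)
            return x, "small max gradient component"
        if np.dot(g, g) < eps:           # 2. squared Euclidean norm of the gradient
            return x, "small squared gradient norm"
        x_next = x - gamma * g
        if abs(f(x_next) - f(x)) < eps:  # 3. function values stop decreasing
            return x_next, "function value stalled"
        x = x_next
    return x, "max iterations reached"

x_hat, reason = gradient_descent_with_stops(
    f=lambda x: 0.5 * np.dot(x, x), grad_f=lambda x: x,
    x0=[3.0, -4.0], gamma=0.1)
print(x_hat, reason)
```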


Summary

Gradient descent is widely used in machine learning, optimization, and deep learning due to its simplicity and efficiency! 🚀