GRADIENT DESCENT


Gradient descent is an iterative optimization algorithm used to find the minimum of a differentiable function. Given a convex and differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ with a global minimum $x^*$, we seek a point $\hat{x}$ such that:

$$|f(\hat{x}) - f(x^*)| \le \varepsilon.$$
Remark

We look for a point $\hat{x}$ whose function value is within a given tolerance $\varepsilon$ of the minimum value $f(x^*)$.
So, the goal is to find a point $\hat{x}$ that is, in terms of function value, very close to the global minimum $x^*$.

The key idea is to start from an initial point $x_0 \in \mathbb{R}^n$ and iteratively update it using gradient information, generating a sequence $\{x_k\}_{k=0,1,2,\dots}$ that satisfies:

$$f(x_{k+1}) < f(x_k).$$

Iterative Algorithm

  1. Choose an initial point $x_0 \in \mathbb{R}^n$.

  2. Update rule: The next point is computed using the gradient:

    $$x_{k+1} = x_k - \gamma \nabla f(x_k).$$

    Here, $\gamma > 0$ is the step size (learning rate).

  3. Repeat this process until a stopping criterion is satisfied (a runnable sketch follows this list).
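
To make the loop concrete, here is a minimal Python sketch of the iteration; the quadratic test function, the fixed step size, and the gradient-norm tolerance are illustrative choices, not part of the algorithm itself.

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, max_iters=1000, tol=1e-8):
    """Basic gradient descent: x_{k+1} = x_k - gamma * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # stop once the gradient is (nearly) zero
            break
        x = x - gamma * g
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x and minimum x* = 0.
x_hat = gradient_descent(grad_f=lambda x: x, x0=[3.0, -4.0], gamma=0.1)
print(x_hat)  # approaches [0, 0]
```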


Geometric Interpretation

![[gradient descent.png|700]]


Average Error in Gradient Descent

Over the first $K$ iterations, the error satisfies:

$$\sum_{k=0}^{K-1} \left( f(x_k) - f(x^*) \right) \le \frac{\gamma}{2} \sum_{k=0}^{K-1} \|\nabla f(x_k)\|^2 + \frac{1}{2\gamma} \|x_0 - x^*\|^2.$$
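
This bound follows from a standard telescoping argument; a sketch, using only the update rule and convexity:

$$\|x_{k+1} - x^*\|^2 = \|x_k - x^*\|^2 - 2\gamma \langle \nabla f(x_k), x_k - x^* \rangle + \gamma^2 \|\nabla f(x_k)\|^2,$$

and by convexity $\langle \nabla f(x_k), x_k - x^* \rangle \ge f(x_k) - f(x^*)$, hence

$$f(x_k) - f(x^*) \le \frac{\gamma}{2} \|\nabla f(x_k)\|^2 + \frac{1}{2\gamma} \left( \|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2 \right).$$

Summing over $k = 0, \dots, K-1$ telescopes the last term, and dropping $-\|x_K - x^*\|^2 \le 0$ gives the bound.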

Step Size Problems

If the step size $\gamma$ is too small, progress is very slow; if it is too large, the iterates can overshoot the minimum and oscillate or even diverge. The results below make the choice of $\gamma$ precise.


Choosing the Step Size

Theorem 1: Bounded Gradient

For a convex and differentiable function $f(x)$ with a global minimum $x^*$, assume:

$$\|x_0 - x^*\| \le R \quad \text{and} \quad \|\nabla f(x)\| \le B \quad \forall x.$$

Here $B$ is a Lipschitz constant of $f$ itself (a uniform bound on the gradient norm), not of its gradient.

Choosing the step size:

$$\gamma = \frac{R}{B \sqrt{K}},$$

yields:

$$\frac{1}{K} \sum_{k=0}^{K-1} \left( f(x_k) - f(x^*) \right) \le \frac{R B}{\sqrt{K}}.$$

Thus, the average error decreases as $O\!\left(\tfrac{1}{\sqrt{K}}\right)$.
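
A quick numerical sanity check of this step-size rule, using $f(x) = \sqrt{1 + x^2}$ as an assumed test function (convex, minimum at $x^* = 0$, and $|f'(x)| \le 1$, so $B = 1$):

```python
import numpy as np

f = lambda x: np.sqrt(1.0 + x**2)           # convex, minimized at x* = 0
grad_f = lambda x: x / np.sqrt(1.0 + x**2)  # |f'(x)| <= 1 everywhere

x0, x_star = 5.0, 0.0
B = 1.0                       # bound on the gradient norm
R = abs(x0 - x_star)          # distance from the start to the minimum
K = 10_000
gamma = R / (B * np.sqrt(K))  # step size from Theorem 1

x, gap_sum = x0, 0.0
for _ in range(K):
    gap_sum += f(x) - f(x_star)
    x -= gamma * grad_f(x)

# Average error vs. the theoretical bound R*B/sqrt(K)
print(gap_sum / K, "<=", R * B / np.sqrt(K))
```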


Theorem 2: Smooth Functions

If $f(x)$ is differentiable and smooth with parameter $L$, using the step size:

$$\gamma = \frac{1}{L},$$

gradient descent satisfies:

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2.$$

So, the function value decreases at every iteration where $\nabla f(x_k) \neq 0$.
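
This descent guarantee can be checked numerically; a sketch on the quadratic $f(x) = \frac{1}{2} x^\top A x$ with $A = \operatorname{diag}(1, 4)$, which is smooth with $L = 4$ (the matrix and starting point are arbitrary choices for the example):

```python
import numpy as np

A = np.diag([1.0, 4.0])  # f(x) = 1/2 x^T A x is smooth with L = lambda_max(A) = 4
L = 4.0
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x = np.array([2.0, -1.0])
gamma = 1.0 / L          # step size from Theorem 2

for _ in range(5):
    g = grad_f(x)
    x_next = x - gamma * g
    # Guarantee: f(x_{k+1}) <= f(x_k) - ||grad f(x_k)||^2 / (2L)
    print(f"{f(x_next):.6f} <= {f(x) - g @ g / (2 * L):.6f}")
    x = x_next
```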


Theorem 3: Smooth and Convex Functions

For a convex and differentiable function with smoothness parameter $L$, choosing:

$$\gamma = \frac{1}{L},$$

ensures:

$$f(x_K) - f(x^*) \le \frac{L}{2K} \|x_0 - x^*\|^2.$$

This means the function value converges to the minimum at a rate of $O\!\left(\tfrac{1}{K}\right)$.
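
A numerical illustration of this $O\!\left(\tfrac{1}{K}\right)$ rate on the same kind of quadratic (again, the matrix and starting point are arbitrary choices for the sketch):

```python
import numpy as np

A = np.diag([1.0, 4.0])  # f(x) = 1/2 x^T A x, smooth with L = 4, minimum f(x*) = 0 at x* = 0
L = 4.0
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

x0 = np.array([2.0, -1.0])
x, gamma = x0.copy(), 1.0 / L

for K in range(1, 11):
    x = x - gamma * grad_f(x)
    # Theorem 3: f(x_K) - f(x*) <= L / (2K) * ||x0 - x*||^2
    bound = L / (2 * K) * np.dot(x0, x0)
    print(f"K={K:2d}  gap={f(x):.6f}  bound={bound:.6f}")
```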


Stopping Criteria

To stop the iteration process, we use one of the following conditions:

  1. Gradient norm is small (infinity norm):

    $$\max_i \left| \frac{\partial f}{\partial x_i} \right| < \varepsilon.$$
  2. Squared Euclidean norm of the gradient is small:

    $$\|\nabla f(x_k)\|^2 = \sum_{i=1}^n \left( \frac{\partial f}{\partial x_i} \right)^2 < \varepsilon.$$
  3. Function values stop decreasing:

    $$|f(x_{k+1}) - f(x_k)| < \varepsilon.$$

These conditions stop the algorithm once the iterates are close to a minimum or no longer make meaningful progress (a combined sketch follows below).
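
A sketch combining all three criteria in one loop; the criterion ordering and the shared tolerance $\varepsilon$ are illustrative choices:

```python
import numpy as np

def gradient_descent_with_stops(f, grad_f, x0, gamma, eps=1e-6, max_iters=10_000):
    """Run gradient descent until one of the three stopping criteria fires."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.max(np.abs(g)) < eps:      # 1. max gradient component (infinity norm)
            return x, "small max gradient component"
        if np.dot(g, g) < eps:           # 2. squared Euclidean norm of the gradient
            return x, "small squared gradient norm"
        x_next = x - gamma * g
        if abs(f(x_next) - f(x)) < eps:  # 3. function values stop decreasing
            return x_next, "function value stalled"
        x = x_next
    return x, "max iterations reached"

x_hat, reason = gradient_descent_with_stops(
    f=lambda x: 0.5 * np.dot(x, x), grad_f=lambda x: x,
    x0=[3.0, -4.0], gamma=0.1)
print(x_hat, reason)
```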


Summary

Gradient descent is widely used in machine learning, optimization, and deep learning due to its simplicity and efficiency! 🚀