STEEPEST GRADIENT DESCENT


Steepest gradient descent is a variation of the gradient descent method where the step size is not fixed but instead chosen dynamically at each iteration. This choice gives the largest possible decrease of the function along the negative gradient direction at each iteration.

Update Rule

The update rule for steepest gradient descent is given by:

$$x_{k+1} = x_k - \gamma_k \nabla f(x_k),$$

where the step size $\gamma_k$ is chosen to minimize the function along the descent direction:

$$\gamma_k = \operatorname*{arg\,min}_{\gamma > 0} f(x_k - \gamma \nabla f(x_k)).$$

This means that at each iteration, we find the optimal step size $\gamma_k$ by minimizing the function along the direction of the negative gradient; this one-dimensional subproblem is known as an exact line search.
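
Since this scalar subproblem rarely has a closed-form solution, it is usually solved numerically. The sketch below is illustrative rather than part of the source: it assumes SciPy is available, and the search interval $(0, 10)$ for the step size is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x0, n_iters=50, tol=1e-10):
    """Gradient descent with an (approximately) exact line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # gradient vanishes: stop
            break
        phi = lambda gamma: f(x - gamma * g)  # f restricted to the ray
        # solve gamma_k = argmin_{gamma > 0} f(x_k - gamma * grad f(x_k))
        gamma = minimize_scalar(phi, bounds=(0.0, 10.0), method="bounded").x
        x = x - gamma * g
    return x

# Example: ill-conditioned quadratic f(x) = x1^2 + 10 x2^2
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(steepest_descent(f, grad, [5.0, 1.0]))  # approaches [0, 0]
```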


Geometric Interpretation

![[Steepest gradient descent.png|700]]

The figure illustrates the difference between fixed step size gradient descent and steepest gradient descent:

  1. Fixed Step Size in Gradient Descent:

    • The step size is constant across all iterations, leading to potential inefficiencies in convergence.
  2. Adaptive Step Size in Steepest Gradient Descent:

    • The step size is chosen optimally at each iteration, ensuring more efficient movement towards the minimum.

Convergence Properties

Theorem:
Let $\{x_k\}$ be a convergent sequence generated by the steepest descent algorithm applied to a function $f$. Then, in the worst case, the order of convergence of $\{x_k\}$ is $1$.

This means that in the worst case, steepest gradient descent has a linear rate of convergence.

What Does This Mean?

Linear (order-1) convergence means that the error shrinks by at most a constant factor per iteration: there exists $c \in (0, 1)$ such that $\|x_{k+1} - x^*\| \le c\,\|x_k - x^*\|$ for all sufficiently large $k$, where $x^*$ is the limit point. Progress is steady but does not accelerate the way higher-order methods (e.g., Newton's method) can.

Modification: Step Size Reduction (Step Shredding)

A practical modification of steepest gradient descent involves step size reduction, also called step shredding. Here, the step size is adjusted dynamically at each iteration based on the following sufficient-decrease condition:

$$f(x_{k+1}) \le f(x_k) - \varepsilon \gamma \|\nabla f(x_k)\|^2,$$

where $\varepsilon \in (0, 1)$ is a pre-selected method parameter that controls the minimum required decrease in the function value.

Remark

Very often, $\varepsilon$ is set to a small value, such as $0.1$.

This condition ensures that each step results in sufficient function decrease.
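
To make the check concrete, it can be written as a small Python predicate (a sketch; the names are illustrative, not from the source):

```python
import numpy as np

def sufficient_decrease(f, x_old, x_new, g_old, gamma, eps):
    """True iff f(x_new) <= f(x_old) - eps * gamma * ||grad f(x_old)||^2."""
    return f(x_new) <= f(x_old) - eps * gamma * float(np.dot(g_old, g_old))
```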

If the condition is not satisfied, we reduce the step size $\gamma_k$ using a reduction factor $\delta \in (0, 1)$:

$$\gamma_k \leftarrow \delta \gamma_k.$$

The parameter $\delta$ determines how aggressively the step size is reduced when the descent condition is not met.

Remark

Very often, the reduction factor $\delta$ is set to a small value, such as $0.5$.

Why is this modification useful?

Solving the line-search problem $\arg\min_{\gamma > 0} f(x_k - \gamma \nabla f(x_k))$ exactly at every iteration can be expensive, and a closed-form solution rarely exists. Step shredding replaces it with a cheap test: start from a fixed trial step and shrink it only when the descent condition fails, so steps stay as large as possible while a decrease in function value is still guaranteed.

Algorithm for Modified Steepest Descent

  1. Initialize $x_0 \in \mathbb{R}^n$, an arbitrary base step size $\gamma$ (the same in all iterations), and parameters $\varepsilon, \delta \in (0, 1)$.

  2. For each iteration $k = 0, 1, 2, \dots$:

    Set the initial $\gamma_k = \gamma$.
    Compute the update:

    $$x_{k+1} = x_k - \gamma_k \nabla f(x_k).$$

    Check whether the inequality

    $$f(x_{k+1}) \le f(x_k) - \varepsilon \gamma_k \|\nabla f(x_k)\|^2$$

    is satisfied.
    If satisfied: accept $\gamma_k$ and move to the next iteration.
    If not satisfied: reduce the step size, $\gamma_k \leftarrow \delta \gamma_k$, and repeat the update.

This method keeps the step size as large as possible for fast progress while still guaranteeing a decrease in the function value at every iteration.
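
The loop can be written compactly in Python. This is a minimal sketch under the stated assumptions ($f$ differentiable, gradient supplied as a callable); all names are illustrative.

```python
import numpy as np

def modified_steepest_descent(f, grad, x0, gamma=1.0, eps=0.1, delta=0.5,
                              n_iters=100, tol=1e-8):
    """Steepest descent with step shredding (backtracking on the step size)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stationary point reached
            break
        gamma_k = gamma               # step 2: reset to the base step size
        x_new = x - gamma_k * g
        # Shred gamma_k until the sufficient-decrease condition holds; for a
        # differentiable f with g != 0 this terminates, because the condition
        # is satisfied for all small enough gamma_k (as eps < 1).
        while f(x_new) > f(x) - eps * gamma_k * np.dot(g, g):
            gamma_k *= delta
            x_new = x - gamma_k * g
        x = x_new
    return x
```

Note that $\gamma_k$ is reset to the base value $\gamma$ at the start of every iteration, exactly as in step 2 above.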


Practical Insights

Steepest gradient descent adapts the step size at every iteration instead of using one fixed value, which typically yields faster and more reliable convergence than plain gradient descent. The price is extra work per iteration: either a one-dimensional minimization (exact line search) or a few evaluations of the descent condition (step shredding).


Simple Example of the Algorithm

Finding the Minimum of $f(x) = x^2$

Let’s minimize:

$$f(x) = x^2$$

using Modified Steepest Descent.

Step 1: Initialize Parameters

Take $x_0 = 5$, base step size $\gamma = 1$, $\varepsilon = 0.1$, and $\delta = 0.5$.

Step 2: Iterate

  1. Compute Gradient:

    $$\nabla f(x) = 2x$$

    At $x_0 = 5$:

    $$\nabla f(5) = 10$$
  2. Compute Update:

    $$x_1 = x_0 - \gamma \nabla f(x_0) = 5 - 1 \cdot 10 = -5$$
  3. Check Descent Condition:

    $$f(x_1) \le f(x_0) - \varepsilon \gamma \|\nabla f(x_0)\|^2$$
    $$(-5)^2 \le 5^2 - 0.1 \cdot 1 \cdot 10^2$$
    $$25 \le 25 - 10$$
    $$25 \le 15 \quad \text{(false)}$$

Since the condition fails, we reduce the step size:

$$\gamma_0 = \delta \gamma = 0.5 \cdot 1 = 0.5$$

Then repeat the update with the smaller step size.


Step 3: New Update with Reduced γ

  1. Compute New Update:

    $$x_1 = 5 - 0.5 \cdot 10 = 0$$
  2. Check Descent Condition Again:

    $$f(0) \le 5^2 - 0.1 \cdot 0.5 \cdot 10^2$$
    $$0 \le 25 - 5$$
    $$0 \le 20 \quad \text{(true)}$$

Since the condition holds, we accept $x_1 = 0$ and move to the next iteration. (Here $x_1 = 0$ is in fact the exact minimizer of $f$, so $\nabla f(x_1) = 0$ and the algorithm stops.)
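
The two checks above can be reproduced in a few lines of Python, using the same parameters as the worked example:

```python
f = lambda x: x * x        # objective f(x) = x^2
grad = lambda x: 2.0 * x   # gradient f'(x) = 2x
x0, gamma, eps, delta = 5.0, 1.0, 0.1, 0.5

x1 = x0 - gamma * grad(x0)                           # 5 - 1 * 10 = -5
print(f(x1) <= f(x0) - eps * gamma * grad(x0) ** 2)  # False: 25 <= 15 fails

gamma *= delta                                       # shred: gamma = 0.5
x1 = x0 - gamma * grad(x0)                           # 5 - 0.5 * 10 = 0
print(f(x1) <= f(x0) - eps * gamma * grad(x0) ** 2)  # True: 0 <= 20 holds
```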