SUBGRADIENT DESCENT


Subgradient descent is an extension of gradient descent to non-smooth convex functions. Unlike classical gradient descent, which requires differentiability, subgradient descent can optimize convex functions with kinks, i.e., points where the gradient does not exist.


The Subgradient Descent Algorithm

Given an initial point:

$$x_0 \in \mathbb{R}^n$$

and a subgradient:

$$g \in \partial f(x),$$

the update rule follows:

$$x_{k+1} := x_k - \gamma\, g_k,$$

where $g_k$ is any subgradient of $f$ at $x_k$, i.e. $g_k \in \partial f(x_k)$.
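
A minimal sketch of this update in Python is shown below. The objective ($f(x) = \|x\|_1$), its subgradient oracle, the starting point, the constant step size, and the iteration count are all illustrative assumptions, not part of the original notes.

```python
import numpy as np

def subgradient_descent(subgrad, x0, gamma, num_iters):
    """Run the basic update x_{k+1} = x_k - gamma * g_k,
    where g_k is whatever subgradient `subgrad` returns."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_iters):
        g = subgrad(x)       # any element of the subdifferential at x
        x = x - gamma * g    # subgradient step
        iterates.append(x.copy())
    return iterates

# Illustrative objective: f(x) = ||x||_1, whose subdifferential contains
# sign(x) (with any value in [-1, 1] at coordinates that are exactly 0).
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)

xs = subgradient_descent(subgrad, x0=[1.0, -2.0], gamma=0.1, num_iters=50)
print(f(xs[-1]))  # final function value
```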

Important Note:

Negative subgradients are not necessarily descent directions.

(Figure: subgradient descent does not descend; negative subgradients are not necessarily descent directions.)

This differs from classical gradient descent, where the negative gradient (when nonzero) is always a descent direction.


Example of Subgradient Descent

Consider the function:

$$f(x) = |x_1 - x_2| + 0.2\,|x_1 + x_2|$$

with a subgradient at $(1,1)$:

$$g(1,1) = (1.2,\, -0.8),$$

obtained by picking the subgradient $(1,-1)$ of the first term and adding the gradient $0.2\,(1,1)$ of the second term.

Since f(x) is not smooth, any subgradient could be used at each step, leading to different update directions.

An important observation is that

$$f(x_{k+1}) \leq f(x_k)$$

is not guaranteed: subgradient descent does not necessarily decrease the function value at each step.

Additionally, for a sufficiently small step size,

$$\|x_{k+1} - x^*\| \leq \|x_k - x^*\|,$$

i.e., the iterates do not move away from a minimizer $x^*$ (and in particular remain bounded), even if the function value does not strictly decrease.
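
A quick numeric check of the example above (using the subgradient $(1.2, -0.8)$ at $(1,1)$ and an illustrative step size of $0.1$) shows both effects at once: the function value increases after a single step, while the distance to the minimizer $x^* = (0,0)$ decreases.

```python
import numpy as np

f = lambda x: abs(x[0] - x[1]) + 0.2 * abs(x[0] + x[1])

x = np.array([1.0, 1.0])
g = np.array([1.2, -0.8])   # a subgradient of f at (1, 1)
gamma = 0.1                 # illustrative step size

x_new = x - gamma * g       # one subgradient step

print(f(x), f(x_new))                            # 0.4 -> 0.592: f goes up
print(np.linalg.norm(x), np.linalg.norm(x_new))  # ~1.414 -> ~1.393: distance to x* = 0 goes down
```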


How to Choose a Step Size?

In subgradient descent, the step size $\gamma_k$ plays a crucial role. There are several common choices:

1. Diminishing Step Size:

A common approach is to use a sequence $\gamma_k$ that tends to zero, with the normalized update

$$x_{k+1} = x_k - \gamma_k\, \frac{\partial^+ f(x_k)}{\|\partial^+ f(x_k)\|},$$

Remark:

The plus sign in $\partial^+ f(x_k)$ indicates that we are selecting one particular subgradient from the subdifferential set $\partial f(x)$. There are several possible interpretations:

  1. Selecting the steepest subgradient

    • If multiple subgradients exist (e.g., at a point of nondifferentiability), we might choose the one with the largest norm, or the one giving the steepest descent direction.
    • This is useful in algorithms like normalized subgradient descent to ensure consistent step sizes.
  2. Selecting a specific element in a structured way

    • Some optimization methods require a deterministic or structured choice of subgradient.
    • The plus sign may indicate a particular selection rule, e.g., the right-hand derivative in the one-dimensional case.
  3. Right subgradient in nonsmooth analysis

    • In some contexts (such as directional derivatives), $\partial^+ f(x)$ may refer to the right subdifferential, i.e., the subgradients corresponding to increasing values of $x$.

where the step sizes $\gamma_k$ satisfy:

$$\gamma_k \to 0, \qquad \sum_{k=0}^{\infty} \gamma_k = \infty.$$

Typical diminishing choices include, for example, $\gamma_k = \frac{c}{k+1}$ or $\gamma_k = \frac{c}{\sqrt{k+1}}$ for a constant $c > 0$.

2. If the optimal function value $f(x^*)$ is known:

The update rule can be modified to use the Polyak step size:

$$x_{k+1} = x_k - \frac{f(x_k) - f(x^*)}{\|\partial^+ f(x_k)\|^2}\, \partial^+ f(x_k).$$

3. Using the best observed function value:

Define:

$$f_k^{\text{best}} := \min_{1 \leq i \leq k} f(x_i),$$

and use this running minimum in place of the unknown optimal value:

$$f^* \approx \min_{1 \leq i \leq k} f(x_i).$$

This keeps track of the best function value observed so far, which is also the quantity for which convergence guarantees are typically stated, since individual iterations need not decrease $f$. A short sketch combining these step-size rules follows below.
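
The sketch below combines rules 2 and 3 in Python, using the example function from earlier, whose optimal value $f(x^*) = 0$ is known; the particular subgradient selection, the starting point, and the iteration budget are illustrative assumptions.

```python
import numpy as np

def f(x):
    return abs(x[0] - x[1]) + 0.2 * abs(x[0] + x[1])

def subgrad(x):
    # One valid subgradient choice (sign(0) = 0 is an admissible element).
    return np.sign(x[0] - x[1]) * np.array([1.0, -1.0]) \
         + 0.2 * np.sign(x[0] + x[1]) * np.array([1.0, 1.0])

f_star = 0.0                    # known optimal value for this example (rule 2)
x = np.array([1.0, 1.0])
f_best = f(x)                   # best value observed so far (rule 3)

for k in range(100):
    g = subgrad(x)
    if np.linalg.norm(g) == 0:  # 0 is a subgradient, so x is already optimal
        break
    # Polyak-style step (rule 2); if f_star is unknown, f_best can serve as a surrogate (rule 3)
    gamma = (f(x) - f_star) / np.linalg.norm(g) ** 2
    x = x - gamma * g
    f_best = min(f_best, f(x))

print(x, f_best)
```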


Convergence of Subgradient Descent

Lipschitz Convex Functions

If $f : \mathbb{R}^n \to \mathbb{R}$ is convex and $B$-Lipschitz continuous with a global minimum $x^*$, and

$$\|x_0 - x^*\| \leq R,$$

then choosing the constant step size

$$\gamma := \frac{R}{B\sqrt{T}}$$

yields

$$\frac{1}{T} \sum_{t=0}^{T-1} f(x_t) \;-\; f(x^*) \;\leq\; \frac{RB}{\sqrt{T}}.$$
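
As a sanity check of this bound (an illustration, not part of the original notes), the snippet below runs the constant-step method on the earlier example $f(x) = |x_1 - x_2| + 0.2\,|x_1 + x_2|$, which is Lipschitz with constant at most $1.2\sqrt{2}$ and minimized at the origin; the starting point and horizon $T$ are arbitrary choices.

```python
import numpy as np

def f(x):
    return abs(x[0] - x[1]) + 0.2 * abs(x[0] + x[1])

def subgrad(x):
    return np.sign(x[0] - x[1]) * np.array([1.0, -1.0]) \
         + 0.2 * np.sign(x[0] + x[1]) * np.array([1.0, 1.0])

T = 10_000
x = np.array([1.0, 1.0])
R = np.linalg.norm(x)         # distance to the minimizer x* = 0
B = 1.2 * np.sqrt(2)          # bound on the subgradient norms
gamma = R / (B * np.sqrt(T))  # constant step size from the theorem

total = 0.0
for _ in range(T):
    total += f(x)
    x = x - gamma * subgrad(x)

print(total / T, R * B / np.sqrt(T))  # average suboptimality vs. the RB/sqrt(T) bound
```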

Strongly Convex Functions

If $f : \mathbb{R}^n \to \mathbb{R}$ is strongly convex with parameter $\mu > 0$, then using the decreasing step size

$$\gamma_k = \frac{2}{\mu (k+1)}$$

yields

$$f\!\left( \frac{2}{K(K+1)} \sum_{i=1}^{K} i\, x_i \right) - f(x^*) \;\leq\; \frac{2B^2}{\mu (K+1)},$$

where

$$B = \max_k \|g_k\|.$$

This shows that subgradient descent converges more slowly than standard gradient descent on smooth problems, but it still guarantees steady progress over time.
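
To illustrate the weighted averaging in the strongly convex case (again an assumption-laden sketch, not from the original notes), the snippet below applies the step size $\gamma_k = \frac{2}{\mu(k+1)}$ to the one-dimensional function $f(x) = |x| + \tfrac{\mu}{2} x^2$, which is $\mu$-strongly convex but non-smooth at $0$.

```python
import numpy as np

mu = 1.0
f = lambda x: abs(x) + 0.5 * mu * x**2      # mu-strongly convex, kink at 0, minimum f(0) = 0
subgrad = lambda x: np.sign(x) + mu * x     # one valid subgradient choice

K = 1_000
x = 5.0
weighted_sum = 0.0                          # accumulates sum_{i=1}^{K} i * x_i

for k in range(1, K + 1):
    weighted_sum += k * x
    gamma = 2.0 / (mu * (k + 1))            # decreasing step size from the theorem
    x = x - gamma * subgrad(x)

x_avg = 2.0 / (K * (K + 1)) * weighted_sum  # weighted average iterate
print(f(x_avg))                             # close to f(x*) = 0
```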


Subgradient Descent for Constrained Optimization

The projected subgradient method is used for constrained optimization, where the iterates must remain within a feasible set $X$:

$$\min_{x \in X} f(x), \qquad X \subseteq \operatorname{dom}(f).$$

The update step includes a projection onto the feasible set:

  1. Compute:

    $$y_k = x_k - \gamma_k g_k, \qquad g_k \in \partial f(x_k).$$
  2. Project onto $X$:

    $$x_{k+1} = P_X(y_k) = P_X(x_k - \gamma_k g_k).$$

This ensures that all iterates remain feasible.
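
A minimal sketch of the projected method is given below, using Euclidean projection onto a ball as an illustrative feasible set; the objective, radius, starting point, and step-size schedule are assumptions for the example.

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else radius * y / norm

def f(x):
    return abs(x[0] - x[1]) + 0.2 * abs(x[0] + x[1])

def subgrad(x):
    return np.sign(x[0] - x[1]) * np.array([1.0, -1.0]) \
         + 0.2 * np.sign(x[0] + x[1]) * np.array([1.0, 1.0])

x = np.array([3.0, -2.0])     # infeasible start; the first projection fixes this
for k in range(1, 201):
    gamma = 1.0 / np.sqrt(k)  # diminishing step size
    y = x - gamma * subgrad(x)
    x = project_ball(y)       # projection keeps the iterate feasible

print(x, f(x))
```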

A key inequality that governs convergence:

$$\|x_{k+1} - x^*\|^2 \;\leq\; \|x_k - x^*\|^2 - 2 \gamma_k \left( f(x_k) - f^{\text{opt}} \right) + \gamma_k^2 \|g_k\|^2.$$

This highlights the trade-off between step size, function values, and convergence.
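
This inequality can be derived by combining the nonexpansiveness of the projection $P_X$ with the subgradient inequality; a short sketch, writing $f^{\text{opt}} = f(x^*)$ and using $x^* = P_X(x^*)$ since $x^* \in X$:

$$
\begin{aligned}
\|x_{k+1} - x^*\|^2
  &= \|P_X(x_k - \gamma_k g_k) - P_X(x^*)\|^2
  \;\leq\; \|x_k - \gamma_k g_k - x^*\|^2 \\
  &= \|x_k - x^*\|^2 - 2\gamma_k \langle g_k,\, x_k - x^* \rangle + \gamma_k^2 \|g_k\|^2 \\
  &\leq \|x_k - x^*\|^2 - 2\gamma_k \left( f(x_k) - f^{\text{opt}} \right) + \gamma_k^2 \|g_k\|^2,
\end{aligned}
$$

where the first inequality uses that projections onto convex sets are nonexpansive, and the last uses the subgradient inequality $f(x^*) \geq f(x_k) + \langle g_k,\, x^* - x_k \rangle$.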


Summary

Subgradient descent is widely used for non-smooth optimization problems, such as L1-regularized learning (Lasso) and robust optimization.