14. PROXIMAL GRADIENT DESCENT


Proximal Gradient Descent extends gradient descent to composite functions, i.e., sums of a smooth function and a possibly non-smooth function. It keeps the usual gradient step for the smooth part and adds a proximal step that accounts for the non-smooth term (typically a regularizer).


Composite Optimization Problems

We consider optimization problems of the form:

$$ f(x) := g(x) + h(x) $$

where:

  - $g(x)$ is convex and differentiable (smooth),
  - $h(x)$ is convex but possibly non-smooth (not necessarily differentiable).

The challenge: because $h$ may be non-differentiable, $f$ is in general not differentiable either, so standard gradient descent cannot be applied to $f$ directly.
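
A standard instance of this composite form is Lasso regression (also mentioned in the summary at the end of this section):

$$ g(x) = \tfrac{1}{2} \| Ax - b \|_2^2, \qquad h(x) = \lambda \| x \|_1, $$

Here $g$ is smooth with gradient $\nabla g(x) = A^\top (Ax - b)$, while $h$ is convex but not differentiable wherever a coordinate of $x$ is zero.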


Idea of Proximal Gradient Descent

The standard gradient descent update for minimizing a smooth function $g(x)$ can be written as the minimizer of a local quadratic model around $x_t$ (whose minimizer is exactly the usual step $x_t - \gamma \nabla g(x_t)$):

$$ x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( g(x_t) + \nabla g(x_t)^\top (y - x_t) + \frac{1}{2\gamma} \| y - x_t \|^2 \right) $$

For composite functions $f(x) = g(x) + h(x)$, we modify the update step to include $h(x)$:

$$ x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( g(x_t) + \nabla g(x_t)^\top (y - x_t) + \frac{1}{2\gamma} \| y - x_t \|^2 + h(y) \right) $$

We have simply added the non-smooth term $h(y)$ to the objective.

Rewriting this, we get the proximal gradient descent update:

$$ x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( \frac{1}{2\gamma} \left\| y - \big( x_t - \gamma \nabla g(x_t) \big) \right\|^2 + h(y) \right) $$
Note!

Here is the explanation of this transformation: Proximal GD Idea Explained
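
In short, the transformation is a completion of the square (a sketch only; the linked note gives the full derivation). Collecting the terms of the previous objective that depend on $y$,

$$ g(x_t) + \nabla g(x_t)^\top (y - x_t) + \frac{1}{2\gamma} \| y - x_t \|^2 = \frac{1}{2\gamma} \left\| y - \big( x_t - \gamma \nabla g(x_t) \big) \right\|^2 + c_t, $$

where $c_t = g(x_t) - \frac{\gamma}{2} \| \nabla g(x_t) \|^2$ does not depend on $y$; adding $h(y)$ to both sides shows that the two $\arg\min$ problems have the same minimizer.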


Proximal Gradient Descent Algorithm

An iteration of proximal gradient descent is defined as:

$$ x_{t+1} := \operatorname{prox}_{h,\gamma}\left( x_t - \gamma \nabla g(x_t) \right) $$

where $\operatorname{prox}_{h,\gamma}$ is the proximal mapping for a given function $h$ and parameter $\gamma > 0$.

Steps

  1. Gradient Descent Step:
    Compute $z = x_t - \gamma \nabla g(x_t)$ (just like a gradient descent step)

  2. Proximal Minimization:

    Compute the proximal operator:

    $$ x_{t+1} = \arg\min_{y} \left( \frac{1}{2\gamma} \| y - z \|^2 + h(y) \right) $$

This step ensures that $x_{t+1}$ remains close to $z$ while also incorporating the non-smooth term $h(y)$.
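
To make the two steps concrete, here is a minimal Python/NumPy sketch for the Lasso case $h(x) = \lambda \| x \|_1$ (the regularizer mentioned in the summary); the function names and the synthetic data are illustrative only:

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (elementwise soft-thresholding):
    # prox(z)_i = sign(z_i) * max(|z_i| - tau, 0)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_descent(A, b, lam, n_iters=500):
    # Minimize f(x) = 0.5 * ||A x - b||^2 + lam * ||x||_1.
    # g(x) = 0.5 * ||A x - b||^2 is L-smooth with L = ||A||_2^2,
    # so gamma = 1 / L is a valid step size.
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        z = x - gamma * (A.T @ (A @ x - b))   # 1. gradient step on g
        x = soft_threshold(z, gamma * lam)    # 2. proximal step on h
    return x

# Illustrative usage on synthetic data with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.round(proximal_gradient_descent(A, b, lam=0.1), 2))
```

With this choice of $h$, the update above is the classical iterative soft-thresholding algorithm (ISTA).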


Proximal Gradient Descent as a Generalization of Gradient Descent

Proximal gradient descent recovers basic gradient descent and projected gradient descent as special cases (a short code sketch follows this list):

  - If $h \equiv 0$, the proximal step does nothing, and the update reduces to plain gradient descent, $x_{t+1} = x_t - \gamma \nabla g(x_t)$.
  - If $h$ is the indicator function of a closed convex set $C$ ($h(y) = 0$ for $y \in C$, $h(y) = +\infty$ otherwise), the proximal step becomes the Euclidean projection onto $C$, and the update reduces to projected gradient descent.
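
Here is a minimal Python sketch of the two corresponding proximal operators (the box constraint is just an illustrative choice of the set $C$; the function names are not from the text):

```python
import numpy as np

def prox_zero(z, gamma):
    # h(x) = 0: the proximal step is the identity, so
    # x_{t+1} = prox(x_t - gamma * grad_g(x_t)) is plain gradient descent.
    return z

def prox_box_indicator(z, gamma, lo=-1.0, hi=1.0):
    # h(x) = indicator of the box C = [lo, hi]^n: the proximal step is the
    # Euclidean projection onto C (independent of gamma), i.e. clipping,
    # so the update becomes projected gradient descent.
    return np.clip(z, lo, hi)
```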


Convergence Rates of Proximal Gradient Descent

The convergence analysis of proximal gradient descent follows the same principles as that of gradient descent, now extended to composite objectives with a non-smooth term $h(x)$.

If $g(x)$ is convex and $L$-smooth, $h(x)$ is convex, and we set

$$ \gamma_k = \frac{1}{L}, $$

then proximal gradient descent satisfies

$$ f(x_k) - f^* \le \frac{L}{2k} \, \| x_0 - x^* \|^2. $$

This shows that proximal gradient descent converges at a rate of $O(1/k)$, similar to standard gradient descent for convex and smooth functions.


Summary

This method is widely used in sparse learning, compressed sensing, and machine learning, where non-smooth regularization terms (e.g., L1-norm in Lasso regression) play a key role in inducing sparsity.