14. PROXIMAL GRADIENT DESCENT


Proximal Gradient Descent extends gradient descent to composite functions, i.e., sums of a smooth function and a possibly non-smooth function. It keeps the usual gradient step for the smooth part and adds a proximal step that accounts for the non-smooth term (typically a regularizer).


Composite Optimization Problems

We consider optimization problems of the form:

$$ f(x) := g(x) + h(x) $$

where:

  - $g(x)$ is convex and differentiable (smooth),
  - $h(x)$ is convex but possibly non-smooth (not necessarily differentiable).

The challenge: because $h$ may be non-differentiable, $f$ is in general not differentiable either, so standard gradient descent cannot be applied to $f$ directly.
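
A standard instance of this composite form is Lasso regression (also mentioned in the summary at the end of this section):

$$ g(x) = \tfrac{1}{2} \| Ax - b \|_2^2, \qquad h(x) = \lambda \| x \|_1, $$

Here $g$ is smooth with gradient $\nabla g(x) = A^\top (Ax - b)$, while $h$ is convex but not differentiable wherever a coordinate of $x$ is zero.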


Idea of Proximal Gradient Descent

The standard gradient descent update for minimizing a smooth function $g(x)$ can be written as the minimizer of a local quadratic model around $x_t$ (whose minimizer is exactly the usual step $x_t - \gamma \nabla g(x_t)$):

$$ x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( g(x_t) + \nabla g(x_t)^\top (y - x_t) + \frac{1}{2\gamma} \| y - x_t \|^2 \right) $$

For composite functions $f(x) = g(x) + h(x)$, we modify the update step to include $h(x)$:

$$ x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( g(x_t) + \nabla g(x_t)^\top (y - x_t) + \frac{1}{2\gamma} \| y - x_t \|^2 + h(y) \right) $$

We have simply added the non-smooth term $h(y)$ to the objective.

Rewriting this, we get the proximal gradient descent update:

$$ x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( \frac{1}{2\gamma} \left\| y - \big( x_t - \gamma \nabla g(x_t) \big) \right\|^2 + h(y) \right) $$
Note!

Here is the explanation of this transformation: Proximal GD Idea Explained
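
In short, the transformation is a completion of the square (a sketch only; the linked note gives the full derivation). Collecting the terms of the previous objective that depend on $y$,

$$ g(x_t) + \nabla g(x_t)^\top (y - x_t) + \frac{1}{2\gamma} \| y - x_t \|^2 = \frac{1}{2\gamma} \left\| y - \big( x_t - \gamma \nabla g(x_t) \big) \right\|^2 + c_t, $$

where $c_t = g(x_t) - \frac{\gamma}{2} \| \nabla g(x_t) \|^2$ does not depend on $y$; adding $h(y)$ to both sides shows that the two $\arg\min$ problems have the same minimizer.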


Proximal Gradient Descent Algorithm

An iteration of proximal gradient descent is defined as:

$$ x_{t+1} := \operatorname{prox}_{h,\gamma}\left( x_t - \gamma \nabla g(x_t) \right) $$

where $\operatorname{prox}_{h,\gamma}$ is the proximal mapping for a given function $h$ and parameter $\gamma > 0$.

Steps

  1. Gradient Descent Step:
    Compute $z = x_t - \gamma \nabla g(x_t)$ (just like a gradient descent step)

  2. Proximal Minimization:

    Compute the proximal operator:

    $$ x_{t+1} = \arg\min_{y} \left( \frac{1}{2\gamma} \| y - z \|^2 + h(y) \right) $$

This step ensures that $x_{t+1}$ remains close to $z$ while also incorporating the non-smooth term $h(y)$.
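
To make the two steps concrete, here is a minimal Python/NumPy sketch for the Lasso case $h(x) = \lambda \| x \|_1$ (the regularizer mentioned in the summary); the function names and the synthetic data are illustrative only:

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (elementwise soft-thresholding):
    # prox(z)_i = sign(z_i) * max(|z_i| - tau, 0)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_descent(A, b, lam, n_iters=500):
    # Minimize f(x) = 0.5 * ||A x - b||^2 + lam * ||x||_1.
    # g(x) = 0.5 * ||A x - b||^2 is L-smooth with L = ||A||_2^2,
    # so gamma = 1 / L is a valid step size.
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        z = x - gamma * (A.T @ (A @ x - b))   # 1. gradient step on g
        x = soft_threshold(z, gamma * lam)    # 2. proximal step on h
    return x

# Illustrative usage on synthetic data with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.round(proximal_gradient_descent(A, b, lam=0.1), 2))
```

With this choice of $h$, the update above is the classical iterative soft-thresholding algorithm (ISTA).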


Proximal Gradient Descent as a Generalization of Gradient Descent

Proximal gradient descent recovers basic gradient descent and projected gradient descent as special cases (a short code sketch follows this list):

  - If $h \equiv 0$, the proximal step does nothing, and the update reduces to plain gradient descent, $x_{t+1} = x_t - \gamma \nabla g(x_t)$.
  - If $h$ is the indicator function of a closed convex set $C$ ($h(y) = 0$ for $y \in C$, $h(y) = +\infty$ otherwise), the proximal step becomes the Euclidean projection onto $C$, and the update reduces to projected gradient descent.
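
Here is a minimal Python sketch of the two corresponding proximal operators (the box constraint is just an illustrative choice of the set $C$; the function names are not from the text):

```python
import numpy as np

def prox_zero(z, gamma):
    # h(x) = 0: the proximal step is the identity, so
    # x_{t+1} = prox(x_t - gamma * grad_g(x_t)) is plain gradient descent.
    return z

def prox_box_indicator(z, gamma, lo=-1.0, hi=1.0):
    # h(x) = indicator of the box C = [lo, hi]^n: the proximal step is the
    # Euclidean projection onto C (independent of gamma), i.e. clipping,
    # so the update becomes projected gradient descent.
    return np.clip(z, lo, hi)
```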


Convergence Rates of Proximal Gradient Descent

The convergence analysis of proximal gradient descent follows the same principles as that of gradient descent, now extended to composite objectives with a non-smooth term $h(x)$.

If $g(x)$ is convex and $L$-smooth, $h(x)$ is convex, and we set

$$ \gamma_k = \frac{1}{L}, $$

then proximal gradient descent satisfies

$$ f(x_k) - f^* \le \frac{L}{2k} \, \| x_0 - x^* \|^2. $$

This shows that proximal gradient descent converges at a rate of $O(1/k)$, similar to standard gradient descent for convex and smooth functions.


Summary

This method is widely used in sparse learning, compressed sensing, and machine learning, where non-smooth regularization terms (e.g., L1-norm in Lasso regression) play a key role in inducing sparsity.