Proximal GD Idea Explained


Related to:

Classical Gradient Step for Minimizing g(x)

If we were minimizing only $g(x)$, the standard gradient descent step can be derived by minimizing a first-order approximation of $g$ around the current iterate, regularized by a quadratic proximity term:

$$x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( g(x_t) + \nabla g(x_t)^T (y - x_t) + \frac{1}{2\gamma} \|y - x_t\|^2 \right).$$

Here:

- $x_t$ is the current iterate,
- $\nabla g(x_t)$ is the gradient of $g$ at $x_t$,
- $\gamma > 0$ is the step size.

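As a quick sanity check (not spelled out in the original derivation), setting the gradient of this quadratic model with respect to $y$ to zero recovers the familiar gradient descent update:

$$\nabla g(x_t) + \frac{1}{\gamma}(y - x_t) = 0 \quad \Longrightarrow \quad x_{t+1} = x_t - \gamma \nabla g(x_t).$$
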
Adding the h(y) Term

When we introduce $h(y)$, which is possibly non-smooth, we keep the quadratic model of $g$ unchanged and simply add $h(y)$ to the objective:

$$x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( g(x_t) + \nabla g(x_t)^T (y - x_t) + \frac{1}{2\gamma} \|y - x_t\|^2 + h(y) \right).$$

Now, the objective consists of:

  1. The linear approximation of $g$ around $x_t$.
  2. A quadratic proximity term that keeps $y$ close to $x_t$.
  3. The non-smooth function $h(y)$.
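
Before deriving the closed form of this update, here is a minimal runnable Python sketch of the resulting method. It assumes $h(y) = \lambda \|y\|_1$, whose proximal step is soft-thresholding; the toy lasso problem and names like `soft_threshold` and `proximal_gradient` are illustrative, not from the original text:

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1: shrinks each coordinate toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(grad_g, prox_h, x0, gamma, iters=500):
    # Iterates x_{t+1} = prox_{gamma h}( x_t - gamma * grad_g(x_t) ).
    x = x0
    for _ in range(iters):
        x = prox_h(x - gamma * grad_g(x), gamma)
    return x

# Toy lasso problem: g(x) = 0.5 * ||Ax - b||^2, h(x) = lam * ||x||_1.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)
lam = 0.1

grad_g = lambda x: A.T @ (A @ x - b)
prox_h = lambda v, step: soft_threshold(v, step * lam)

gamma = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz constant of grad_g
x_hat = proximal_gradient(grad_g, prox_h, np.zeros(20), gamma)
```

Any other $h$ with a cheap proximal step would slot into `prox_h` the same way.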

Completing the Square

Since $g(x_t)$ is independent of $y$, we can ignore it in the minimization problem. The key step is recognizing that the first two remaining terms can be rewritten using a squared norm:

$$\nabla g(x_t)^T (y - x_t) + \frac{1}{2\gamma} \|y - x_t\|^2$$

This is the first-order term of $g$ plus the quadratic regularizer, and up to an additive constant it can be rewritten in the more compact form:

$$\frac{1}{2\gamma} \left\| y - \left( x_t - \gamma \nabla g(x_t) \right) \right\|^2$$

Understanding the Completing-the-Square Step

We need to rewrite the quadratic expression:

$$\nabla g(x_t)^T (y - x_t) + \frac{1}{2\gamma} \|y - x_t\|^2$$

into a squared norm form plus a constant term.
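
The whole calculation below is an instance of completing the square; as a compact preview (with the shorthand $z = y - x_t$ and $b = \nabla g(x_t)$, introduced here only for illustration):

$$b^T z + \frac{1}{2\gamma} \|z\|^2 = \frac{1}{2\gamma} \|z + \gamma b\|^2 - \frac{\gamma}{2} \|b\|^2.$$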


Step-by-Step Breakdown

1. Expand the squared norm

The squared Euclidean norm is given by:

$$\|y - x_t\|^2 = (y - x_t)^T (y - x_t)$$

So, we rewrite the term:

$$\frac{1}{2\gamma} \|y - x_t\|^2 = \frac{1}{2\gamma} (y - x_t)^T (y - x_t)$$

2. Introduce a shift using $\gamma \nabla g(x_t)$

We want to introduce a shifted term of the form $y - (x_t - \gamma \nabla g(x_t))$, so we add and subtract $\gamma \nabla g(x_t)$ cleverly.

Observe that:

$$\nabla g(x_t)^T (y - x_t) = \frac{1}{\gamma} \, \gamma \nabla g(x_t)^T (y - x_t)$$

This suggests rewriting the term as:

$$\nabla g(x_t)^T (y - x_t) = \frac{1}{\gamma} (y - x_t)^T \left( \gamma \nabla g(x_t) \right)$$

3. Expand the shifted squared norm

Consider the squared term:

$$\left\| y - (x_t - \gamma \nabla g(x_t)) \right\|^2$$

Expanding it:

$$\left( y - (x_t - \gamma \nabla g(x_t)) \right)^T \left( y - (x_t - \gamma \nabla g(x_t)) \right)$$

Breaking it down:

$$\left( y - x_t + \gamma \nabla g(x_t) \right)^T \left( y - x_t + \gamma \nabla g(x_t) \right)$$

Expanding using the identity $(a+b)^T(a+b) = a^T a + 2 a^T b + b^T b$ with $a = y - x_t$ and $b = \gamma \nabla g(x_t)$:

$$\|y - x_t\|^2 + 2\gamma \, \nabla g(x_t)^T (y - x_t) + \gamma^2 \|\nabla g(x_t)\|^2$$

Dividing everything by $2\gamma$:

$$\frac{1}{2\gamma} \|y - x_t\|^2 + \nabla g(x_t)^T (y - x_t) + \frac{\gamma}{2} \|\nabla g(x_t)\|^2$$

Thus:

$$\frac{1}{2\gamma} \|y - x_t\|^2 + \nabla g(x_t)^T (y - x_t) + \frac{\gamma}{2} \|\nabla g(x_t)\|^2 = \frac{1}{2\gamma} \left\| y - (x_t - \gamma \nabla g(x_t)) \right\|^2$$

and, moving the constant to the right-hand side,

$$\frac{1}{2\gamma} \|y - x_t\|^2 + \nabla g(x_t)^T (y - x_t) = \frac{1}{2\gamma} \left\| y - (x_t - \gamma \nabla g(x_t)) \right\|^2 - \frac{\gamma}{2} \|\nabla g(x_t)\|^2$$

4. Rearrange the expression

Comparing with the original form:

$$\nabla g(x_t)^T (y - x_t) + \frac{1}{2\gamma} \|y - x_t\|^2$$

We get:

$$\frac{1}{2\gamma} \left\| y - (x_t - \gamma \nabla g(x_t)) \right\|^2 - \frac{\gamma}{2} \|\nabla g(x_t)\|^2$$

This is the final transformed expression!
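
As an optional numerical sanity check of this identity, here is a tiny Python sketch (the vectors and the value of $\gamma$ are arbitrary stand-ins, not from the derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
y = rng.normal(size=n)
x_t = rng.normal(size=n)
g_grad = rng.normal(size=n)  # stands in for grad g(x_t)
gamma = 0.3

lhs = g_grad @ (y - x_t) + np.linalg.norm(y - x_t) ** 2 / (2 * gamma)
rhs = (np.linalg.norm(y - (x_t - gamma * g_grad)) ** 2 / (2 * gamma)
       - gamma / 2 * np.linalg.norm(g_grad) ** 2)
print(np.isclose(lhs, rhs))  # True: both sides agree
```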

BUT! The final formula looks like this:

$$x_{t+1} = \arg\min_{y \in \mathbb{R}^n} \left( \frac{1}{2\gamma} \left\| y - (x_t - \gamma \nabla g(x_t)) \right\|^2 + h(y) \right)$$

This is because the term $\frac{\gamma}{2} \|\nabla g(x_t)\|^2$ is a constant with respect to $y$ and can be ignored in the minimization problem :)
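
This minimization is exactly the standard proximal operator of $h$, which gives the method its name:

$$\operatorname{prox}_{\gamma h}(v) = \arg\min_{y \in \mathbb{R}^n} \left( \frac{1}{2\gamma} \|y - v\|^2 + h(y) \right), \qquad x_{t+1} = \operatorname{prox}_{\gamma h}\!\left( x_t - \gamma \nabla g(x_t) \right).$$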