STOCHASTIC GRADIENT DESCENT



In many optimization problems, the objective function is structured as a sum over individual components:

$$f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x),$$

where each $f_i$ represents the cost associated with one observation in a dataset of size $m$.

Computing the full gradient $\nabla f(x)$ for large datasets can be computationally expensive 😟

To address this, we use Stochastic Gradient Descent (SGD).


Stochastic Gradient Descent Algorithm

The SGD update rule is:

  1. Initialize $x_0 \in \mathbb{R}^n$

  2. For $k = 0, 1, 2, \dots$

    • Randomly sample one index $i_k \in \{1, \dots, m\}$
    • Compute the stochastic gradient $\nabla f_{i_k}(x_k)$
    • Update:
    $$x_{k+1} = x_k - \gamma_k \nabla f_{i_k}(x_k),$$

    where $\gamma_k > 0$ is the step size (learning rate).

💡 Key Idea:

Instead of using the full gradient, we update using the gradient of only one randomly chosen $f_i$ at each step!

🔹 Advantage: Faster updates
🔹 Disadvantage: Noisy updates (fluctuations)

📌 Only update with the gradient of $f_{i}$!
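As a minimal sketch of the update loop above (the least-squares objective, data, and step size here are illustrative assumptions, not from the notes):

```python
import numpy as np

# Toy problem: f(x) = (1/m) * sum_i 0.5 * (a_i^T x - b_i)^2, with
# noiseless data so x_true minimizes every f_i (assumed setup).
rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true

def grad_fi(x, i):
    """Stochastic gradient of the i-th component f_i."""
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(n)        # 1. initialize x_0
gamma = 0.01           # constant step size gamma_k (a hand-picked value)
for k in range(5000):  # 2. for k = 0, 1, 2, ...
    i = rng.integers(m)               # sample i_k uniformly from {0, ..., m-1}
    x = x - gamma * grad_fi(x, i)     # x_{k+1} = x_k - gamma * grad f_{i_k}(x_k)

print(np.linalg.norm(x - x_true))  # small residual (noisy, not exact)
```

Note that each step touches a single row of the data, which is exactly the cost advantage over the full gradient.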


Expected Gradient in SGD

We define the stochastic gradient:

$$g_k := \nabla f_{i_k}(x_k).$$

In expectation, this gives:

$$\mathbb{E}[g_k \mid x_k] = \frac{1}{m} \sum_{i=1}^{m} \nabla f_i(x_k) = \nabla f(x_k).$$

Thus, SGD provides an unbiased estimate of the true gradient.
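The unbiasedness claim can be checked numerically: with uniform sampling, averaging the per-sample gradients over all indices reproduces the full gradient exactly (the least-squares setup below is an illustrative assumption):

```python
import numpy as np

# Check E[g_k | x_k] = grad f(x_k) for f_i(x) = 0.5 * (a_i^T x - b_i)^2
# and i_k sampled uniformly from {1, ..., m}.
rng = np.random.default_rng(1)
m, n = 50, 3
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

def grad_fi(x, i):   # gradient of the i-th component f_i
    return (A[i] @ x - b[i]) * A[i]

def grad_f(x):       # full gradient of f = (1/m) * sum_i f_i
    return A.T @ (A @ x - b) / m

x = rng.normal(size=n)
avg = np.mean([grad_fi(x, i) for i in range(m)], axis=0)
print(np.allclose(avg, grad_f(x)))  # True: uniform sampling is unbiased
```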

Using the Partition Theorem (the law of total expectation) together with convexity of $f$, we get:

$$\mathbb{E}\!\left[g_k^\top (x_k - x^*)\right] = \mathbb{E}\!\left[\nabla f(x_k)^\top (x_k - x^*)\right] \ge \mathbb{E}\!\left[f(x_k) - f(x^*)\right].$$

Conclusion:
A lower bound holds in expectation!


Convergence of SGD with Bounded Gradients

Theorem
Let $f:\mathbb{R}^n \to \mathbb{R}$ be convex and differentiable with a global minimum $x^*$.
Assume:

  • $\mathbb{E}[\|g_k\|^2] \le B^2$ for all $k$,
  • $\|x_0 - x^*\| \le R$.

Choosing a constant step size

$$\gamma := \frac{R}{B\sqrt{K}},$$

the SGD error bound after $K$ iterations is:

$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\!\left[f(x_k) - f(x^*)\right] \le \frac{RB}{\sqrt{K}}.$$

🔹 Implication: SGD has a sublinear convergence rate $O(1/\sqrt{K})$.
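The step-size recipe $\gamma = R/(B\sqrt{K})$ can be tried on a toy problem. The logistic-regression setup below and the value of $R$ are assumptions for illustration; $B$ bounds $\|g_k\|$ because the logistic gradient norm is at most $\|a_i\|$:

```python
import numpy as np

# Toy logistic regression: f_i(x) = log(1 + exp(-y_i * a_i^T x)).
rng = np.random.default_rng(2)
m, n = 300, 4
A = rng.normal(size=(m, n))
y = np.sign(A @ rng.normal(size=n))  # labels in {-1, +1} (assumed data)

def grad_fi(x, i):  # gradient of log(1 + exp(-y_i * a_i^T x))
    s = -y[i] / (1.0 + np.exp(y[i] * (A[i] @ x)))
    return s * A[i]

def f(x):
    return np.mean(np.log1p(np.exp(-y * (A @ x))))

K = 20000
B = np.linalg.norm(A, axis=1).max()  # gradient-norm bound for this loss
R = 2.0                              # assumed bound on ||x_0 - x*||
gamma = R / (B * np.sqrt(K))         # the theorem's constant step size
x = np.zeros(n)
losses = []
for k in range(K):
    i = rng.integers(m)
    x -= gamma * grad_fi(x, i)
    losses.append(f(x))

# The average loss shrinks over the run, consistent with O(1/sqrt(K)):
print(np.mean(losses[:K // 4]) > np.mean(losses[-K // 4:]))  # True
```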


SGD with Strong Convexity

If $f$ is strongly convex with parameter $\mu > 0$, and we use a decreasing step size

$$\gamma_k := \frac{2}{\mu (k+1)},$$

then SGD achieves a faster rate:

$$\mathbb{E}\!\left[f\!\left(\frac{2}{K(K+1)} \sum_{k=1}^{K} k\, x_k\right) - f(x^*)\right] \le \frac{2B^2}{\mu (K+1)},$$

where $B^2 := \max_{k=1,\dots,K} \mathbb{E}[\|g_k\|^2]$.

Almost the same result as for subgradient descent, but in expectation.

🚀 Faster Convergence! 🚀
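A sketch of the strongly convex variant, including the weighted average $\bar{x} = \frac{2}{K(K+1)}\sum_{k=1}^{K} k\,x_k$ that appears in the bound (the noisy quadratic objective is an illustrative assumption):

```python
import numpy as np

# Toy objective: f(x) = 0.5 * mu * ||x - c||^2, which is mu-strongly
# convex with minimizer c; additive noise makes the gradient stochastic.
rng = np.random.default_rng(3)
mu, n, K = 1.0, 3, 5000
c = np.array([1.0, -2.0, 0.5])

def stochastic_grad(x):
    # unbiased noisy gradient of f at x (noise scale is assumed)
    return mu * (x - c) + 0.1 * rng.normal(size=n)

x = np.zeros(n)
weighted_sum = np.zeros(n)
for k in range(1, K + 1):
    gamma = 2.0 / (mu * (k + 1))   # gamma_k = 2 / (mu * (k+1))
    x = x - gamma * stochastic_grad(x)
    weighted_sum += k * x          # accumulate k * x_k

x_bar = 2.0 * weighted_sum / (K * (K + 1))  # the weighted average iterate
print(np.linalg.norm(x_bar - c))  # close to the minimizer c
```

The weighting puts more mass on later, more accurate iterates, which is what yields the $O(1/K)$ rate rather than $O(1/\sqrt{K})$.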


Stochastic Subgradient Descent

For non-differentiable functions, we replace the gradient with a subgradient:

$$g_k \in \partial f_{i_k}(x_k).$$

The update rule remains:

$$x_{k+1} = x_k - \gamma_k g_k.$$

📌 Works even when $f(x)$ is not smooth!
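A sketch for a non-smooth case: the absolute-value loss, whose subgradient at a kink can be any value in $[-1,1]$ (the data and step-size schedule below are illustrative assumptions):

```python
import numpy as np

# Non-smooth objective: f(x) = (1/m) * sum_i |a_i^T x - b_i|.
# A valid subgradient of f_i is sign(a_i^T x - b_i) * a_i.
rng = np.random.default_rng(4)
m, n = 100, 2
A = rng.normal(size=(m, n))
x_true = np.array([0.5, -1.0])
b = A @ x_true

def subgrad_fi(x, i):
    return np.sign(A[i] @ x - b[i]) * A[i]  # g_k in the subdifferential

x = np.zeros(n)
for k in range(1, 20001):
    i = rng.integers(m)
    x -= (0.5 / np.sqrt(k)) * subgrad_fi(x, i)  # decreasing step size

print(np.linalg.norm(x - x_true))  # small: converges despite non-smoothness
```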


Projected Stochastic Gradient Descent

For constrained problems, we project the update onto the feasible set X:

$$x_{k+1} = P_X\!\left(x_k - \gamma_k g_k\right),$$

where $P_X$ is the projection operator onto $X$. This is called Projected SGD.
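A sketch of Projected SGD where $X$ is a Euclidean ball, an illustrative choice for which $P_X$ has a simple closed form (data and radius are assumptions):

```python
import numpy as np

# Constraint set X = {z : ||z|| <= r}; projection rescales points outside.
rng = np.random.default_rng(5)
m, n, r = 100, 3, 1.0
A = rng.normal(size=(m, n))
b = A @ rng.normal(size=n)

def project_ball(z, r):
    nz = np.linalg.norm(z)
    return z if nz <= r else (r / nz) * z  # P_X(z)

def grad_fi(x, i):  # stochastic gradient of 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(n)
for k in range(5000):
    i = rng.integers(m)
    x = project_ball(x - 0.01 * grad_fi(x, i), r)  # x_{k+1} = P_X(x_k - gamma*g_k)

print(np.linalg.norm(x) <= r + 1e-9)  # True: every iterate stays feasible
```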


Summary

| SGD Variant | Update Rule | Convergence Rate |
|---|---|---|
| Standard SGD | $x_{k+1} = x_k - \gamma_k \nabla f_{i_k}(x_k)$ | $O(1/\sqrt{K})$ |
| Strongly Convex SGD | $x_{k+1} = x_k - \frac{2}{\mu(k+1)} \nabla f_{i_k}(x_k)$ | $O(1/K)$ |
| Subgradient SGD | $x_{k+1} = x_k - \gamma_k g_k$ | Works for non-smooth $f$ |
| Projected SGD | $x_{k+1} = P_X(x_k - \gamma_k g_k)$ | Constrained optimization |

📌 Key Takeaways:
SGD is computationally efficient 🚀
Uses random gradient updates instead of full dataset 📊
Converges in expectation (but with noise) 📉
Stronger assumptions → Faster convergence 🏎