Stochastic Gradient Descent (SGD)
In many optimization problems, the objective function is structured as a sum over individual components:

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$$

where each $f_i$ typically corresponds to the loss on a single data point.

Computing the full gradient $\nabla f(x) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x)$ requires a pass over all $n$ components, which is expensive when $n$ is large.
To address this, we use Stochastic Gradient Descent (SGD).
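For concreteness, here is a hypothetical least-squares instance of this finite-sum structure (not from the notes): each $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$ is the loss on one data point, and the full gradient is the average of all $n$ per-component gradients.

```python
import numpy as np

# Hypothetical least-squares example: f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))   # one row a_i per component f_i
b = rng.normal(size=n)

def grad_fi(x, i):
    """Gradient of a single component f_i at x."""
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Full gradient: average of all n per-component gradients (O(n) work)."""
    return (A @ x - b) @ A / n

x = np.zeros(d)
# The full gradient equals the average of the n component gradients.
print(np.allclose(full_grad(x), np.mean([grad_fi(x, i) for i in range(n)], axis=0)))
```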
Stochastic Gradient Descent Algorithm
The SGD update rule is:

$$x_{t+1} = x_t - \eta_t\, \nabla f_{i_t}(x_t).$$

- Initialize $x_0$.
- For $t = 0, 1, 2, \dots$:
  - Randomly sample one index $i_t \in \{1, \dots, n\}$ uniformly at random.
  - Compute the stochastic gradient $\nabla f_{i_t}(x_t)$.
  - Update: $x_{t+1} = x_t - \eta_t\, \nabla f_{i_t}(x_t)$,

where $\eta_t > 0$ is the step size (learning rate).
💡 Key Idea:
Instead of using the full gradient, we update using only one randomly sampled component gradient $\nabla f_{i_t}(x_t)$.
🔹 Advantage: Faster updates (each iteration evaluates a single component gradient instead of all $n$)
🔹 Disadvantage: Noisy updates (fluctuations)
📌 Each update uses only the gradient of a single component $f_{i_t}$, not the full sum.
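A minimal sketch of this loop in Python, reusing the hypothetical least-squares components from the snippet above (the step size and iteration count are illustrative, not prescribed by the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_fi(x, i):
    """Gradient of a single least-squares component f_i at x."""
    return (A[i] @ x - b[i]) * A[i]

def sgd(x0, eta, T):
    """Plain SGD: at each step, sample one index and move along -grad f_i."""
    x = x0.copy()
    for t in range(T):
        i = rng.integers(n)          # sample i_t uniformly from {0, ..., n-1}
        x -= eta * grad_fi(x, i)     # x_{t+1} = x_t - eta * grad f_{i_t}(x_t)
    return x

x_hat = sgd(np.zeros(d), eta=0.01, T=5000)
print(0.5 * np.mean((A @ x_hat - b) ** 2))   # objective value at the SGD iterate
```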
Expected Gradient in SGD
We define the stochastic gradient:

$$g_t = \nabla f_{i_t}(x_t), \qquad i_t \sim \mathrm{Uniform}\{1, \dots, n\}.$$

On expectation, this gives:

$$\mathbb{E}[g_t \mid x_t] = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x_t) = \nabla f(x_t).$$

Thus, SGD provides an unbiased estimate of the true gradient.

Using the Partition Theorem (law of total expectation) together with convexity of $f$, we get:

$$\mathbb{E}\big[\langle g_t, x_t - x^* \rangle\big] = \mathbb{E}\big[\langle \nabla f(x_t), x_t - x^* \rangle\big] \ge \mathbb{E}\big[f(x_t) - f(x^*)\big].$$

✔ Conclusion:
The convexity lower bound holds in expectation!
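A quick numerical check of this unbiasedness claim on the same hypothetical least-squares setup: averaging many independently sampled stochastic gradients should recover the full gradient.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

full = (A @ x - b) @ A / n                         # true gradient of f at x
idx = rng.integers(n, size=200_000)                # many uniform samples of i_t
stoch = np.mean((A[idx] @ x - b[idx])[:, None] * A[idx], axis=0)  # empirical E[g_t]
print(np.linalg.norm(stoch - full))                # small -> unbiased in expectation
```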
Convergence of SGD with Bounded Gradients
Theorem
Let $f$ be convex with minimizer $x^*$, and suppose $\|x_0 - x^*\| \le R$.

Assume:
- The stochastic gradient is bounded: $\mathbb{E}\|g_t\|^2 \le B^2$.

Choosing a constant step size:

$$\eta = \frac{R}{B\sqrt{T}},$$

the SGD error bound is:

$$\mathbb{E}\big[f(\bar{x}_T)\big] - f(x^*) \le \frac{RB}{\sqrt{T}}, \qquad \bar{x}_T = \frac{1}{T}\sum_{t=0}^{T-1} x_t.$$

🔹 Implication: SGD has a sublinear convergence rate of $O(1/\sqrt{T})$.
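A sketch of where the $O(1/\sqrt{T})$ rate comes from (the standard expansion argument; the constants assume $\eta = \frac{R}{B\sqrt{T}}$ and $\|x_0 - x^*\| \le R$). Expanding $\|x_{t+1} - x^*\|^2$, taking expectations, and using unbiasedness plus convexity gives

$$\mathbb{E}\|x_{t+1} - x^*\|^2 \le \mathbb{E}\|x_t - x^*\|^2 - 2\eta\, \mathbb{E}\big[f(x_t) - f(x^*)\big] + \eta^2 B^2.$$

Summing over $t = 0, \dots, T-1$, telescoping, and applying Jensen's inequality to the averaged iterate $\bar{x}_T$:

$$\mathbb{E}\big[f(\bar{x}_T)\big] - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2\eta T} + \frac{\eta B^2}{2} \le \frac{RB}{\sqrt{T}}.$$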
SGD with Strong Convexity

If $f$ is $\mu$-strongly convex and we use the decreasing step size $\eta_t = \frac{1}{\mu t}$,

then SGD achieves a faster rate:

$$\mathbb{E}\big[f(\bar{x}_T)\big] - f(x^*) \le O\!\left(\frac{B^2}{\mu T}\right),$$

where $\bar{x}_T$ is an average of the iterates and $B$ bounds the stochastic gradients.

Almost the same result as for subgradient descent, but in expectation.

🚀 Faster Convergence! 🚀
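A minimal sketch of the corresponding change in code, assuming the decreasing schedule $\eta_t = \frac{1}{\mu(t+1)}$; `grad_fi` and `sample_index` are hypothetical helpers supplied by the caller, not defined in the notes:

```python
def sgd_strongly_convex(x0, mu, T, grad_fi, sample_index):
    """SGD with the decreasing step size eta_t = 1 / (mu * (t + 1))."""
    x = x0.copy()
    for t in range(T):
        i = sample_index()                 # sample i_t uniformly at random
        eta_t = 1.0 / (mu * (t + 1))       # step size decays like 1/t
        x -= eta_t * grad_fi(x, i)         # same update rule, smaller steps over time
    return x
```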
Stochastic Subgradient Descent
For non-differentiable functions, we replace the gradient with a subgradient $g_{i_t} \in \partial f_{i_t}(x_t)$.

The update rule remains:

$$x_{t+1} = x_t - \eta_t\, g_{i_t}, \qquad g_{i_t} \in \partial f_{i_t}(x_t).$$

📌 Works even when $f_i$ is not differentiable.
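A minimal sketch for a non-smooth case, using a hypothetical absolute-loss component $f_i(x) = |a_i^\top x - b_i|$ (not from the notes); `np.sign` yields a valid subgradient, and the update is otherwise unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def subgrad_fi(x, i):
    """A subgradient of the non-smooth component f_i(x) = |a_i^T x - b_i|."""
    return np.sign(A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
for t in range(5000):
    i = rng.integers(n)
    x -= 0.01 / np.sqrt(t + 1) * subgrad_fi(x, i)   # same update, subgradient in place of gradient
print(np.mean(np.abs(A @ x - b)))   # objective value at the final iterate
```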
Projected Stochastic Gradient Descent
For constrained problems, we project the update onto the feasible set $\mathcal{X}$:

$$x_{t+1} = \Pi_{\mathcal{X}}\!\left(x_t - \eta_t\, g_t\right),$$

where $\Pi_{\mathcal{X}}(y) = \arg\min_{x \in \mathcal{X}} \|x - y\|$ denotes the Euclidean projection onto $\mathcal{X}$.
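A minimal sketch of the projected step, assuming the feasible set is a Euclidean ball of radius $r$ (a hypothetical choice where the projection has a simple closed form):

```python
import numpy as np

def project_onto_ball(x, r):
    """Euclidean projection onto {x : ||x|| <= r}: rescale if x lies outside."""
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

def projected_sgd_step(x, eta, g, r):
    """One projected SGD step: gradient move, then projection back onto the set."""
    return project_onto_ball(x - eta * g, r)

# Tiny usage example with a made-up stochastic gradient.
x = np.array([2.0, -1.0])
g = np.array([0.5, 0.3])
print(projected_sgd_step(x, eta=0.1, g=g, r=1.0))
```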
Summary
| SGD Variant | Update Rule | Convergence Rate |
|---|---|---|
| Standard SGD | $x_{t+1} = x_t - \eta\, \nabla f_{i_t}(x_t)$ | $O(1/\sqrt{T})$ |
| Strongly Convex SGD | $x_{t+1} = x_t - \eta_t\, \nabla f_{i_t}(x_t)$, $\eta_t = \frac{1}{\mu t}$ | $O(1/T)$ |
| Subgradient SGD | $x_{t+1} = x_t - \eta_t\, g_{i_t}$, $g_{i_t} \in \partial f_{i_t}(x_t)$ | Works for non-smooth $f$ |
| Projected SGD | $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t\, g_t)$ | Constrained optimization |
📌 Key Takeaways:
✔ SGD is computationally efficient 🚀
✔ Uses random gradient updates instead of full dataset 📊
✔ Converges in expectation (but with noise) 📉
✔ Stronger assumptions → Faster convergence 🏎