STOCHASTIC GRADIENT DESCENT



In many optimization problems, the objective function is structured as a sum over individual components:

$$f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x),$$

where each $f_i$ represents the cost associated with one observation in a dataset of size $m$.

Computing the full gradient $\nabla f(x)$ for large datasets can be computationally expensive 😟

To address this, we use Stochastic Gradient Descent (SGD).


Stochastic Gradient Descent Algorithm

The SGD update rule is:

  1. Initialize $x_0 \in \mathbb{R}^n$

  2. For $k = 0, 1, 2, \dots$

    • Randomly sample one index $i_k \in \{1, \dots, m\}$
    • Compute the stochastic gradient $\nabla f_{i_k}(x_k)$
    • Update:
    $$x_{k+1} = x_k - \gamma_k \nabla f_{i_k}(x_k),$$

    where $\gamma_k > 0$ is the step size (learning rate).

💡 Key Idea:

Instead of using the full gradient, we update using the gradient of only one randomly chosen $f_i$ at each step!

🔹 Advantage: Faster updates
🔹 Disadvantage: Noisy updates (fluctuations)

📌 Only update with the gradient of $f_{i}$!
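As a minimal sketch of the update loop above (the least-squares objective, data, and step size here are illustrative assumptions, not from the notes):

```python
import numpy as np

# Toy problem: f(x) = (1/m) * sum_i 0.5 * (a_i^T x - b_i)^2, with
# noiseless data so x_true minimizes every f_i (assumed setup).
rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true

def grad_fi(x, i):
    """Stochastic gradient of the i-th component f_i."""
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(n)        # 1. initialize x_0
gamma = 0.01           # constant step size gamma_k (a hand-picked value)
for k in range(5000):  # 2. for k = 0, 1, 2, ...
    i = rng.integers(m)               # sample i_k uniformly from {0, ..., m-1}
    x = x - gamma * grad_fi(x, i)     # x_{k+1} = x_k - gamma * grad f_{i_k}(x_k)

print(np.linalg.norm(x - x_true))  # small residual (noisy, not exact)
```

Note that each step touches a single row of the data, which is exactly the cost advantage over the full gradient.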


Expected Gradient in SGD

We define the stochastic gradient:

$$g_k := \nabla f_{i_k}(x_k).$$

In expectation, this gives:

$$\mathbb{E}[g_k \mid x_k] = \frac{1}{m} \sum_{i=1}^{m} \nabla f_i(x_k) = \nabla f(x_k).$$

Thus, SGD provides an unbiased estimate of the true gradient.
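The unbiasedness claim can be checked numerically: with uniform sampling, averaging the per-sample gradients over all indices reproduces the full gradient exactly (the least-squares setup below is an illustrative assumption):

```python
import numpy as np

# Check E[g_k | x_k] = grad f(x_k) for f_i(x) = 0.5 * (a_i^T x - b_i)^2
# and i_k sampled uniformly from {1, ..., m}.
rng = np.random.default_rng(1)
m, n = 50, 3
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

def grad_fi(x, i):   # gradient of the i-th component f_i
    return (A[i] @ x - b[i]) * A[i]

def grad_f(x):       # full gradient of f = (1/m) * sum_i f_i
    return A.T @ (A @ x - b) / m

x = rng.normal(size=n)
avg = np.mean([grad_fi(x, i) for i in range(m)], axis=0)
print(np.allclose(avg, grad_f(x)))  # True: uniform sampling is unbiased
```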

Using the Partition Theorem (the law of total expectation) together with convexity of $f$, we get:

$$\mathbb{E}\!\left[g_k^\top (x_k - x^*)\right] = \mathbb{E}\!\left[\nabla f(x_k)^\top (x_k - x^*)\right] \ge \mathbb{E}\!\left[f(x_k) - f(x^*)\right].$$

Conclusion:
A lower bound holds in expectation!


Convergence of SGD with Bounded Gradients

Theorem
Let $f:\mathbb{R}^n \to \mathbb{R}$ be convex and differentiable with a global minimum $x^*$.
Assume:

  • $\mathbb{E}[\|g_k\|^2] \le B^2$ for all $k$,
  • $\|x_0 - x^*\| \le R$.

Choosing a constant step size

$$\gamma := \frac{R}{B\sqrt{K}},$$

the SGD error bound after $K$ iterations is:

$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\!\left[f(x_k) - f(x^*)\right] \le \frac{RB}{\sqrt{K}}.$$

🔹 Implication: SGD has a sublinear convergence rate $O(1/\sqrt{K})$.
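The step-size recipe $\gamma = R/(B\sqrt{K})$ can be tried on a toy problem. The logistic-regression setup below and the value of $R$ are assumptions for illustration; $B$ bounds $\|g_k\|$ because the logistic gradient norm is at most $\|a_i\|$:

```python
import numpy as np

# Toy logistic regression: f_i(x) = log(1 + exp(-y_i * a_i^T x)).
rng = np.random.default_rng(2)
m, n = 300, 4
A = rng.normal(size=(m, n))
y = np.sign(A @ rng.normal(size=n))  # labels in {-1, +1} (assumed data)

def grad_fi(x, i):  # gradient of log(1 + exp(-y_i * a_i^T x))
    s = -y[i] / (1.0 + np.exp(y[i] * (A[i] @ x)))
    return s * A[i]

def f(x):
    return np.mean(np.log1p(np.exp(-y * (A @ x))))

K = 20000
B = np.linalg.norm(A, axis=1).max()  # gradient-norm bound for this loss
R = 2.0                              # assumed bound on ||x_0 - x*||
gamma = R / (B * np.sqrt(K))         # the theorem's constant step size
x = np.zeros(n)
losses = []
for k in range(K):
    i = rng.integers(m)
    x -= gamma * grad_fi(x, i)
    losses.append(f(x))

# The average loss shrinks over the run, consistent with O(1/sqrt(K)):
print(np.mean(losses[:K // 4]) > np.mean(losses[-K // 4:]))  # True
```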


SGD with Strong Convexity

If $f$ is strongly convex with parameter $\mu > 0$, and we use a decreasing step size

$$\gamma_k := \frac{2}{\mu (k+1)},$$

then SGD achieves a faster rate:

$$\mathbb{E}\!\left[f\!\left(\frac{2}{K(K+1)} \sum_{k=1}^{K} k\, x_k\right) - f(x^*)\right] \le \frac{2B^2}{\mu (K+1)},$$

where $B^2 := \max_{k=1,\dots,K} \mathbb{E}[\|g_k\|^2]$.

Almost the same result as for subgradient descent, but in expectation.

🚀 Faster Convergence! 🚀
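A sketch of the strongly convex variant, including the weighted average $\bar{x} = \frac{2}{K(K+1)}\sum_{k=1}^{K} k\,x_k$ that appears in the bound (the noisy quadratic objective is an illustrative assumption):

```python
import numpy as np

# Toy objective: f(x) = 0.5 * mu * ||x - c||^2, which is mu-strongly
# convex with minimizer c; additive noise makes the gradient stochastic.
rng = np.random.default_rng(3)
mu, n, K = 1.0, 3, 5000
c = np.array([1.0, -2.0, 0.5])

def stochastic_grad(x):
    # unbiased noisy gradient of f at x (noise scale is assumed)
    return mu * (x - c) + 0.1 * rng.normal(size=n)

x = np.zeros(n)
weighted_sum = np.zeros(n)
for k in range(1, K + 1):
    gamma = 2.0 / (mu * (k + 1))   # gamma_k = 2 / (mu * (k+1))
    x = x - gamma * stochastic_grad(x)
    weighted_sum += k * x          # accumulate k * x_k

x_bar = 2.0 * weighted_sum / (K * (K + 1))  # the weighted average iterate
print(np.linalg.norm(x_bar - c))  # close to the minimizer c
```

The weighting puts more mass on later, more accurate iterates, which is what yields the $O(1/K)$ rate rather than $O(1/\sqrt{K})$.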


Stochastic Subgradient Descent

For non-differentiable functions, we replace the gradient with a subgradient:

$$g_k \in \partial f_{i_k}(x_k).$$

The update rule remains:

$$x_{k+1} = x_k - \gamma_k g_k.$$

📌 Works even when $f(x)$ is not smooth!
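A sketch for a non-smooth case: the absolute-value loss, whose subgradient at a kink can be any value in $[-1,1]$ (the data and step-size schedule below are illustrative assumptions):

```python
import numpy as np

# Non-smooth objective: f(x) = (1/m) * sum_i |a_i^T x - b_i|.
# A valid subgradient of f_i is sign(a_i^T x - b_i) * a_i.
rng = np.random.default_rng(4)
m, n = 100, 2
A = rng.normal(size=(m, n))
x_true = np.array([0.5, -1.0])
b = A @ x_true

def subgrad_fi(x, i):
    return np.sign(A[i] @ x - b[i]) * A[i]  # g_k in the subdifferential

x = np.zeros(n)
for k in range(1, 20001):
    i = rng.integers(m)
    x -= (0.5 / np.sqrt(k)) * subgrad_fi(x, i)  # decreasing step size

print(np.linalg.norm(x - x_true))  # small: converges despite non-smoothness
```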


Projected Stochastic Gradient Descent

For constrained problems, we project the update onto the feasible set X:

$$x_{k+1} = P_X\!\left(x_k - \gamma_k g_k\right),$$

where $P_X$ is the projection operator onto $X$. This is called Projected SGD.
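A sketch of Projected SGD where $X$ is a Euclidean ball, an illustrative choice for which $P_X$ has a simple closed form (data and radius are assumptions):

```python
import numpy as np

# Constraint set X = {z : ||z|| <= r}; projection rescales points outside.
rng = np.random.default_rng(5)
m, n, r = 100, 3, 1.0
A = rng.normal(size=(m, n))
b = A @ rng.normal(size=n)

def project_ball(z, r):
    nz = np.linalg.norm(z)
    return z if nz <= r else (r / nz) * z  # P_X(z)

def grad_fi(x, i):  # stochastic gradient of 0.5 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(n)
for k in range(5000):
    i = rng.integers(m)
    x = project_ball(x - 0.01 * grad_fi(x, i), r)  # x_{k+1} = P_X(x_k - gamma*g_k)

print(np.linalg.norm(x) <= r + 1e-9)  # True: every iterate stays feasible
```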


Summary

| SGD Variant | Update Rule | Convergence Rate |
|---|---|---|
| Standard SGD | $x_{k+1} = x_k - \gamma_k \nabla f_{i_k}(x_k)$ | $O(1/\sqrt{K})$ |
| Strongly Convex SGD | $x_{k+1} = x_k - \frac{2}{\mu(k+1)} \nabla f_{i_k}(x_k)$ | $O(1/K)$ |
| Subgradient SGD | $x_{k+1} = x_k - \gamma_k g_k$ | Works for non-smooth $f$ |
| Projected SGD | $x_{k+1} = P_X(x_k - \gamma_k g_k)$ | Constrained optimization |

📌 Key Takeaways:
SGD is computationally efficient 🚀
Uses random gradient updates instead of full dataset 📊
Converges in expectation (but with noise) 📉
Stronger assumptions → Faster convergence 🏎