Test 4. Regularization


Question 1

Define L1 and L2 regularization. Write their penalty terms and explain how each influences model weights. Discuss the geometric interpretation of both and when you might prefer one over the other.

Answer 1!

  • L1 penalty: λ Σ_j |w_j|
    • Encourages sparsity—many weights driven exactly to zero.
    • Geometry: diamond‑shaped constraint region, corners align with axes → solutions on axes.
  • L2 penalty: (λ/2) Σ_j w_j²
    • Encourages small but nonzero weights.
    • Geometry: circular (ellipsoidal) constraint region → smooth shrinkage.
  • Preference:
    • Use L1 when you want feature selection or interpretability.
    • Use L2 when you want to shrink weights smoothly and handle multicollinearity.
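The contrast between the two penalties shows up clearly in their proximal operators. A minimal NumPy sketch (the function names and example weights are illustrative choices, not from the text):

```python
import numpy as np

# Proximal operators of the two penalties illustrate their effect on weights:
# the L1 prox (soft-thresholding) zeroes out small weights, while the L2 prox
# only rescales them, never producing exact zeros.

def prox_l1(w, lam):
    """Soft-thresholding: argmin_z 0.5*(z - w)^2 + lam*|z|."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_l2(w, lam):
    """Shrinkage: argmin_z 0.5*(z - w)^2 + (lam/2)*z^2 = w / (1 + lam)."""
    return w / (1.0 + lam)

w = np.array([3.0, 0.5, -0.2, -2.0])
print(prox_l1(w, 1.0))   # entries smaller than lam become exactly 0
print(prox_l2(w, 1.0))   # all entries shrink smoothly, none exactly 0
```

This mirrors the geometric picture: the diamond's corners (L1) pin small coordinates to zero, while the circle (L2) shrinks all coordinates proportionally.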


Question 2

Describe Lasso regression. Provide its optimization objective. Explain how Lasso can perform variable selection and discuss methods for choosing the regularisation parameter λ.

Answer 2!

  • Objective: min_w (1/(2n)) Σ_{i=1}^{n} (y^(i) − wᵀx^(i))² + λ Σ_j |w_j|
  • Variable selection: the L1 penalty drives some coefficients exactly to zero, effectively selecting a subset of features.
  • Choosing λ:
    • Cross‑validation (e.g. k‑fold) to balance bias–variance.
    • Information criteria (AIC, BIC) when model likelihood is known.
    • Regularisation path (LARS algorithm) to inspect coefficient trajectories.
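The variable-selection behaviour can be seen in a minimal coordinate-descent sketch of Lasso (illustrative only, not a production solver; the synthetic data, seed, and λ value are all assumptions):

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - Xw||^2 + lam*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            rho = X[:, j] @ r / n            # correlation of feature j with r
            z = X[:, j] @ X[:, j] / n
            w[j] = soft_threshold(rho, lam) / z
    return w

# Toy data: only features 0 and 3 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + 0.01 * rng.normal(size=100)
w = lasso_cd(X, y, lam=0.1)
print(np.round(w, 2))  # irrelevant features are driven to exactly zero
```

Sweeping `lam` over a grid and re-fitting gives a crude regularisation path; scanning which coefficients survive at each λ is the manual version of what LARS computes efficiently.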


Question 3

Describe Ridge regression. Provide its optimization objective and derive the closed‑form solution. Discuss how Ridge addresses multicollinearity and affects the bias–variance trade‑off.

Answer 3!

  • Objective: min_w (1/(2n)) Σ_{i=1}^{n} (y^(i) − wᵀx^(i))² + (λ/2) Σ_j w_j²
  • Closed‑form: w* = (XᵀX + λI)⁻¹ Xᵀy
  • Multicollinearity: adding λI makes XᵀX + λI positive definite (hence invertible) even when features are collinear, stabilizing the estimates.
  • Bias–variance: increases bias (shrinks coefficients) but reduces variance, often lowering overall error on unseen data.
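The closed form can be checked directly in NumPy. A small sketch (the toy dataset with a deliberately duplicated column is an assumption, chosen so that XᵀX is singular without the ridge term):

```python
import numpy as np

# Ridge closed form: w = (X^T X + lam*I)^{-1} X^T y.
# With a duplicated (perfectly collinear) column, X^T X is singular and OLS
# is ill-posed, but adding lam*I makes the system solvable and stable.

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x])                 # two identical (collinear) features
y = 3 * x[:, 0] + 0.1 * rng.normal(size=50)

lam = 1.0
d = X.shape[1]
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.round(w, 2))  # the weight of ~3 is split evenly across both columns
```

Note that `np.linalg.solve` is preferred over forming the inverse explicitly; the equation is the same, the computation is just better conditioned.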


Question 4

Explain weight decay in the context of neural network training. Show how weight decay modifies the standard gradient descent update. How is weight decay equivalent to L2 regularisation, and what implementation differences should you be aware of?

Answer 4!

  • Standard GD: w ← w − η ∇_w L
  • With weight decay: w ← w − η(∇_w L + λw) = (1 − ηλ)w − η ∇_w L
  • Equivalence: the extra λw term is exactly the gradient of the L2 penalty (λ/2)‖w‖².
  • Implementation:
    • Some optimizers (e.g. AdamW) decouple weight decay from adaptive learning rates to avoid bias.
    • Ensure you apply decay to weights only, not to biases or batch‑norm parameters.
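For plain SGD the equivalence can be verified numerically. A minimal sketch (the learning rate, decay strength, and gradient values are arbitrary illustrative numbers):

```python
import numpy as np

# For vanilla SGD, adding the gradient of the L2 penalty (lam*w) to the loss
# gradient gives the same update as first multiplying w by (1 - eta*lam).
eta, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, -0.1, 0.2])   # hypothetical loss gradient

w_l2 = w - eta * (grad + lam * w)          # L2-regularized gradient step
w_wd = (1 - eta * lam) * w - eta * grad    # weight-decay formulation
print(np.allclose(w_l2, w_wd))  # True
```

This identity breaks down for adaptive optimizers like Adam, where the λw term gets rescaled by the per-parameter learning rates; that is exactly the mismatch AdamW's decoupled decay avoids.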


Question 5

Describe dropout regularisation. For a given layer, explain what happens during training and inference. What problem does dropout address, and how does it approximate model averaging? How do you choose an appropriate dropout rate?

Answer 5!

  • Training: randomly “drop” each neuron with probability p, i.e., multiply each activation by a Bernoulli mask m_i ~ Bernoulli(1 − p).
  • Inference: scale activations by 1 − p (or use inverted dropout, dividing by 1 − p at training time, so inference needs no scaling at all).
  • Addresses: co‑adaptation of neurons, reduces overfitting by forcing redundant representations.
  • Model averaging: dropout samples a subnetwork each batch; inference approximates averaging over all 2^n possible subnetworks.
  • Choosing p: common values 0.2–0.5; tune via validation—higher for large fully‑connected layers, lower for convolutional layers.
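A minimal sketch of the inverted-dropout forward pass described above (the layer size and rate are illustrative):

```python
import numpy as np

def dropout_forward(a, p, training, rng):
    """Inverted dropout: drop with probability p during training,
    identity at inference time (no rescaling needed)."""
    if not training:
        return a
    mask = rng.random(a.shape) >= p     # keep each unit with prob 1 - p
    return a * mask / (1.0 - p)         # rescale so E[output] equals a

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
out = dropout_forward(a, p=0.5, training=True, rng=rng)
print(out.mean())   # close to 1.0: the expectation is preserved
```

The division by 1 − p during training is what makes the inference path a plain identity, which is why most frameworks implement this variant.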


Question 6

Explain data augmentation as a regularisation technique. Give examples for image and text data. How does augmentation reduce overfitting, and how do you integrate it into the training pipeline?

Answer 6!

  • Concept: synthetically increase dataset size by applying label‑preserving transformations.
  • Image examples: random crops, flips, rotations, color jitter, Gaussian noise.
  • Text examples: synonym replacement, random insertion/deletion, back‑translation.
  • Effect: exposes model to varied inputs, making it more robust and reducing overfitting.
  • Integration: apply augmentations on‑the‑fly during data loading (e.g., in PyTorch Dataset or TensorFlow tf.data), ensuring each epoch sees new variants.
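A toy on-the-fly pipeline with NumPy-only transforms (the 8×8 image size, padding amount, and generator structure are assumptions made for illustration, not a specific framework's API):

```python
import numpy as np

def augment(img, rng):
    """Label-preserving transforms: random horizontal flip + random crop."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                  # horizontal flip
    padded = np.pad(img, 2, mode="reflect") # pad 8x8 -> 12x12
    top, left = rng.integers(0, 5, size=2)  # random 8x8 crop window
    return padded[top:top + 8, left:left + 8]

def batches(images, rng):
    """Yield a freshly augmented copy of each image, so every epoch
    sees new variants (on-the-fly augmentation)."""
    for img in images:
        yield augment(img, rng)

rng = np.random.default_rng(0)
imgs = [np.arange(64, dtype=float).reshape(8, 8)]
aug = next(batches(imgs, rng))
print(aug.shape)  # (8, 8): same shape and label, different pixels each epoch
```

The same generator pattern is what a PyTorch `Dataset.__getitem__` or a `tf.data.Dataset.map` call implements: the stored data stays fixed while each draw applies a fresh random transform.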