Test 3. Deep Learning
Describe the computational model of an artificial neuron. Write its mathematical equation. Explain the roles of the input features, weights, bias, and how the neuron’s output is produced.
An artificial neuron computes a weighted sum of inputs plus a bias, then applies an activation:

$$a = \varphi\Big(\sum_{i=1}^{n} w_i x_i + b\Big)$$

- Inputs $x_i$: feature values.
- Weights $w_i$: scale each input's contribution.
- Bias $b$: allows shifting the activation threshold.
- Activation $\varphi$: nonlinearly transforms the pre-activation $z = \sum_i w_i x_i + b$ into the neuron's output $a$.
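A minimal NumPy sketch of a single neuron; the sigmoid activation and the example weights are illustrative choices, not fixed by the definition:

```python
import numpy as np

def neuron(x, w, b):
    """Single artificial neuron: weighted sum plus bias, then a sigmoid activation."""
    z = np.dot(w, x) + b                 # pre-activation: z = w . x + b
    return 1.0 / (1.0 + np.exp(-z))      # output: a = sigma(z)

# Example: three input features with hand-picked weights and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
b = 0.1
print(neuron(x, w, b))                   # a value in (0, 1)
```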
Explain the forward pass in a feedforward neural network. How are activations computed layer by layer? Illustrate with a two‑layer (one hidden, one output) network.
In a forward pass, each layer’s outputs become the next layer’s inputs:
- Hidden layer: $\mathbf{h} = \varphi_1\big(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\big)$
- Output layer: $\hat{\mathbf{y}} = \varphi_2\big(W^{(2)}\mathbf{h} + \mathbf{b}^{(2)}\big)$
- $W^{(l)}, \mathbf{b}^{(l)}$: weights and biases of layer $l$; $\varphi_l$: activation functions (e.g., ReLU in hidden, softmax in output).
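A sketch of this two-layer forward pass in NumPy; the layer sizes and random weights are placeholders:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Two-layer forward pass: each layer's output feeds the next layer."""
    h = relu(W1 @ x + b1)                # hidden activations
    return softmax(W2 @ h + b2)          # output class probabilities

rng = np.random.default_rng(0)
x = rng.normal(size=4)                            # 4 input features
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)     # hidden layer: 5 units
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)     # output layer: 3 classes
print(forward(x, W1, b1, W2, b2))                 # probabilities summing to 1
```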
Compare common activation functions—sigmoid, tanh, ReLU, and Leaky ReLU. For each, give its formula, output range, and discuss advantages and drawbacks. When might you choose one over another?
- Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, range $(0, 1)$.
  - Pros: smooth, interpretable as probability.
  - Cons: vanishing gradients for $|z| \gg 0$ (saturation).
- Tanh: $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, range $(-1, 1)$.
  - Pros: zero-centered.
  - Cons: still suffers vanishing gradients.
- ReLU: $\mathrm{ReLU}(z) = \max(0, z)$, range $[0, \infty)$.
  - Pros: sparse activation, mitigates vanishing gradient.
  - Cons: "dying ReLU" (units stuck at zero).
- Leaky ReLU: $f(z) = \max(\alpha z, z)$ with small $\alpha$ (e.g., $0.01$), range $(-\infty, \infty)$.
  - Pros: alleviates dying ReLU.
  - Cons: adds a small negative-slope hyperparameter $\alpha$.
- Choice: ReLU/Leaky ReLU in deep nets for faster training; sigmoid/tanh in output layers for bounded outputs.
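The four formulas translate directly to NumPy; this sketch evaluates each on a few sample points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # range (0, 1)

def tanh(z):
    return np.tanh(z)                         # range (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                 # range [0, inf)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)      # range (-inf, inf)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f"{f.__name__:>10}: {f(z)}")
```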
Describe gradient descent optimization. Derive the weight‑update rule for a parameter $w$ given a loss $L$, and discuss the common variants and hyperparameters.
- Update rule: $w \leftarrow w - \eta\,\dfrac{\partial L}{\partial w}$, where $\eta$ is the learning rate.
- Variants:
  - Batch GD: use all $N$ examples to compute the gradient (stable but slow).
  - Stochastic GD: update per example (noisy but fast).
  - Mini‑batch GD: update per batch of size $m$ (balance of noise and speed).
- Hyperparameters:
  - Learning rate $\eta$
  - Batch size $m$
  - Momentum, decay schedules
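A mini‑batch gradient‑descent sketch on a toy least‑squares problem; the data, learning rate, and batch size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # 200 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)      # noisy linear targets

w = np.zeros(3)       # parameters to learn
eta = 0.1             # learning rate
m = 32                # batch size

for epoch in range(50):
    perm = rng.permutation(len(X))               # reshuffle each epoch
    for i in range(0, len(X), m):
        idx = perm[i:i + m]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # dL/dw for MSE loss
        w -= eta * grad                          # update: w <- w - eta * grad

print(w)              # converges near true_w
```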
Explain the backpropagation algorithm. How does it use the chain rule to compute gradients for all weights in a multilayer network? Outline the main steps.
- Forward pass: compute and cache activations $a^{(l)}$ and pre‑activations $z^{(l)}$.
- Output error: $\delta^{(L)} = \nabla_a L \odot \varphi'\big(z^{(L)}\big)$.
- Backpropagate: for $l = L-1, \dots, 1$: $\delta^{(l)} = \big(W^{(l+1)}\big)^{\top} \delta^{(l+1)} \odot \varphi'\big(z^{(l)}\big)$.
- Gradient: $\dfrac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \big(a^{(l-1)}\big)^{\top}$.
- Update weights via gradient descent.
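These steps written out in NumPy for a small two‑layer ReLU network; the squared‑error loss and single training example are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                      # one input (column vector)
t = np.array([[1.0]])                            # target
W1, b1 = 0.5 * rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = 0.5 * rng.normal(size=(1, 5)), np.zeros((1, 1))

# Forward pass: cache pre-activations z and activations a
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)                         # ReLU hidden layer
y = W2 @ a1 + b2                                 # linear output; L = 0.5 * (y - t)^2

# Backward pass: chain rule, layer by layer
delta2 = y - t                                   # output error dL/dz2
dW2, db2 = delta2 @ a1.T, delta2
delta1 = (W2.T @ delta2) * (z1 > 0)              # propagate through W2 and ReLU'
dW1, db1 = delta1 @ x.T, delta1

# Update weights via gradient descent
eta = 0.1
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2
```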
Define the vanishing gradient problem. Why does it occur in deep networks? What strategies (architectural or algorithmic) help mitigate it?
- Vanishing gradients: gradients shrink exponentially as they backpropagate through many layers, slowing or stalling learning in early layers.
- Causes:
  - Activation derivatives $|\varphi'(z)| < 1$ (e.g., sigmoid, tanh).
  - Poor weight initialization.
- Mitigations:
- Use ReLU or its variants.
- He or Xavier initialization.
- Batch normalization.
- Residual connections (ResNets).
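As a concrete example of the initialization mitigations, a NumPy sketch of the He and Xavier schemes (the function names and layer sizes here are illustrative):

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He initialization: variance 2 / fan_in, suited to ReLU layers."""
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def xavier_init(fan_in, fan_out, rng):
    """Xavier (Glorot) initialization: variance 2 / (fan_in + fan_out), suited to tanh."""
    return rng.normal(scale=np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = he_init(256, 128, rng)
print(W.std())        # approximately sqrt(2/256) ~= 0.088
```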
Outline the typical training loop for a neural network. What are the key steps performed each epoch and per batch? How do you integrate data loading, forward pass, loss computation, backward pass, and parameter updates?
For each epoch:
- Shuffle training data.
- For each batch:
  - Load inputs $X$ and labels $y$.
  - Forward pass: compute $\hat{y} = f(X; \theta)$.
  - Compute loss: $L(\hat{y}, y)$.
  - Backward pass: compute gradients $\nabla_\theta L$.
  - Update parameters: $\theta \leftarrow \theta - \eta\, \nabla_\theta L$.
- Validation: evaluate on hold‑out set.
- Logging: track metrics, adjust learning rate or early stop.
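The same loop in PyTorch; the toy data, model shape, and hyperparameters are placeholders for a real task:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 3 features, binary labels (stand-ins for real data)
X = torch.randn(512, 3)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # reshuffles each epoch

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    model.train()
    for xb, yb in loader:             # load inputs and labels per batch
        pred = model(xb)              # forward pass
        loss = loss_fn(pred, yb)      # compute loss
        optimizer.zero_grad()
        loss.backward()               # backward pass: compute gradients
        optimizer.step()              # update parameters
    # validation on a hold-out set and metric logging would go here
```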
Describe the core components of a convolutional neural network (CNN). What is a convolutional layer? Explain the convolution operation, receptive field, pooling layers, and how CNNs exploit spatial structure.
- Convolutional layer: applies learnable filters (kernels) $K$ over input feature maps via sliding‑window dot products: $y[i, j] = \sum_m \sum_n K[m, n]\, x[i+m,\, j+n] + b$.
- Receptive field: region of the input influencing a given output unit.
- Pooling layer: down‑samples feature maps (e.g. max or average) to reduce spatial size and introduce invariance.
- Spatial exploitation: weight sharing (same filter across locations) and local connectivity capture translational features efficiently.
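A NumPy sketch of the convolution and pooling operations above; the 6×6 input and 2×2 kernel values are illustrative, and the explicit loops favor clarity over speed:

```python
import numpy as np

def conv2d(x, k, b=0.0):
    """Valid 2D convolution (cross-correlation, as in most deep-learning frameworks)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(k * x[i:i + kh, j:j + kw]) + b  # sliding-window dot product
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the max of each size x size window."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]          # crop to a multiple of the window
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)     # toy 6x6 "image"
k = np.array([[1.0, 0.0], [0.0, -1.0]])          # a 2x2 filter (learnable in practice)
fmap = conv2d(x, k)                              # 5x5 feature map
print(max_pool(fmap))                            # down-sampled 2x2 map
```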