Test 1. Supervised Learning
Explain the k‑Nearest Neighbour (k‑NN) algorithm. What are its main hyperparameters? How does the choice of distance metric and feature scaling affect its performance? Give an example of a scenario where k‑NN might perform poorly.
The k‑NN algorithm classifies a query point by finding the k closest training examples (according to some distance metric) and taking a majority vote (classification) or average (regression).
- Hyperparameters:
- k (number of neighbours)
- Distance metric (e.g. Euclidean, Manhattan, Minkowski)
- Weighting scheme (uniform vs. distance‑weighted)
- Feature scaling (e.g. standardization, min–max) is crucial: unscaled features with larger ranges dominate distance calculations.
- Distance metric choice affects sensitivity to outliers (Manhattan is more robust than Euclidean) and to correlated features (Mahalanobis accounts for correlation; Euclidean does not).
- Poor scenario: high‑dimensional sparse data (the “curse of dimensionality”): distances concentrate, so neighbours become nearly equidistant and the vote carries little signal. A minimal pipeline illustrating the hyperparameters and scaling is sketched below.
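A minimal scikit-learn sketch of the above (synthetic data; the specific hyperparameter values are illustrative, not recommendations). Scaling happens inside the pipeline so no wide-range feature dominates the distance computation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = make_pipeline(
    StandardScaler(),                 # feature scaling before distances
    KNeighborsClassifier(
        n_neighbors=5,                # k
        metric="minkowski", p=2,      # p=2 -> Euclidean, p=1 -> Manhattan
        weights="distance",           # distance-weighted voting
    ),
)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))      # hold-out accuracy
```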
Write down the hypothesis function and cost function for ordinary least squares linear regression. What assumptions underlie this model? How do you evaluate the quality of a fitted linear regression model?
- Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = \theta^\top x$ (with $x_0 = 1$ for the intercept).
- Cost (MSE): $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2$ (often written with a factor $\frac{1}{2m}$ to simplify gradient derivations).
- Assumptions:
- Linearity: true relationship is linear in parameters.
- Homoscedasticity: constant variance of errors.
- Independence of errors.
- Normality of error distribution (for inference).
- Evaluation:
- R² (coefficient of determination)
- RMSE or MAE on hold‑out data
- Residual analysis for pattern/heteroscedasticity
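As a concrete sketch (synthetic data with arbitrarily chosen true coefficients), the closed-form fit and the evaluation metrics above in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

A = np.column_stack([np.ones(len(X)), X])      # design matrix, x_0 = 1
theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS: argmin ||A theta - y||^2

residuals = y - A @ theta
rmse = np.sqrt(np.mean(residuals**2))
r2 = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)
print(theta, rmse, r2)
# Residual analysis: plot residuals vs. fitted values and look for
# structure (nonlinearity) or funnel shapes (heteroscedasticity).
```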
Describe how a decision tree makes splits. Define both Gini impurity and information gain. How does tree depth influence bias and variance? What strategies exist to prevent overfitting in trees?
- Splitting: at each node, greedily evaluate candidate feature–threshold pairs, choose the one that maximizes the reduction in impurity, and recurse on the resulting child nodes.
- Gini impurity for node $t$: $G(t) = 1 - \sum_{c} p_c^2$, where $p_c$ is the fraction of samples at $t$ belonging to class $c$.
- Entropy and information gain: $H(t) = -\sum_{c} p_c \log_2 p_c$; the information gain of a split is $IG = H(\text{parent}) - \sum_{j} \frac{n_j}{n} H(\text{child}_j)$, i.e. the entropy reduction weighted by the fraction of samples $n_j/n$ sent to each child.
- Depth effect:
- Shallow trees → high bias, low variance
- Deep trees → low bias, high variance
- Prevent overfitting:
- Pruning (pre‑ or post‑)
- Max depth, min samples per leaf constraints
- Ensemble methods (bagging, boosting)
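The two impurity measures are short enough to write out directly; a sketch (function names are mine, not a library API):

```python
import numpy as np

def gini(labels):
    """Gini impurity G(t) = 1 - sum_c p_c^2 over the class fractions p_c."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def entropy(labels):
    """Entropy H(t) = -sum_c p_c log2(p_c)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

y = np.array([0, 0, 0, 1, 1, 1])
print(gini(y))                             # 0.5 for a 50/50 node
print(information_gain(y, y[:3], y[3:]))   # 1.0: a perfectly pure split
```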
Define overfitting and underfitting in supervised learning. How can you detect overfitting using training and validation errors? List and briefly describe three techniques to reduce overfitting.
- Underfitting: model too simple, high error on both train & validation.
- Overfitting: model too complex, low training error but high validation error.
- Detection: plot training vs. validation error as model complexity grows; a widening gap (low train, rising val) signals overfitting.
- Mitigation techniques:
- Regularization (L1/L2 penalties) to constrain weights
- Early stopping during iterative training
- Simplify model (reduce depth/features) or prune
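A quick sketch of the detection recipe above: sweep a complexity knob (tree depth here, on synthetic data) and compare the two error curves.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, 2, 4, 8, 16, None]:   # None = grow until pure (most complex)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train_err={1 - tree.score(X_tr, y_tr):.3f} "
          f"val_err={1 - tree.score(X_val, y_val):.3f}")
# Training error falling toward zero while validation error stalls or
# rises is the overfitting signature.
```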
Explain the purpose of splitting data into training, validation, and test sets. What are typical split ratios? When might you prefer a simple hold‑out split versus k‑fold cross‑validation?
- Training set: fit model parameters.
- Validation set: tune hyperparameters and detect overfitting.
- Test set: unbiased estimate of final performance.
- Typical ratios: 60/20/20, 70/15/15, or 80/10/10 (train/val/test).
- Hold‑out vs. CV:
- Hold‑out: fast, adequate when data is abundant.
- k‑fold CV: more reliable on limited data, reduces variance in performance estimate.
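One common way to produce a 60/20/20 split is two chained hold-out splits (a sketch; the ratios match the list above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 20% test first, then 25% of the remaining 80% as validation:
# 0.25 * 0.8 = 0.2, giving 60/20/20 overall.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```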
Describe k‑fold cross‑validation. How is the final performance metric computed? What are its advantages and disadvantages compared to a single hold‑out validation? How do you choose k?
- Procedure: partition data into k equal folds; for each fold i, train on k–1 folds, evaluate on fold i; repeat for all i.
- Final metric: average of the k fold scores (e.g. mean accuracy or mean RMSE).
- Advantages:
- More stable, low‑variance estimate
- Utilizes all data for training and validation
- Disadvantages:
- k× more training cost
- Misleading fold scores for classification if folds are not stratified; risk of data leakage if preprocessing (e.g. scaling) is fit on the full dataset before splitting.
- Choosing k: common values are 5 or 10; larger k gives lower bias but higher computational cost.
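A sketch of the procedure with scikit-learn (stratified folds, which address the classification caveat above; the model choice is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # k = 5
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)          # one score per fold
print(scores.mean())   # final metric: average of the k fold scores
```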
Differentiate between classification and regression tasks. Give two real‑world examples of each. What types of model outputs and evaluation metrics are appropriate for each?
- Classification: predict discrete labels.
- Examples: email spam detection; disease diagnosis (healthy vs. sick).
- Outputs: class labels or class probabilities.
- Metrics: accuracy, precision, recall, F1‑score, AUC.
- Regression: predict continuous values.
- Examples: house price estimation; temperature forecasting.
- Outputs: real‑valued predictions.
- Metrics: MSE, RMSE, MAE, R².
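A small sketch computing one metric of each kind on hand-made toy predictions:

```python
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, r2_score

# Classification: discrete labels.
y_true_cls = [0, 1, 1, 0, 1]
y_pred_cls = [0, 1, 0, 0, 1]
print(accuracy_score(y_true_cls, y_pred_cls))   # 0.8
print(f1_score(y_true_cls, y_pred_cls))         # 0.8

# Regression: continuous values.
y_true_reg = [2.5, 0.0, 2.0]
y_pred_reg = [3.0, -0.5, 2.0]
print(mean_squared_error(y_true_reg, y_pred_reg))
print(r2_score(y_true_reg, y_pred_reg))
```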