Studying Overfitting
A common signature of overfitting when using a machine learning model \(f\) with weights \(\mathbf{w}\in \mathbb{R}^d\) is that one or more of the weights \(w_j\) become very large in magnitude in the trained model. For this reason, many strategies for mitigating overfitting involve adding a penalty term to the loss function that encourages weights to be small.
But how do the weights get so large to begin with? In this warmup you’ll consider an extreme (but practically relevant) case in which we have fewer data points than features, i.e. \(n < d\).
Let \(R(\mathbf{w})\) be the mean-squared error for multivariate linear regression with data matrix \(\mathbf{X}\in \mathbb{R}^{n\times d}\), weight vector \(\mathbf{w}\in \mathbb{R}^d\), and target vector \(\mathbf{y}\in \mathbb{R}^n\):
\[ \begin{aligned} R(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 = \frac{1}{n} \left\lVert\mathbf{X}\mathbf{w} - \mathbf{y}\right\rVert^2\;. \end{aligned} \]
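As a quick sanity check, the two expressions for \(R(\mathbf{w})\) above can be compared numerically. The sketch below uses small, arbitrary sizes and random data (all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3  # illustrative sizes (assumed for this sketch)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Sum form: average of squared per-example residuals
R_sum = sum((w @ X[i] - y[i]) ** 2 for i in range(n)) / n

# Norm form: (1/n) * ||Xw - y||^2
R_norm = np.linalg.norm(X @ w - y) ** 2 / n

assert np.isclose(R_sum, R_norm)
```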
The equation \(\nabla R(\mathbf{w}) = \mathbf{0}\) is a necessary condition for \(\mathbf{w}\) to be a minimizer of \(R\), and is equivalent to the system of linear equations often called the normal equations:
\[ \begin{aligned} \mathbf{X}^\top \mathbf{X}\mathbf{w}= \mathbf{X}^\top \mathbf{y}\;. \end{aligned} \]
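To see the normal equations in action, one can solve them numerically and confirm that the gradient \(\nabla R(\mathbf{w}) = \frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y})\) vanishes at the solution. This sketch uses \(n > d\) (so that \(\mathbf{X}^\top\mathbf{X}\) is invertible for generic data) and random data, both assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 3  # n > d here, so X^T X is invertible for generic X (assumed)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Solve the normal equations X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient of R at w_hat is (2/n) X^T (X w_hat - y); it should vanish
grad = (2 / n) * X.T @ (X @ w_hat - y)
assert np.allclose(grad, 0)
```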
Part A
An extreme case in which overfitting often occurs is when we have fewer data points than features, i.e. \(n < d\). Do the normal equations have a unique solution?
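A numerical experiment can build intuition here. Since \(\operatorname{rank}(\mathbf{X}^\top\mathbf{X}) = \operatorname{rank}(\mathbf{X}) \le n\), the \(d \times d\) matrix \(\mathbf{X}^\top\mathbf{X}\) cannot have full rank when \(n < d\). A minimal sketch, with sizes and random data assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3, 5  # fewer data points than features: n < d (assumed sizes)
X = rng.normal(size=(n, d))

# The d x d matrix X^T X has rank(X^T X) = rank(X) <= n < d,
# so it is singular whenever n < d
G = X.T @ X
rank_G = np.linalg.matrix_rank(G)
assert rank_G <= n < d
```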
Part B
Still assuming that \(n < d\), suppose that we have found a solution \(\hat{\mathbf{w}}\) to the normal equations. Give an algorithm (recipe, approach) for finding another solution \(\hat{\mathbf{w}}'\) to the normal equations.
Part C
Still assuming that \(n < d\), give an algorithm (recipe, approach) for finding a solution \(\hat{\mathbf{w}}\) to the normal equations such that at least one entry of \(\hat{\mathbf{w}}\) is arbitrarily large in magnitude.