Studying Overfitting
Quick Recap: Expectation and Variance of a Random Variable
From a course like CS 200, you may recall that if \(Y\) is a (discrete) random variable with probability mass function \(p(y) = \mathbb{P}(Y = y)\), then the mean of \(Y\) is given by \(\mathbb{E}[Y] = \sum_y y p(y)\) and the variance of \(Y\) is given by
\[ \begin{aligned} \mathrm{var}(Y) = \mathbb{E}[(Y - \mathbb{E}[Y])^2] = \sum_{y} (y - \mu)^2 p(y) \;, \end{aligned} \]
where we’ve defined \(\mu = \mathbb{E}[Y]\) for convenience in the last formula.
(Everything we are going to say and prove in this problem also works for continuous random variables if we replace sums with integrals.)
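To make the recap concrete, here is a small numerical sketch (not part of the problem) computing the mean and variance of a fair six-sided die directly from the two formulas above:

```python
# A fair die: Y takes values 1..6, each with probability 1/6.
values = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6  # probability mass function p(y)

# E[Y] = sum over y of y * p(y)
mu = sum(y * py for y, py in zip(values, p))

# var(Y) = E[(Y - E[Y])^2] = sum over y of (y - mu)^2 * p(y)
var = sum((y - mu) ** 2 * py for y, py in zip(values, p))

print(mu, var)  # 3.5 and 35/12 ≈ 2.9167
```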
Problem Setup
Let \(Y\) be a (discrete) random variable with known mean \(\mathbb{E}[Y] = \mu\) and variance \(\mathrm{var}(Y) = \sigma^2\). We want to predict \(Y\) with a single number \(\hat{y}\).
We’re going to measure the success of our prediction \(\hat{y}\) with the squared error \((\hat{y} - Y)^2\). Since \(Y\) is a random variable, the squared error of our prediction \(\hat{y}\) is also a random variable. To get a single number that measures the quality of our prediction, we take the expectation of the squared error — the mean-squared error (MSE). This is called the risk of our prediction \(\hat{y}\), and is given by
\[ \begin{aligned} R(\hat{y}) = \mathbb{E}[(\hat{y} - Y)^2] = \sum_{y} (\hat{y} - y)^2 p(y) \;. \end{aligned} \]
Here, the sum ranges over all possible values of \(y\).
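As a sketch (reusing the fair-die example from above, which is not part of the problem), the risk of any candidate prediction can be computed directly from this sum:

```python
# A fair die: Y takes values 1..6, each with probability 1/6.
values = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6  # probability mass function p(y)

def risk(y_hat):
    """R(y_hat) = E[(y_hat - Y)^2] = sum over y of (y_hat - y)^2 * p(y)."""
    return sum((y_hat - y) ** 2 * py for y, py in zip(values, p))

# Predicting the mean (3.5) gives a smaller risk than predicting, say, 2.
print(risk(3.5), risk(2.0))
```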
Part A
Show that the risk of our prediction \(\hat{y}\) can be written as
\[ \begin{aligned} R(\hat{y}) = (\hat{y} - \mu)^2 + \sigma^2 \;. \end{aligned} \]
We can think of the first term as the squared bias of our prediction (its squared deviation from the mean \(\mu\) of \(Y\)) and the second term as the noise level in our target (the inherent variability of \(Y\), which no choice of \(\hat{y}\) can remove).
Part B
Prove that the risk \(R(\hat{y})\) is minimized when \(\hat{y} = \mu\). In other words, in the absence of other information about \(Y\), the best prediction we can possibly make (as measured by the expected MSE \(R\)) is to predict the mean \(\mu\) of \(Y\). What is the lowest possible value of the risk that we can achieve?
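You can sanity-check the claim numerically before proving it. The sketch below (using an asymmetric example pmf that is not given in the problem) grid-searches over candidate predictions and confirms that the risk is smallest at the mean:

```python
# An asymmetric example distribution (hypothetical, for illustration only).
values = [0, 1, 10]
p = [0.5, 0.3, 0.2]

mu = sum(y * py for y, py in zip(values, p))  # E[Y] = 2.3

def risk(y_hat):
    """R(y_hat) = E[(y_hat - Y)^2]."""
    return sum((y_hat - y) ** 2 * py for y, py in zip(values, p))

# Grid search over candidate predictions in [0, 10].
candidates = [i / 100 for i in range(0, 1001)]
best = min(candidates, key=risk)

print(best, mu)  # the risk-minimizing candidate lands on the mean, 2.3
```

The grid contains the mean exactly here, so the search recovers it; in general the search only gets within one grid step of \(\mu\).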
Note
You may be interested to know that other choices of the loss function (instead of MSE) lead to different optimal predictions. For example, if we measure the quality of our prediction \(\hat{y}\) with the mean absolute error \(\mathbb{E}[|\hat{y} - Y|]\), then the optimal prediction is the median of \(Y\) instead of the mean.
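A quick numerical sketch of this claim (with an assumed, deliberately skewed pmf): under mean absolute error, a grid search over predictions lands on the median rather than the mean.

```python
# A skewed example distribution (hypothetical, for illustration only).
# Its median is 1 (the CDF first reaches 0.5 there), but its mean is 20.4.
values = [0, 1, 100]
p = [0.4, 0.4, 0.2]

def mae_risk(y_hat):
    """E[|y_hat - Y|], the expected absolute error of predicting y_hat."""
    return sum(abs(y_hat - y) * py for y, py in zip(values, p))

candidates = [i / 100 for i in range(0, 10001)]
best = min(candidates, key=mae_risk)
mean = sum(y * py for y, py in zip(values, p))

print(best, mean)  # best is the median (1.0), far from the mean (20.4)
```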